Protein Engineering, Vol. 14, No. 7, 459-463, July 2001
© 2001 Oxford University Press
A numerical measure of amino acid residues similarity based on the analysis of their surroundings in natural protein sequences
Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Miklukho-Maklaya St. 16/10, GSP-7, Moscow 117997, Russia
| Abstract |
|---|
A measure of similarity between amino acid residues is proposed, based on analysis of the surroundings of each residue in the primary structures of native proteins. The statistical data were obtained from 168,808 protein sequences comprising the Protein Information Resource (PIR) database (release 63). Using various threshold values of the proposed measure, amino acid residues were classified into several groups. The resulting classification differs substantially from previously used groupings. The numerical measure of amino acid residue similarity can be used in site-directed mutagenesis studies to predict the probability of local spatial rearrangements in proteins.
Keywords: amino acid classification/database/protein sequences
| Introduction |
|---|
Nowadays, one of the main approaches in protein research is modification of proteins by genetic methods. The key element in these studies is the production of new recombinant proteins and their modification aimed at alteration of their biological activity. The method of point mutations is most frequently used for this purpose (Hurley et al., 1992
We believe it would be more correct to introduce a continuous numerical criterion based on the analysis of residue surroundings in protein primary structures. Revealing the correlation between the physicochemical properties of residues and their surroundings in the sequence is not only important for protein engineering, but could also be used for deducing protein structure from sequence. We consider that, in most cases, similarity of the surroundings of residues reflects similar structural features of their local architecture. This should become apparent in a large set of sequences, where the individual traits of protein families are averaged out. Accordingly, substitution of a given residue by another with similar surroundings is likely to preserve the local spatial architecture. Thus, a numerical measure of the similarity of the surroundings of amino acid residues in protein primary structures would provide both a classification of residues that differs from the one currently used and a numerical criterion of the influence of amino acid substitutions on local structure. This paper attempts to introduce such a criterion.
| Materials and methods |
|---|
Data
The statistical data were obtained using 168,808 native protein sequences included in the Protein Information Resource (PIR) database (release 63). The total number of amino acid residues in the sequences considered was 58,112,946. All available protein sequences were used as the primary data set, without any preliminary selection. Such an approach averages out the specific features of individual primary structures and reveals regularities intrinsic to all native amino acid sequences.
Analysis of the PIR database
The study included the following stages: (i) reconstruction of averaged surroundings in protein sequences for each of 20 amino acid residues; (ii) determination of the characteristic length of a sequence segment with the most pronounced mutual influence of amino acid residues; and (iii) comparison of the amino acid residue surroundings in primary structures of native proteins.
First, the total number of all 400 pairs of amino acid residues separated by i peptide bonds [N(i)] was calculated. For each XZ pair, the values of its absolute NXZ(i) and relative cXZ(i) content in the database were determined:
$$c_{XZ}(i) = \frac{N_{XZ}(i)}{N(i)} \qquad (1)$$
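The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' C++ program: the function names `pair_counts` and `relative_content` are hypothetical, and skipping non-standard one-letter symbols is an assumption of this sketch.

```python
from collections import Counter

# Standard one-letter codes; any other symbol is skipped (an assumption of this sketch).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def pair_counts(sequences, i):
    """Count ordered pairs (X, Z) with Z located i peptide bonds from X.

    Positive i: Z is on the C-terminal side of X; negative i: N-terminal side.
    """
    if i == 0:
        raise ValueError("i must be a non-zero number of peptide bonds")
    counts = Counter()
    for seq in sequences:
        for p, x in enumerate(seq):
            q = p + i  # position of the partner residue Z
            if 0 <= q < len(seq) and x in AMINO_ACIDS and seq[q] in AMINO_ACIDS:
                counts[(x, seq[q])] += 1
    return counts


def relative_content(counts):
    """Divide each pair count by the total number of pairs at this separation."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}
```

For example, `relative_content(pair_counts(["ACA", "AC"], 1))` gives the share of each pair among all pairs of adjacent residues in the two toy sequences.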
Let us consider the distribution of relative content of residue Z in the neighbourhood of residue X separated by 1 to n peptide bonds. Let i be positive if Z is closer to the C-terminus of polypeptide chain than X, and negative if otherwise. At the first stage, the value n was set to 55. In the given neighbourhood the average relative content of Z equals
$$\bar{c}_{XZ} = \frac{1}{2n} \sum_{\substack{i=-n \\ i \neq 0}}^{n} c_{XZ}(i) \qquad (2)$$
Let us consider the function dXZ(i), which represents the normalised deviation of the relative content of XZ pairs separated by i peptide bonds, from the average:
$$d_{XZ}(i) = \frac{c_{XZ}(i) - \bar{c}_{XZ}}{\bar{c}_{XZ}} \qquad (3)$$
This function can be interpreted as a distribution of relative content of Z (hereafter referred to as the distributed residue) in the neighbourhood of X (hereafter referred to as the central residue).
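As a rough illustration of the averaging and normalisation just described, the sketch below takes the relative contents of a single XZ pair at each signed offset and returns the deviations from their mean. Dividing by the mean is an assumption about what "normalised" means here, and the name `normalised_deviation` is hypothetical.

```python
def normalised_deviation(c_by_offset):
    """Return d(i) = (c(i) - mean) / mean for one XZ pair.

    c_by_offset maps a signed, non-zero offset i (in peptide bonds) to the
    relative content c_XZ(i). Normalising by the mean is an assumption.
    """
    offsets = [i for i in c_by_offset if i != 0]
    mean = sum(c_by_offset[i] for i in offsets) / len(offsets)
    return {i: (c_by_offset[i] - mean) / mean for i in offsets}
```

With this convention, a value of 0 means the pair occurs at that separation exactly as often as it does on average across the neighbourhood, and positive values mark enrichment.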
To evaluate the characteristic size of a sequence fragment, within which the pronounced difference of the content of pairs of amino acid residues from average values is observed, the value of the root mean square deviation s(i) from 0 in a sample of 400 dXZ values for all pairs of residues was used. Its distribution against i is as follows:
$$s(i) = \sqrt{\frac{1}{400} \sum_{X} \sum_{Z} d_{XZ}(i)^{2}} \qquad (4)$$
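The root mean square deviation from zero is straightforward to compute once the deviations for all pairs at a given separation are available. A minimal sketch follows; the function name `rms_deviation` is hypothetical, and in the study the sample would hold the 400 values for one value of i.

```python
import math


def rms_deviation(d_values):
    """Root mean square deviation from zero of the d_XZ(i) values at one offset i."""
    values = list(d_values)
    return math.sqrt(sum(v * v for v in values) / len(values))
```

Evaluating this for each i from 1 to n gives the curve used to judge how far along the chain the mutual influence of residues extends.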
The numerical measure of residues surroundings similarity was determined as follows. Let dX1Zk and dX2Zk be the known distributions of residues Zk in the neighbourhood of the residues X1 and X2, k = 1, 2, ... 20. Then, the sum of distances between vectors dX1Zk and dX2Zk is calculated as follows:
![]() | (5) |
The mean distance between vectors dXZ for all possible pairs of residues is calculated as follows:
![]() | (6) |
The following value has been introduced as a measure of similarity of environments of residues X1 and X2:
![]() | (7) |
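The exact forms of Equations (5)-(7) are not spelled out in the surrounding text, so the following Python sketch rests on explicit assumptions: the distance between two distribution vectors is taken to be Euclidean, the mean distance is taken over all unordered pairs of distinct residues, and the similarity is taken as m = 1 - D / (mean D), which is one form consistent with m approaching 1 for identical surroundings. All function names are hypothetical.

```python
import math
from itertools import combinations


def summed_distance(d, x1, x2):
    """Sum, over distributed residues Z, of the distance between d[x1][Z] and d[x2][Z].

    d[X][Z] is a sequence of d_XZ(i) values over the chosen offsets.
    Euclidean distance is an assumption of this sketch.
    """
    return sum(math.dist(d[x1][z], d[x2][z]) for z in d[x1])


def similarity_table(d):
    """Return an assumed similarity m = 1 - D / mean(D) for every unordered pair."""
    residues = sorted(d)
    dist = {pair: summed_distance(d, *pair) for pair in combinations(residues, 2)}
    mean = sum(dist.values()) / len(dist)
    return {pair: 1 - value / mean for pair, value in dist.items()}
```

Under these assumptions, pairs whose surroundings are closer than average receive positive m, and pairs that are further apart than average receive negative m.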
Equipment and software
The calculations were made using the original software written in C++ for an IBM PC-compatible computer.
| Results and discussion |
|---|
The characteristic diagrams of distribution of dXZ against i are plotted in Figure 1.
It should be noted that when the distributed and central residues are identical, the diagrams of distribution are symmetric. The form of the branches on the diagrams is close to exponential (Figures 1a and b
The diagram of s(i) is presented in Figure 2. The result demonstrates that the most pronounced mutual influence of the residues is observed when the number of peptide bonds between them does not exceed 20. It should be noted that for a number of pairs (AA, RR and others) the mutual influence remains significant even at distances exceeding 50 peptide bonds between the residues. According to previously reported data (Cserzo and Simon, 1989), the maximal distance of mutual influence was determined to be about nine peptide bonds. It is noteworthy in this context that a local minimum in the distribution of the root mean square deviation was observed at i = 9 (Figure 2). It probably reflects a certain level of protein spatial organisation. The presence of this minimum probably led those authors (Cserzo and Simon, 1989) to conclude that interactions within short segments of the polypeptide chain play the primary role in the formation of the spatial structure of proteins.
The data suggest that the mutual influence of amino acid residues is not limited to the nearest neighbours, but extends across significant distances in a polypeptide chain. Therefore we used an interval from 1 to 20 peptide bonds for the comparison of surroundings of amino acid residues.
Values of m for all 400 possible pairs of residues were calculated according to Equations (2), (3), (5), (6) and (7) with n = 20. The corresponding numbers are shown in Table I. According to Equation (7), m is nearer to 1 for those residues whose surroundings display a higher degree of similarity. An increase in the dissimilarity of the residues' surroundings corresponds to a decrease in m.
As was noted in the Introduction, the proposed measure of similarity of surroundings of the amino acid residues in primary structures of native proteins is a continuous numerical criterion. However, using various threshold values of m, it is possible to allocate groups of residues with an appropriate level of similarity of surroundings. The conformity of these groups of residues to earlier classifications is of particular interest. We have introduced the following threshold values of m: 0.4, 0.3, 0.2 and 0.1.
With the threshold value m = 0.4, only the string VIL and the ST pair can be detected among all residues. According to the earlier classification (Taylor, 1986), the first three residues comprise a group of non-polar residues with aliphatic side chains. These residues are also grouped together in the classifications based on the analysis of amino acid substitutions (Bordo and Argos, 1991; Murphy et al., 2000). The main feature of these residues is a high degree of hydrophobicity.
The high degree of similarity of the surroundings of residues S and T can be accounted for by the likeness of the structure and properties of their side chains: small size and the ability to form hydrogen bonds are common to both residues. The similarity between the side chains of S and T has been noted in all classifications (Taylor, 1986; Bordo and Argos, 1991; Johnson and Overington, 1993; Topham et al., 1997; Murphy et al., 2000). However, these residues have never formed the centre of a separate group.
When the threshold value of m is reduced to 0.3, the group of hydrophobic residues incorporates F and Y. It should be noted that, according to the data obtained, F appears to be closer to I (m = 0.380) than to Y (m = 0.352). Since the only difference between the chemical structures of F and Y is the presence of the hydroxyl group, it appears that the influence of this group is what causes the differences in the surroundings of these residues in primary structures. It is noteworthy that aromatic amino acid residues differ considerably from each other in their surroundings and cannot be allocated to a separate group. Nor can they all be included in the group of hydrophobic residues. The clear differentiation of hydrophobic residues from the others is in good agreement with suggestions about the leading role of hydrophobic interactions in the folding of a polypeptide chain (Pace, 1992; Rose and Wolfenden, 1993).
With the threshold value m = 0.3, the string KRQ (and E with threshold value m = 0.2) and the DN pair emerge. The common property of both groups of residues is the ability to form hydrogen bonds. The major factor causing separation of these residues into two different groups is the size of the side chain. In this case low similarities between surroundings of D and E (m = 0.109), and N and Q (m = 0.161) are of particular interest. This could be accounted for by the major role of side chain size rather than similarity of side chain functionalities in the folding process.
Reduction of the threshold value m to 0.2 leads to the emergence of residues A, V, D and E in the nearest neighbourhood of residues S and T. All above-mentioned residues were assigned to the group of so-called `residues with a small size of side chains' in earlier classifications. Cysteine residue (C) was included in the same group. However, on the basis of present data, C has a unique environment and cannot be included in any group.
The influence of the β-methyl group upon the similarity of amino acid residue surroundings is illustrated by residues S and T. The presence of this group results in a higher degree of similarity in surroundings for the TV pair (m = 0.247) than for SV (m = 0.129). Thus, T takes an intermediate position between highly hydrophilic S and highly hydrophobic V. Similar differences are observed for pairs SA and TA, SI and TI, and others. It should be noted that the presence of the β-methyl group in V, I and T does not result in their allocation to a separate group.
With the threshold value m = 0.1, S and T have the greatest numbers of neighbours on the diagram (9 and 7, respectively). This set includes P, G and H, the surroundings of which have the least degrees of similarity with the surroundings of other residues. This fact suggests that S and T may substitute for most residues in protein molecules with minimal effect upon local 3D structure.
Finally, there are a number of amino acid residues (M, C, P, G, H and W) whose surroundings have the least degree of similarity with the surroundings of other residues. Their uniqueness reflects the special role of these residues in the formation of protein structure. Thus, M is the leading (first) residue in almost all native polypeptides. Residues P and G have allowed regions for the backbone torsion angles that differ essentially from those of other residues, because of the unique organisation of the proline side chain and the absence of a side chain in glycine. Cysteine residues can form covalent bonds with distant segments of the polypeptide chain. The tryptophan residue has the largest side chain, so its arrangement imposes specific requirements on the nearest neighbourhood. The side chain of H can participate in proton relay: this residue is frequently present in the catalytic sites of enzymes. Substitution of any of these residues by any other is likely to disturb the local 3D structure of a protein.
With the threshold value m < 0.1, an overlap between groups of the residues is observed. Accordingly, consideration of lower levels of similarity of the surroundings is inexpedient.
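The threshold analysis discussed above can be sketched in Python using only the six m values quoted in this section. Table I itself is not reproduced here, so the dictionary is deliberately incomplete. Treating a pair as similar when m is greater than or equal to the threshold is an assumption, and the names `REPORTED_M`, `pairs_at_threshold` and `neighbours` are hypothetical.

```python
# m values quoted in the text; the full Table I is not reproduced here.
REPORTED_M = {
    ("F", "I"): 0.380,
    ("F", "Y"): 0.352,
    ("T", "V"): 0.247,
    ("N", "Q"): 0.161,
    ("S", "V"): 0.129,
    ("D", "E"): 0.109,
}


def pairs_at_threshold(m_values, threshold):
    """Pairs whose similarity m reaches the threshold (>= is an assumption)."""
    return sorted(pair for pair, m in m_values.items() if m >= threshold)


def neighbours(m_values, residue, threshold):
    """Residues paired with `residue` at or above the threshold."""
    found = set()
    for (a, b), m in m_values.items():
        if m >= threshold and residue in (a, b):
            found.add(b if a == residue else a)
    return sorted(found)
```

Lowering the threshold from 0.3 to 0.2 adds the TV pair to the FI and FY pairs in this partial table, mirroring how groups grow as the threshold is relaxed.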
In this study, a universal numerical measure of amino acid residues similarity based on the analysis of similarities of their surroundings in native protein sequences is elaborated. The classification of residues, based on this criterion, reveals essential differences from earlier classifications.
Similarity of chemical structure of side chains, such as aromaticity or presence of identical functional groups, has been demonstrated to be insufficient for allocation of the residues into groups, whereas the size of side chain can be foundational for such classification.
The clear differentiation of hydrophobic residues from the others shows that hydrophobicity is the most important parameter of amino acid residues influencing the formation of the 3D structure of a protein.
Six amino acid residues with unique surroundings are revealed. The substitution of any of them by any other residue is highly likely to result in principal changes in the local 3D organisation of a protein molecule.
The obtained results suggest that the criterion elaborated reflects structural features of amino acid residues. Thus, the proposed criterion as well as data about the environment of residues can be applied to evaluation of influence of amino acid substitutions on a 3D structure of proteins in studies utilising site-directed mutagenesis.
| Notes |
|---|
1 To whom correspondence should be addressed. E-mail: alexei_nekrasov{at}mail.ru
| Acknowledgments |
|---|
The authors are grateful to Ivan Yudushkin for his help in preparation of this manuscript.
| References |
|---|
Bordo,D. and Argos,P. (1991) J. Mol. Biol., 217, 721–729.
Cserzo,M. and Simon,I. (1989) Int. J. Pept. Protein Res., 34, 184–195.
Hurley,J.H., Baase,W.A. and Matthews,B.W. (1992) J. Mol. Biol., 224, 1143–1159.
Janin,J. (1979) Nature, 277, 491–492.
Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716–738.
Kyte,J. and Doolittle,R. (1982) J. Mol. Biol., 157, 105–132.
Lim,W.A., Farruggio,D.C. and Sauer,R.T. (1992) Biochemistry, 31, 4324–4333.
Murphy,L.R., Wallqvist,A. and Levy,R.M. (2000) Protein Eng., 13, 149–152.
Pace,C.N. (1992) J. Mol. Biol., 226, 29–35.
Poroykov,V.V., Esipova,N.G. and Tumanyan,V.G. (1976) Mol. Biophys. (Moscow), 21, 397–400 (in Russian).
Rose,G., Geselowitz,A., Lesser,G., Lee,R. and Zehfus,M. (1985) Science, 229, 834–838.
Rose,G.D. and Wolfenden,R. (1993) Annu. Rev. Biophys. Biomol. Struct., 22, 381–409.
Taylor,W.R. (1986) J. Mol. Biol., 188, 233–258.
Topham,C.M., Srinivasan,N. and Blundell,T.L. (1997) Protein Eng., 10, 7–21.
Wolfenden,R., Andersson,L., Cullis,P. and Southgate,C. (1981) Biochemistry, 20, 849–855.
Zhang,X.-J., Baase,W.A. and Matthews,B.W. (1992) Protein Sci., 1, 761–776.
Received August 21, 2000; revised February 23, 2001; accepted March 12, 2001.