
Protein Engineering, Vol. 14, No. 7, 459-463, July 2001
© 2001 Oxford University Press

A numerical measure of amino acid residues similarity based on the analysis of their surroundings in natural protein sequences

Sergey I. Rogov and Alexei N. Nekrasov,1

Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Miklukho-Maklaya St. 16/10, GSP-7, Moscow 117997, Russia


    Abstract
A measure of similarity between amino acid residues based on the analysis of the surroundings of each residue in primary structures of native proteins is proposed. The statistical data used for this purpose were obtained from the analysis of 168,808 protein sequences, which comprise the Protein Identification Research database (release 63). Using various threshold values of the proposed measure, amino acid residues were classified into several groups. The resulting classification differs essentially from groupings used previously. The numerical measure of amino acid residue similarity can be used in site-directed mutagenesis studies to predict the probability of local spatial rearrangements in proteins.

Keywords: amino acid classification/database/protein sequences


    Introduction
One of the main current approaches in protein research is the modification of proteins by genetic methods. The key element of these studies is the production of new recombinant proteins and their modification in order to alter their biological activity. Point mutations are most frequently used for this purpose (Hurley et al., 1992; Lim et al., 1992; Zhang et al., 1992). It is often crucial to keep the structure of the modified protein close to its native conformation while altering its substrate specificity or its affinity for regulatory factors. As a rule, experimental confirmation that the three-dimensional structures of the native and recombinant proteins are equivalent is time-consuming. Hence the need arises to predict the influence of amino acid substitutions on protein structure. Various classifications of amino acid residues are now used to solve this problem (Johnson and Overington, 1993; Taylor, 1986; Bordo and Argos, 1991; Topham et al., 1997; Murphy et al., 2000). The approaches fall into two major types: the first is based on measurement or evaluation of various physical-chemical properties of amino acid residues (Taylor, 1986); the second on the analysis of amino acid substitutions in families of evolutionarily related proteins (Bordo and Argos, 1991; Topham et al., 1997; Murphy et al., 2000). In our opinion, both approaches suffer from inherent drawbacks. The first involves a certain degree of arbitrariness in the selection of the physical-chemical properties of the residues and of the methods used to determine them. Thus, various authors (Janin, 1979; Wolfenden et al., 1981; Kyte and Doolittle, 1982; Rose et al., 1985) report residue hydrophobicity classifications that differ considerably from each other.
The main disadvantage of methods based on comparison of the frequency of the amino acid substitutions is that the probability of substitution of a given residue depends on its role in protein structure or function. Since various families of proteins have different folds, the probability of substitution of a given residue for any other will vary for different families. Thus, classifications of the residues, based on such an approach, will depend on what family of proteins was analysed.

We believe it would be more correct to introduce a continuous numerical criterion based on the analysis of residue surroundings in protein primary structures. Revealing the correlation between the physical-chemical properties of the residues and their surroundings in the sequence is not only important for protein engineering, but could also be used for deducing protein structure from sequence. We consider that, in most cases, similarity of residue surroundings reflects similar structural features of their local architecture. This should be apparent for a large set of sequences, in which the individual traits of the protein families are averaged out. Accordingly, substitution of a given residue by another with similar surroundings is likely to preserve the local spatial architecture. Thus, a numerical measure of the similarity of the surroundings of amino acid residues in primary structures of proteins would provide both a classification of residues that differs from those currently used and a numerical criterion of the influence of amino acid substitutions on the local structure. In this paper an attempt to introduce such a criterion is undertaken.


    Materials and methods
Data

The statistical data were obtained using 168,808 native protein sequences included in the Protein Identification Research (PIR) database (release 63). The total number of amino acid residues in the sequences considered was 58,112,946. All available protein sequences without any preliminary selection were used as a primary data set. Such an approach allows one to eliminate specific features of individual primary structures and to reveal regularities, intrinsic to all native amino acid sequences.

Analysis of the PIR database

The study included the following stages: (i) reconstruction of averaged surroundings in protein sequences for each of 20 amino acid residues; (ii) determination of the characteristic length of a sequence segment with the most pronounced mutual influence of amino acid residues; and (iii) comparison of the amino acid residue surroundings in primary structures of native proteins.

First, the total number of all 400 pairs of amino acid residues separated by i peptide bonds [N(i)] was calculated. For each X–Z pair, the values of its absolute NXZ(i) and relative cXZ(i) content in the database were determined:

(1)
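The counting step behind Equation (1) can be sketched as follows. The equation itself is not reproduced in this copy, so the sketch assumes that the relative content is the pair count divided by the total number of pairs at the same separation, c_XZ(i) = N_XZ(i) / N(i). The function names `pair_counts` and `relative_content` and the toy sequences are illustrative only and are not taken from the authors' C++ software.

```python
from collections import Counter


def pair_counts(sequences, i):
    """Count ordered pairs (X, Z) in which Z lies i peptide bonds
    towards the C-terminus from X (i >= 1)."""
    counts = Counter()
    for seq in sequences:
        for pos in range(len(seq) - i):
            counts[(seq[pos], seq[pos + i])] += 1
    return counts


def relative_content(counts):
    """Assumed form of Equation (1): c_XZ(i) = N_XZ(i) / N(i)."""
    total = sum(counts.values())  # N(i), all pairs at this separation
    return {pair: n / total for pair, n in counts.items()}


# Toy data, not PIR sequences.
toy = ["ACDA", "AAC"]
n1 = pair_counts(toy, 1)
c1 = relative_content(n1)
```

Under this convention, counts for a negative separation follow from the positive one by swapping the roles of the two residues, since N_XZ(-i) = N_ZX(i).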

Let us consider the distribution of relative content of residue Z in the neighbourhood of residue X separated by 1 to n peptide bonds. Let i be positive if Z is closer to the C-terminus of polypeptide chain than X, and negative if otherwise. At the first stage, the value n was set to 55. In the given neighbourhood the average relative content of Z equals

(2)

Let us consider the function dXZ(i), which represents the normalised deviation of the relative content of X–Z pairs separated by i peptide bonds, from the average:

(3)
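One plausible reading of Equations (2) and (3), which are not reproduced in this copy, is that the average is taken over all 2n signed separations (i = ±1, ..., ±n) and that the deviation is normalised by that average. The sketch below works under that assumption; the function name `normalised_deviation` and the numbers are hypothetical.

```python
def normalised_deviation(c_by_offset):
    """Map each signed separation i to an assumed d_XZ(i).

    c_by_offset holds c_XZ(i) for one X-Z pair, keyed by i (i != 0).
    Assumptions: Equation (2) is the plain mean over all keys, and
    Equation (3) is (c - mean) / mean. The mean must be non-zero.
    """
    mean_c = sum(c_by_offset.values()) / len(c_by_offset)
    return {i: (c - mean_c) / mean_c for i, c in c_by_offset.items()}


# Toy relative contents for n = 2; not values from the PIR database.
c = {-2: 0.002, -1: 0.004, 1: 0.004, 2: 0.002}
d = normalised_deviation(c)
```

With this normalisation the values of d sum to zero over the neighbourhood, and a positive d marks a separation at which the pair is over-represented relative to its local average.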

This function can be interpreted as a distribution of relative content of Z (hereafter referred to as the distributed residue) in the neighbourhood of X (hereafter referred to as the central residue).

To evaluate the characteristic size of a sequence fragment, within which the pronounced difference of the content of pairs of amino acid residues from average values is observed, the value of the root mean square deviation s(i) from 0 in a sample of 400 dXZ values for all pairs of residues was used. Its distribution against i is as follows:

(4)
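The description above (root mean square deviation from 0 over the 400 d_XZ values at a given i) suggests a computation along the following lines. The function name `rms_deviation` and the example values are illustrative, and the exact form of Equation (4) is assumed rather than taken from the paper.

```python
import math


def rms_deviation(d_values):
    """Root mean square deviation from 0 of the d_XZ(i) values
    collected at one separation i (400 values for 20 x 20 pairs)."""
    return math.sqrt(sum(d * d for d in d_values) / len(d_values))


# Two toy values instead of 400.
s_toy = rms_deviation([3.0, -4.0])
```

Evaluating this for each i from 1 to n would give a curve of the kind described for s(i).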

The numerical measure of residues surroundings similarity was determined as follows. Let dX1Zk and dX2Zk be the known distributions of residues Zk in the neighbourhood of the residues X1 and X2, k = 1, 2, ... 20. Then, the sum of distances between vectors dX1Zk and dX2Zk is calculated as follows:

(5)

The mean distance between vectors dXZ for all possible pairs of residues is calculated as follows:

(6)

The following value has been introduced as a measure of similarity of environments of residues X1 and X2:

(7)
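Equations (5) to (7) are not reproduced in this copy, so the following is only a sketch of how such a measure could be computed. It assumes that the distance between two d vectors is Euclidean, that the mean in Equation (6) runs over pairs of distinct central residues, and that m = 1 - D / (mean D), which would make m approach 1 as the surroundings become more similar. All names and the toy profiles are hypothetical.

```python
import math
from itertools import combinations


def profile_distance(profile_a, profile_b):
    """Sum, over distributed residues Z, of the Euclidean distance
    between the d vectors of two central residues (assumed Eq. 5)."""
    return sum(math.dist(profile_a[z], profile_b[z]) for z in profile_a)


def similarity(profiles):
    """Assumed Equations (6) and (7): m = 1 - D / mean(D)."""
    pairs = list(combinations(sorted(profiles), 2))
    dist = {p: profile_distance(profiles[p[0]], profiles[p[1]]) for p in pairs}
    mean_dist = sum(dist.values()) / len(dist)
    return {p: 1.0 - d / mean_dist for p, d in dist.items()}


# Toy profiles: three central residues, one distributed residue "Z",
# and d vectors over two separations only.
profiles = {
    "A": {"Z": [0.0, 0.0]},
    "B": {"Z": [3.0, 4.0]},
    "C": {"Z": [0.0, 0.0]},
}
m = similarity(profiles)
```

In this toy case A and C have identical surroundings and obtain m = 1, while pairs involving B lie further apart than average and obtain a negative m.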

Equipment and software

The calculations were made using the original software written in C++ for an IBM PC-compatible computer.


    Results and discussion
The characteristic diagrams of distribution of dXZ against i are plotted in Figure 1. The common feature of these distributions irrespective of any given X–Z pair is the decrease in variance of the relative content of the distributed residue with the increase of the number of peptide bonds between the residues.



Fig. 1. Distribution of the relative content d of one amino acid residue in the neighbourhood of another against the number of peptide bonds between them (i). (a) A in the neighbourhood of A; (b) R in the neighbourhood of R; (c) P in the neighbourhood of D.

 
It should be noted that when the distributed and central residues are identical, the diagrams of distribution are symmetric. The form of the branches on the diagrams is close to exponential (Figures 1a and b). These results are in good agreement with the previously reported data (Poroykov et al., 1976) on the increased probability of identical residues being grouped together in a polypeptide chain. When the central and distributed residues are different, the form of the diagram may differ significantly from exponential (Figure 1c).

The diagram of s(i) is presented in Figure 2. The result demonstrates that the most pronounced mutual influence of the residues is observed when the number of peptide bonds between them does not exceed 20. It should be noted that for a number of pairs (A–A, R–R and others) the mutual influence remains significant even at distances exceeding 50 peptide bonds between the residues. According to the data previously reported (Cserzo and Simon, 1989), the maximal distance of mutual influence was determined to be about nine peptide bonds. It is noteworthy in this context that a local minimum in the distribution of the root mean square deviation was observed at i = 9 (Figure 2). It probably reflects a certain level of protein spatial organisation. The presence of this minimum has probably led the authors (Cserzo and Simon, 1989) to conclude that interactions within short segments of the polypeptide chain play the primary role in the formation of the spatial structure of proteins.



Fig. 2. Dependence of s(i) (Equation 4) on the number of peptide bonds i between residues.

 
The data suggest that the mutual influence of amino acid residues is not limited to the nearest neighbours, but extends across significant distances in a polypeptide chain. Therefore we used an interval from 1 to 20 peptide bonds for the comparison of surroundings of amino acid residues.

Values of m for all 400 possible pairs of residues were calculated according to Equations (2), (3), (5), (6) and (7) with n = 20. The corresponding numbers are shown in Table I. According to Equation (7), the values of m are nearer to 1 for those residues whose surroundings display a higher degree of similarity. An increase in dissimilarity of the residue surroundings corresponds to a decrease in m.


Table I. Values of similarity of the amino acid residues surroundings (m)
 
As was noted in the Introduction, the proposed measure of similarity of surroundings of the amino acid residues in primary structures of native proteins is a continuous numerical criterion. However, using various threshold values of m, it is possible to allocate groups of residues with an appropriate level of similarity of surroundings. The conformity of these groups of residues to earlier classifications is of particular interest. We have introduced the following threshold values of m: 0.4, 0.3, 0.2 and 0.1.
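The thresholding step can be illustrated with a short sketch. It uses only the six m values quoted in the text of this section, not the full Table I, and it assumes that residues are merged into a group whenever any pair linking them reaches the threshold (single linkage). That grouping rule and the function name `groups_at_threshold` are assumptions made for illustration, not the authors' stated procedure.

```python
def groups_at_threshold(similarity, threshold):
    """Merge residues linked by any pair with m >= threshold
    (single-linkage grouping, an assumed rule)."""
    groups = []
    for (a, b), m in similarity.items():
        if m < threshold:
            continue
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return sorted(sorted(g) for g in groups)


# m values quoted in the Results and discussion text.
quoted = {
    ("F", "I"): 0.380,
    ("F", "Y"): 0.352,
    ("D", "E"): 0.109,
    ("N", "Q"): 0.161,
    ("T", "V"): 0.247,
    ("S", "V"): 0.129,
}

at_03 = groups_at_threshold(quoted, 0.3)
at_01 = groups_at_threshold(quoted, 0.1)
```

Because only a handful of pairs are included, the groups obtained here are fragments of the full classification; lowering the threshold adds links and merges groups, as in the stepwise description of the thresholds.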

With the threshold value m = 0.4 only the string V–I–L and the S–T pair can be detected among all residues. According to the earlier classification (Taylor, 1986), the first three residues comprise a group of non-polar residues with aliphatic side chains. Also, these residues are grouped together in the classifications based on the analysis of amino acid substitutions (Bordo and Argos, 1991; Murphy et al., 2000). The main feature of these residues is the high degree of hydrophobicity.

The high degree of similarity of the surroundings of residues S and T can be accounted for by the likeness of the structure and properties of their side chains: the small size and the ability to form hydrogen bonds are common to both residues. The similarity between the side chains of S and T has been noted in all classifications (Taylor, 1986; Bordo and Argos, 1991; Johnson and Overington, 1993; Topham et al., 1997; Murphy et al., 2000). However, these residues have never formed the centre of a separate group.

With the reduction of the m threshold value to 0.3, the group of hydrophobic residues incorporates F and Y. It should be noted that, according to the data obtained, F appears to be closer to I (m = 0.380) than to Y (m = 0.352). Since the only difference between the chemical structures of F and Y is the presence of the hydroxyl group, it is evidently the influence of this group that results in the differences in the surroundings of these residues in primary structures. It is noteworthy that aromatic amino acid residues differ considerably from each other in their surroundings and cannot be allocated into a separate group. Also, they cannot be totally included in the group of hydrophobic residues. The clear differentiation of hydrophobic residues from others is in good agreement with the suggestions about a leading role of hydrophobic interactions in the folding of a polypeptide chain (Pace, 1992; Rose and Wolfenden, 1993).

With the threshold value m = 0.3, the string K–R–Q (and E with threshold value m = 0.2) and the D–N pair emerge. The common property of both groups of residues is the ability to form hydrogen bonds. The major factor causing separation of these residues into two different groups is the size of the side chain. In this case low similarities between surroundings of D and E (m = 0.109), and N and Q (m = 0.161) are of particular interest. This could be accounted for by the major role of side chain size rather than similarity of side chain functionalities in the folding process.

Reduction of the threshold value m to 0.2 leads to the emergence of residues A, V, D and E in the nearest neighbourhood of residues S and T. All above-mentioned residues were assigned to the group of so-called 'residues with a small size of side chains' in earlier classifications. The cysteine residue (C) was included in the same group. However, on the basis of the present data, C has a unique environment and cannot be included in any group.

The influence of the β-methyl group upon the value of similarity of amino acid residue surroundings is revealed by the example of residues S and T. The presence of this group results in a higher degree of similarity in surroundings for the T–V pair (m = 0.247) as compared to S–V (m = 0.129). Thus, T takes an intermediate position between highly hydrophilic S and highly hydrophobic V. Similar differences are observed for pairs S–A and T–A, S–I and T–I, and others. It should be noted that the presence of the β-methyl group in V, I and T does not result in their allocation into a separate group.

With the threshold value m = 0.1, S and T have the greatest numbers of neighbours on the diagram (9 and 7, respectively). These neighbours include P, G and H, the surroundings of which have the least degrees of similarity with the surroundings of other residues. This fact suggests that S and T may substitute for most residues in protein molecules with minimal effect upon local 3D structure.

Finally, there are a number of amino acid residues (M, C, P, G, H and W), the surroundings of which have the least degree of similarity with the surroundings of other residues. Their uniqueness reflects the special role of these residues in the formation of protein structure. Thus, M is the leader residue in almost all native polypeptides. Residues P and G have allowed areas for the torsion angles of the backbone which differ essentially from those of other residues, because of the unique organisation of the proline side chain and the absence of a side chain in glycine. Cysteine residues can form covalent bonds with distant segments of the polypeptide chain. The tryptophan residue has the largest side chain, so its arrangement imposes specific requirements on the nearest neighbourhood. The side chain of H can participate in proton relay: this residue is frequently present in catalytic sites of enzymes. Substitution of any of these residues by any other is likely to result in disturbance of the local 3D structure of a protein.

With the threshold value m < 0.1, an overlap between groups of the residues is observed. Accordingly, consideration of lower levels of similarity of the surroundings is inexpedient.

In this study, a universal numerical measure of amino acid residues similarity based on the analysis of similarities of their surroundings in native protein sequences is elaborated. The classification of residues, based on this criterion, reveals essential differences from earlier classifications.

Similarity of chemical structure of side chains, such as aromaticity or presence of identical functional groups, has been demonstrated to be insufficient for allocation of the residues into groups, whereas the size of side chain can be foundational for such classification.

The clear differentiation of hydrophobic residues from others shows that hydrophobicity is the most important parameter of the amino acid residues influencing the formation of the 3D structure of a protein.

Six amino acid residues having unique surroundings are revealed. The substitution of any of them by any other residue is highly likely to result in principal changes in the local 3D organisation of a protein molecule.

The obtained results suggest that the criterion elaborated reflects structural features of amino acid residues. Thus, the proposed criterion as well as data about the environment of residues can be applied to evaluation of influence of amino acid substitutions on a 3D structure of proteins in studies utilising site-directed mutagenesis.



Fig. 3. Classification of amino acid residues according to similarity values of their surroundings. Darker lines join residues with higher degrees of similarity. Only residues with similarity score m >= 0.1 are depicted.

 

    Notes
 
1 To whom correspondence should be addressed. E-mail: alexei_nekrasov{at}mail.ru


    Acknowledgments
 
The authors are grateful to Ivan Yudushkin for his help in preparation of this manuscript.


    References
Bordo,D. and Argos,P. (1991) J. Mol. Biol., 217, 721–729.

Cserzo,M. and Simon,I. (1989) Int. J. Pept. Protein Res., 34, 184–195.

Hurley,J.H., Baase,W.A. and Matthews,B.W. (1992) J. Mol. Biol., 224, 1143–1159.

Janin,J. (1979) Nature, 277, 491–492.

Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716–738.

Kyte,J. and Doolittle,R. (1982) J. Mol. Biol., 157, 105–132.

Lim,W.A., Farruggio,D.C. and Sauer,R.T. (1992) Biochemistry, 31, 4324–4333.

Murphy,L.R., Wallqvist,A. and Levy,R.M. (2000) Protein Eng., 13, 149–152.

Pace,C.N. (1992) J. Mol. Biol., 226, 29–35.

Poroykov,V.V., Esipova,N.G. and Tumanyan,V.G. (1976) Mol. Biophys. (Moscow), 21, 397–400 (in Russian).

Rose,G., Geselowitz,A., Lesser,G., Lee,R. and Zehfus,M. (1985) Science, 229, 834–838.

Rose,G.D. and Wolfenden,R. (1993) Annu. Rev. Biophys. Biomol. Struct., 22, 381–409.

Taylor,W.R. (1986) J. Mol. Biol., 188, 233–258.

Topham,C.M., Srinivasan,N. and Blundell,T.L. (1997) Protein Eng., 10, 7–21.

Wolfenden,R., Andersson,L., Cullis,P. and Southgate,C. (1981) Biochemistry, 20, 849–855.

Zhang,X.-J., Baase,W.A. and Matthews,B.W. (1992) Protein Sci., 1, 761–776.

Received August 21, 2000; revised February 23, 2001; accepted March 12, 2001.




This article has been cited by other articles:


A. Figureau, M.A. Soto, and J. Toha
A pentapeptide-based method for protein secondary structure prediction
Protein Eng. Des. Sel., February 1, 2003; 16(2): 103 - 107.