Protein Engineering, Vol. 13, No. 3, 149-152,
March 2000
© 2000 Oxford University Press
Simplified amino acid alphabets for protein fold recognition and implications for folding
Department of Chemistry, Rutgers University, Wright-Rieman Laboratories, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
| Abstract |
|---|
|
|
|---|
Protein design experiments have shown that the use of specific subsets of amino acids can produce foldable proteins. This prompts the question of whether there is a minimal amino acid alphabet which could be used to fold all proteins. In this work we make an analogy between sequence patterns which produce foldable sequences and those which make it possible to detect structural homologs by aligning sequences, and use it to suggest the possible size of such a reduced alphabet. We estimate that reduced alphabets containing 1012 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that there is little loss of the information necessary to pick out structural homologs in a clustered protein sequence database when a suitable reduction of the amino acid alphabet from 20 to 10 letters is made, but that this information is rapidly degraded when further reductions in the alphabet are made.
Keywords: minimal alphabet/protein fold recognition/sequence alignment
| Introduction |
|---|
|
|
|---|
A cell requires a large number of different proteins to execute and regulate cellular processes. Even though the structures of these individual proteins are highly complex and diverse on the atomic level, it is believed that there exists a finite number of protein folds. The constituent building blocks of these proteins are the 20 naturally occurring amino acids. It is from this set of amino acids that polypeptide chains are formed in the cell, which in turn rapidly fold into well-defined three-dimensional structures. From a combinatorial standpoint there is an almost endless variety of sequences that can be made from a 20-letter code, e.g. for a polypeptide chain of length 100 there are 20100 possible combinations. Of course, only a fraction of these chains can find unique and stable folds (Levinthal et al., 1975
-helical bundles, the recent work of Riddle et al. (1997) showed that for a small 57-residue ß-barrel-like protein, 38 out of 40 targeted amino acids could be reduced to just a handful of residues. A combinatorial chemistry approach allowed the experimentalists to sample a wide range of possible mutations that could code for both foldability and function. Two fully functional constructs were detected in which 38 out of 40 selected sites mutated into five residues, I, K, E, A and G. These sequences yielded rapidly folding proteins that were viable in vitro. Taken together, these experimental results suggest that there may exist reduced amino acid alphabets which could be used to fold many proteins by making appropriate substitutions in the original sequences. It is difficult to test this hypothesis directly on a sufficiently large number of proteins representative of the known folding families, but we can obtain insights into this problem by studying sequence patterns which characterize these protein folding families, and the effects of alphabet reduction on these sequence patterns.
A central problem of structural genomics is to predict the folding family given a newly sequenced gene. In many instances this can be accomplished by aligning the `query' sequence against a database of sequences which have been clustered into folding families according to structural criteria. We have analyzed how this sequence-based fold recognition procedure depends on the size of the amino acid alphabet from which the sequences were constructed. As discussed below, we observe the fold recognition is minimally degraded when the amino acid alphabet is reduced from 20 to 10 letters by appropriately grouping chemically similar amino acids, but the sequence-encoded information needed to differentiate folding families is rapidly degraded when further reductions in the alphabet are made.
The analogy between protein folding and protein fold recognition is, of course, incomplete. A sequence which has no detectable homology within a clustered database of proteins may well fold to a structure whose family is represented in the database divergent evolution has produced many such examples. In this sense an estimate of the minimal alphabet size based on homology detection by sequence alignments represents an upper bound. Conversely, there is no guarantee that a synthetic sequence will actually fold even if its alignment scores against sequences in a particular family are very much larger than can be expected by chance there could be kinetic barriers to the folding of such sequences, for example. Yet it is intriguing to suggest that if it is possible to construct representative sequences from a large number of different folding families which will fold using a reduced set of amino acids (a reduced alphabet), then the alignment scores of the corresponding sequence pairs will be very much larger than expected by chance. That is to say, if the different sequence patterns that encode for a diversity of folding families can be preserved when the sequences are synthesized using reduced alphabets, then it should be possible to probe this relationship by carrying out computer alignment experiments on sequences constructed with reduced alphabets and comparing these results with those for the parent native sequences.
In this work we evaluate the extent to which sequence patterns derived from reduced alphabets preserve the information needed to detect homologs in a clustered database. The amino acid reduction scheme is based on the analysis of correlations among similarity matrix elements used for sequence alignments. We find that as the alphabet size is reduced, the information encoded in the amino acid sequences responsible for protein fold recognition is degraded. We estimate that for proteins of many different families a minimal alphabet requires 1012 letters.
| Materials and methods |
|---|
|
|
|---|
The alignments between protein sequences were performed using the global alignment algorithm of Myers and Miller (1988) as coded in the FASTA program package suite by Pearson (1990). The BLOSUM50 matrix derived by Henikoff and Henikoff (1992) was used for alignments based on the 20-letter alphabet and as the starting point for constructing the reduced similarity matrices. The gap insertion and elongation parameters used for alignments were set to 12/2.
The 20-letter amino acid alphabet is reduced to smaller alphabets based on correlations indicated by the BLOSUM50 similarity matrix, i.e. amino acid pairs with high similarity scores are grouped together. The procedure for grouping like amino acids together is as follows: first, the correlation coefficients between similarity matrix elements are calculated for all pairs of amino acids, i.e. for alanine (A) and valine (V) the coefficient would be evaluated as
![]() |
Reduction of the similarity matrix based on the groupings is performed by calculating new matrix elements as the average of the appropriate old similarity matrix elements. For example, the score between a group consisting of (A) and one consisting of (ST) is computed as the average of the AS and AT terms. Thus whereas in the original similarity matrix an alignment of A with S contributes MAS to the overall alignment score, in the new similarity matrix the contribution is M12 =(MAS + MAT + MSA + MTA)/4, and alignment of A with T is equivalent to that of A with S.
Alphabet reductions derived from the similarity matrix are shown in Figure 1
. The complete group of reduced alphabets studied in addition to those delineated in the figure are as follows: 3 letters, [(LASGVTIPMC), (EKRDNQH), (FYW)]; 5 letters, [(LVIMC), (ASGTP), (FYW), (EDNQ), (KRH)]; 6 letters, [(LVIM), (ASGT), (PHC), (FYW), (EDNQ), (KR)]; 12 letters, [(LVIM), (C), (A), (G), (ST), (P), (FY), (W), (EQ), (DN), (KR), (H)]; and 18 letters, [(LM), (VI), (C), (A), (G), (S), (T), (P), (F), (Y), (W), (E), (D), (N), (Q), (K), (R), (H)]. These groupings are similar to those previously proposed from examining amino acid side-chain properties (Miyata et al., 1979
; Santibanez and Rohde, 1987
) and other similarity matrices (Collins and Coulson, 1987
; Risler et al., 1988
; Landes and Risler, 1994
).
|
Homology detection with reduced alphabets was tested using `all-against-all' alignments of sequences within the SCOP40 database, extracted by Brenner et al. (1998) from SCOP (Murzin et al., 1995
|
| Results |
|---|
|
|
|---|
Many schemes for constructing reduced alphabets based on chemical similarity have been proposed. The specifics of one reduced alphabet for a 57-residue src SH3 domain were obtained by Riddle et al. (1997) using a phage display selection strategy. This protein has a ß-barrel-like structure and provides a binding site for proline-rich peptides (Feng et al., 1994
In this work we formulate an amino acid reduction scheme based on the analysis of correlations among similarity matrix elements used for sequence alignments. Our procedure averages the matrix elements of the most closely related residues, and constructs reduced similarity matrices using these average values. Alphabet reductions derived from the underlying similarity matrix as explained in the Materials and methods section are shown in Figure 1
.
This reduction of the amino acid alphabet followed several clearly recognizable paths, i.e. initially residues with similar physical/chemical properties are grouped together; large hydrophobes (LVIM), amino acids with large and mainly hydrophobic aromatic side chains (FY[W]) and long-chain positively charged residues (KR). The 10-letter alphabet of Figure 1
contains five amino acid groups, including the three groups mentioned plus two additional hydrophilic groups, alcohols (ST) and charged/polar residues (EDNQ). Further reductions of the 20-letter code coalesce smaller residues, and ultimately the code reduces to two basic groups, hydrophobic and hydrophilic. The initial divisions of the reduced alphabets are similar to the five-letter code of Riddle et al. (1997), i.e. the I, K, E, A, G alphabet found for the ß-barrel-like protein. Down to and including the level of reduction corresponding to the 10-letter alphabet in Figure 1
, these amino acids are maintained in separate groups. Thus the specific example of the reduced alphabet that Riddle et al. determined for SH3 is consistent with the proposed simplification scheme through 10 letters, but not below 10 because of AG pairing at the eight-letter level. Figure 1
serves as a guide to the construction of reduced alphabets which may be useful for correlating sequence patterns and folding patterns in a statistical sense as observed across a large number of folding families rather than as a recipe for constructing a particular protein using a very small alphabet.
Having constructed similarity matrices corresponding to reduced alphabets, we proceed to evaluate the extent to which sequence patterns derived from the reduced alphabets preserve the patterns needed to detect homologs in a clustered database. For this analysis we use the SCOP40 clustered database (Murzin et al., 1995
; Brenner et al., 1998
), containing 1323 proteins assigned to 639 folding families; no two homologous sequences share more than 40% sequence identity in this database. Coverage versus errors per query (EPQ) plots (Brenner et al., 1998
) used to assess sequence-based methods for homology detection provide the context for our analysis of the effect of reduced alphabets on fold recognition. Figure 2
(inset) shows the coverage versus EPQ plot for the SCOP40 database constructed using the complete 20-letter amino acid code. The coverage is defined as the fraction of homologous sequence pairs that have alignment scores above a threshold, while the EPQ is defined for the same threshold as the total number of non-homologous proteins with alignment scores above the threshold divided by the total number of queries made. As shown in Figure 2
(inset), there is, for example, a 20% coverage of SCOP40 at an EPQ of 0.1 when making use of the full information content of the 20-letter code. This means that we can detect approximately 20% of the true homologs in the SCOP40 database by sequence comparison with a 10% error rate (i.e. the alignment score threshold is set to a value such that 90% of the aligned pairs with scores greater than this threshold are homologous).
The effects of alphabet reduction on protein fold recognition were tested in the following way. The similarity matrices for 10 sets of increasingly reduced alphabets obtained by grouping the amino acids as shown in Figure 1
were assembled. For each of these reduced alphabets, all-against-all sequence alignments were performed using the SCOP40 database. Coverage versus EPQ plots were evaluated for each of the datasets. As the alphabet size is reduced, the information encoded in the amino acid sequences that is responsible for the protein fold recognition is lost. One way to characterize this is to compare the coverages of SCOP40 at a chosen error rate using the different reduced alphabets. The fractional coverage retained relative to the 20-letter alphabet at an EPQ value of 0.001 is shown in Figure 2
. There is a strong non-linear dependence of the fold recognition (the coverage) on the number of amino acids in the alphabet from which the similarity matrices (and thus the sequences) were constructed. As the alphabet is reduced from 20 letters to 12 or 10, the percentage coverage retained is reduced by only ~10%; further reduction of the alphabet is accompanied by a steep loss of fold recognition. With a four-letter alphabet, the coverage of the SCOP40 database at an EPQ of 0.001 is reduced by 90% relative to the complete 20-letter code. When the alphabet is reduced to two types of residues, hydrophilic and hydrophobic, there is no detectable fold recognition.
The results shown in Figure 2
correspond to the analysis of fold recognition using the entire SCOP40 database (Murzin et al., 1995
; Brenner et al., 1998
). We also have examined the effects of alphabet reduction on fold recognition using subsets of this database corresponding to five major fold categories of the SCOP classification scheme (Murzin et al., 1995
). We did not detect a strong dependence of the results on fold type; hence this analysis does not support the suggestion that ß-sheet containing proteins are less tolerant to an amino acid reduction than
-helical proteins (Riddle et al., 1997
). However, effects due to the differences in size and distribution of sequences within families among the different folding classes could obscure differences in the dependence of the coverage-EPQ plots on the alphabet.
| Discussion |
|---|
|
|
|---|
Is there a minimum number of letters required to fold a protein? Combinatorial protein synthesis indicates that some proteins can fold and function with far fewer than 20 amino acids (Sander and Schulz, 1979
Theoretical considerations concerning the folding of heteropolymers indicate that a certain minimum complexity in the polymeric building blocks is required for folding on both kinetic and thermodynamic grounds (Bryngelson et al., 1995
; Wolynes et al., 1995
; Hinds and Levitt, 1996
; Klimov and Thirumalai, 1996
; Shaknovich, 1996
; Dill and Chan, 1997
; Wolynes, 1997
). From the results of simplified lattice model simulations, it has been postulated that a minimum of three different amino acid types is required for protein folding, However, the true minimal alphabet may well require additional complexity for the creation of the large number of protein fold types of the kind actually observed in nature. Although the present studies do not address the physics of the problem in the way that lattice simulations are designed to do, this analysis of simplified amino acid alphabets required for protein fold recognition does have implications for the protein folding problem, particularly with regard to the relationship between reduced alphabets and the diversity of fold types. In the context of protein fold recognition by sequence alignment, we find that sequences constructed from 10-letter alphabets obtained by grouping amino acids appropriately contain nearly as much information as the natural sequences do.
| Acknowledgments |
|---|
This work was supported in part by a grant from the National Institutes of Health (GM-30580).
| Notes |
|---|
1 To whom correspondence should be addressed E-mail: ronlevy{at}lutece.rutgers.edu
| References |
|---|
|
|
|---|
Betz,S.F., Raleigh,D.P. and DeGrado,W.F. (1993) Curr. Opin. Struct. Biol., 3, 601610.
Brenner,S.E., Chothia,C. and Hubbard,J.P. (1998) Proc. Natl Acad. Sci. USA, 95, 60736078.
Bryngelson,J.D., Onuchic,J.N., Socci,N.D. and Wolynes,P.G. (1995) Proteins: Struct. Funct. Genet., 21, 167195.[ISI][Medline]
Collins,J.F. and Coulson,A.F.W. (1987) In Bishop,M.J. and Rawlings,C.J. (eds), Nucleic Acid and Protein Sequence Analysis; a Practical Approach, Vol. 3. IRL Press, Washington, D.C, pp. 323358.
Davidson,A.R., Lumb,K.J. and Sauer,R.T. (1995) Nature Struct. Biol., 2, 856864.[ISI][Medline]
Dill,K.A. and Chan,H.S. (1997) Nature Struct. Biol., 4, 1019.[ISI][Medline]
Feng,S., Chen,J.K., Yu,H., Simon,J.A. and Schreiber,S.L. (1994) Science, 266, 12411247.
Heinz,D.W., Baase,W.A. and Matthews,B.W. (1992) Proc. Natl Acad. Sci. USA, 89, 37513755.
Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 1091510919.
Hinds,D.A. and Levitt,M. (1996) J. Mol. Biol., 258, 201209.[ISI][Medline]
Kamtekar,S., Schiffer,J.M., Babik,J.M. and Hecht,M.H. (1993) Science, 262, 16801685.
Klimov,D.K. and Thirumalai,D. (1996) Proteins: Struct. Funct. Genet., 26, 411441.[ISI][Medline]
Landes,C. and Risler,J. (1994) CABIOS, 10, 453454.
Levinthal,C., Wodak,S.J., Kahn,P. and Dadivanian,A.K. (1975) Proc. Natl Acad. Sci. USA, 72, 13301334.
Matthews,C.R. (1993) Annu. Rev. Biochem., 62, 139160.[ISI][Medline]
Miyata,T., Miyazawa,S. and Yasunaga,T. (1979) J. Mol. Evol., 12, 219236.[ISI][Medline]
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[ISI][Medline]
Myers,E.W. and Miller,W. (1988) CABIOS, 4, 1117.
Pearson,W.R. (1990) Methods Enzymol., 183, 6398.[ISI][Medline]
Plaxco,K.W., Riddle,D.S., Grantcharova,V. and Baker,D. (1998) Curr. Opin. Struct. Biol., 8, 8085.[ISI][Medline]
Regan,L. and Delgrado,W.F. (1988) Science, 241, 976978.
Riddle,D.S., Santiago,J.V., Bray-Hall,S.T., Doshi,N., Grantcharova,V.P., Yi,Q. and Baker,D. (1997) Nature Struct. Biol., 4, 805809.[ISI][Medline]
Risler,J.L., Delorme,M.O., Delacroix,H. and Henaut,A. (1988) J. Mol. Biol., 204, 10191029.[ISI][Medline]
Sander,C. and Schulz,G.E. (1979) J. Mol. Evol., 13, 245252.[ISI][Medline]
Santibanez,M. and Rohde,K. (1987) CABIOS, 3, 111114.
Shaknovich,E.I. (1996) Folding Des., 1, R50R54.[ISI][Medline]
Wolynes,P.G. (1997) Nature Struct. Biol., 4, 871874.[ISI][Medline]
Wolynes,P.G., Onuchic,J.N. and Thirumalai,D. (1995) Science, 267, 16191620.
Received June 1, 1999; revised November 11, 1999; accepted November 29, 1999.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
S. Idicula-Thomas, A. J. Kulkarni, B. D. Kulkarni, V. K. Jayaraman, and P. V. Balaji A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli Bioinformatics, February 1, 2006; 22(3): 278 - 284. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Doi, K. Kakukawa, Y. Oishi, and H. Yanagawa High solubility of random-sequence proteins consisting of five kinds of primitive amino acids Protein Eng. Des. Sel., June 1, 2005; 18(6): 279 - 284. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Zhang, S. Kochhar, and M. G. Grigorov Descriptor-based protein remote homology identification Protein Sci., February 1, 2005; 14(2): 431 - 444. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. C. Edgar Local homology recognition and distance measures in linear time using compressed amino acid alphabets Nucleic Acids Res., January 16, 2004; 32(1): 380 - 385. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Li, K. Fan, J. Wang, and W. Wang Reduction of protein sequence complexity by residue grouping Protein Eng. Des. Sel., May 1, 2003; 16(5): 323 - 330. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. I. Rogov and A. N. Nekrasov A numerical measure of amino acid residues similarity based on the analysis of their surroundings in natural protein sequences Protein Eng. Des. Sel., July 1, 2001; 14(7): 459 - 463. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






