Protein Engineering, Vol. 12, No. 12, 1041-1050,
December 1999
© 1999 Oxford University Press
Prediction of protein secondary structure content
Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, MI 49007-4940 and 1 Department of Computer and Information Science, Indiana University Purdue University Indianapolis, Indianapolis, IN 46202-5132, USA
| Abstract |
|---|
All existing algorithms for predicting the content of protein secondary structure elements have been based on the conventional amino-acid-composition, in which no sequence coupling effects are taken into account. In this article, an algorithm is developed for predicting the content of protein secondary structure elements based on a new amino-acid-composition, in which the sequence coupling effects are explicitly included through a series of conditional probability elements. The prediction was examined by a self-consistency test and an independent-dataset test. Both indicated a remarkable improvement when the current algorithm was used to predict the contents of α-helix, β-sheet, β-bridge, 3₁₀-helix, π-helix, H-bonded turn, bend and random coil. Examples of the improved accuracy obtained by introducing the new amino-acid-composition, as well as its impact on the study of protein structural class and biological function, are discussed.
Keywords: 1st-order coupled components/α-helix/β-sheet/β-bridge/3₁₀-helix/π-helix/H-bonded turn/bend/random coil
| Introduction |
|---|
One of the biggest challenges in molecular biology is how to predict the three-dimensional (3D) structure of a protein given only its amino acid sequence. To help reach such a goal, various approaches targeting different levels and aspects of protein structure were initiated, such as secondary structure prediction (see, for example, Chou and Fasman, 1978; Fasman, 1989), structural class prediction (see, for example, Nakashima et al., 1986; Chou, 1989; Chou, 1995; Chou and Zhang, 1995; Bahar et al., 1997; Liu and Chou, 1997), domain class prediction (Chou et al., 1998
α-helices and β-sheets in a protein are symbolized by α and β, respectively. For the category of protein structural classes, proteins with α ≥ 40% and β ≤ 5% are classified as the α-protein class; proteins with α ≤ 5% and β ≥ 40% are classified as the β-protein class; and so forth (see, for example, Chou, 1995). Now if the results of secondary structure content prediction by some method for a given protein are α = 40% and β = 0%, while the corresponding observed values are actually α = 60% and β = 5%, there is a deviation of |60% − 40%| = 20% = 0.2 and |5% − 0%| = 5% = 0.05 in the prediction of α-helix and β-sheet content, respectively. Such a deviation would represent a significant error with respect to the accuracy of secondary structure content prediction; however, in the case of structural class prediction, the protein is correctly predicted as belonging to the α-protein class without any error at all. This example illustrates the difficulty of developing an accurate method to predict the secondary structure content of proteins. Probably because of this, in contrast to structural class prediction, much less work has been done on secondary structure content prediction. It is well known that a priori knowledge of secondary structure content can provide useful information in determining protein structure. In particular, it is closely relevant to many experimental methods such as circular dichroism (CD) spectroscopy (Sreerama and Woody, 1994).
In a pioneering study, Krigbaum and Knutton (1973) introduced the multiple linear regression (MLR) algorithm to predict the secondary structure content of a protein based on its amino acid composition. Muskal and Kim (1992) approached the problem in a different way when they developed a tandem neural network method in which the protein's amino acid composition, molecular weight and heme presence were taken into account. Recently, by incorporating some nonlinear terms as well as knowledge of protein structural class, Zhang et al. (1996, 1998) proposed a new approach to predict the amount of secondary structure in a globular protein. According to their report, the predicted results of Zhang et al. (1996, 1998) are better than those of Krigbaum and Knutton (1973) and Muskal and Kim (1992). However, Zhang's method requires a priori knowledge of the structural class of the query protein in order to predict its secondary structure content, which limits its applicability. Besides, the amino-acid-composition defined in all the aforementioned methods is the 0th-order coupled composition, as defined by
$$\Psi_0 = [P(A),\ P(C),\ P(D),\ P(E),\ \ldots,\ P(Y)] \tag{1}$$
where A, C, D, E, ..., and Y represent the single-letter codes of the 20 amino acids and P(A) represents the proportion of amino acid A (alanine) in a given protein, P(C) the proportion of C (cysteine), P(D) the proportion of D (aspartic acid), and so forth. As we can see from eqn 1, each amino acid component is treated independently, i.e. the coupling effects among the 20 amino acid components are not incorporated at all. The amino-acid-composition thus defined is actually the 0th-order coupled composition, as denoted by the subscript 0 of Ψ in eqn 1.
Obviously, the 0th-order-coupled system is the lowest approximation. If we wish to incorporate the coupling effects of residues along a sequence so as to reflect more accurately the reality in a protein, how can we develop a method to predict its secondary structure content? The present study was initiated in an attempt to deal with this problem.
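As a small illustration of the structural-class example given earlier in this section, the sketch below encodes the two class rules quoted there and recomputes the deviations. It assumes the comparison signs are α ≥ 40%, β ≤ 5% for the α-protein class and α ≤ 5%, β ≥ 40% for the β-protein class; the function name `structural_class` and the label `"other"` are hypothetical and stand in for the classes the text does not spell out.

```python
def structural_class(alpha: float, beta: float) -> str:
    """Classify a protein from its helix/sheet fractions (values in [0, 1]).

    Only the two rules quoted in the text are encoded; every other case
    is lumped into a placeholder label (an assumption of this sketch).
    """
    if alpha >= 0.40 and beta <= 0.05:
        return "alpha"
    if alpha <= 0.05 and beta >= 0.40:
        return "beta"
    return "other"


# Numbers from the example in the text: predicted vs. observed content.
predicted = {"alpha": 0.40, "beta": 0.00}
observed = {"alpha": 0.60, "beta": 0.05}

# Absolute deviation in content for each element.
deviation = {k: abs(observed[k] - predicted[k]) for k in observed}

# Both the predicted and the observed contents fall in the same class,
# even though the helix content is off by about 0.2.
same_class = structural_class(**predicted) == structural_class(**observed)
```

The content error is large while the class label is unchanged, which is the asymmetry the example is meant to show.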
| Algorithm |
|---|
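When the coupling effect of a residue with those adjacent to it is taken into account, the proportion factors in eqn 1 should be replaced by the 1st-order conditional proportions and the number of factors will increase from 20 to 20 × 20 = 400; i.e. the 0th-order coupled amino-acid-composition Ψ0 should be replaced by the 1st-order coupled composition as formulated by:
$$\Psi_1 = \begin{bmatrix} P(A \mid A) & P(C \mid A) & P(D \mid A) & \cdots & P(Y \mid A) \\ P(A \mid C) & P(C \mid C) & P(D \mid C) & \cdots & P(Y \mid C) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ P(A \mid Y) & P(C \mid Y) & P(D \mid Y) & \cdots & P(Y \mid Y) \end{bmatrix} \tag{2}$$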
where P(C|A) is the proportion of amino acid C occurring along a protein sequence from the N- to the C-terminus, given that A has occurred immediately preceding it; P(D|C) is the proportion of amino acid D occurring along the same sequence, given that C has occurred just preceding D; and so forth.
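The two compositions can be sketched in a few lines of Python. This is an illustration only: it assumes each 1st-order component is counted as the fraction of all adjacent residue pairs in the sequence (so that the 400 components sum to 1, consistent with an average of 1/400 per component), and it assumes the sequence contains only the 20 standard single-letter codes. The names `composition_0th`, `composition_1st` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 single-letter codes in alphabetical order


def composition_0th(seq: str) -> list[float]:
    """20 components P(X_i): fraction of residues of each type."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def composition_1st(seq: str) -> list[float]:
    """400 components ordered by (preceding residue, residue).

    Assumption: each component is the fraction of adjacent residue pairs,
    read from the N- to the C-terminus, in which `cur` follows `prev`.
    """
    pairs = Counter(zip(seq, seq[1:]))
    total = len(seq) - 1
    return [pairs[(prev, cur)] / total
            for prev, cur in product(AMINO_ACIDS, repeat=2)]


seq = "ACDACD"  # toy sequence, not a real protein
x = composition_0th(seq)   # 20 values
y = composition_1st(seq)   # 400 values
```

For the toy sequence the adjacent pairs are AC, CD, DA, AC and CD, so only three of the 400 components are non-zero.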
Generally speaking, if the coupling effects of the μ (μ = 2, 3, ...) closest neighboring amino acid residues are to be considered, then eqn 1 should be modified to be a μth-order coupled amino-acid-composition consisting of $20^{\mu+1}$ components, each of which would correspond to a μth-order conditional proportion. As one could surmise, the analysis of a higher-order coupled system would be much more complicated. Therefore, the treatment in this paper is confined to the 1st-order coupled system; i.e. only the coupling effect of the closest adjacent amino acids is taken into account, as formulated by eqn 2.
The current method is established on the basis of eqn 2, which formulates a conditional probability contribution from each amino acid in the sequence given that it is immediately preceded by a particular one of the 20 amino acids. Accordingly, the 1st-order coupled amino-acid-composition (eqn 2) introduced here involves explicit representation of sequential properties that are not included in the conventional amino-acid-composition, or the 0th-order coupled amino-acid-composition, as formulated by eqn 1.
Suppose the 20 native amino acids are denoted by Xi (i =1, 2, ..., 20) in the alphabetical order of their single-letter codes, i.e. X1 = A, X2 = C, ..., X20 = Y, then according to the normalization condition we have
$$\sum_{i=1}^{20}\sum_{j=1}^{20} P(X_i \mid X_j) = 1 \tag{3}$$
For brevity, the 400 components in eqn 2 are denoted by y1, y2, ..., y400. The rationale of the current method is that the secondary structure content of a protein is correlated with its amino-acid-composition; however, compared with the 0th-order composition, such a correlation would be more accurately reflected in terms of the 1st-order coupled composition. Thus, the content of a secondary structural element in a protein, e.g. α-helix, can be estimated by the following equation:
$$\alpha = \frac{n_\alpha}{n} = F_\alpha(y_1, y_2, \ldots, y_{400}) \tag{4}$$
where α represents the α-helix content, $n_\alpha$ the number of residues occurring in the α-helices of a given protein and n the number of its total residues, while $F_\alpha(y_1, y_2, \ldots, y_{400})$ is a function to be determined. Expanding the function $F_\alpha$ in a Taylor series at y1 = y2 = ... = y400 = 0, we have
$$F_\alpha = (F_\alpha)_0 + \sum_{i=1}^{400}\left(\frac{\partial F_\alpha}{\partial y_i}\right)_0 y_i + \frac{1}{2}\sum_{i=1}^{400}\sum_{j=1}^{400}\left(\frac{\partial^2 F_\alpha}{\partial y_i\,\partial y_j}\right)_0 y_i y_j + \cdots \tag{5}$$
where the subscript 0 means that the value of the corresponding term is obtained by substituting y1 = y2 = ... = y400 = 0 into it. Since all yi (i = 1, 2, ..., 400) in a real protein are generally << 1, with an average equal to 1/400 = 0.0025, and the derivatives are bounded for real-world situations, the third term and above in eqn 5 can be neglected. Thus, we approximately have
$$\alpha = c_0^\alpha + \sum_{i=1}^{400} c_i^\alpha y_i \tag{6}$$
where $c_0^\alpha = (F_\alpha)_0$ and $c_i^\alpha = (\partial F_\alpha/\partial y_i)_0$. The coefficients $c_i^\alpha$ (i = 0, 1, ..., 400) can be determined through a training dataset by the following procedure.
Suppose a given training dataset contains N proteins identified by an index k (k = 1, 2, ..., N), and the 400 coupled components of the kth protein are denoted by yk,1, yk,2, ..., yk,400. In order to determine the coefficients of eqn 6, we define an objective function given by
$$Q_\alpha = \sum_{k=1}^{N}\left(d_k^\alpha - c_0^\alpha - \sum_{i=1}^{400} c_i^\alpha y_{k,i}\right)^2 \tag{7}$$
where $d_k^\alpha$ is the content of α-helices in the kth protein, derived here from the DSSP file (Kabsch and Sander, 1983) of the kth protein in a given training dataset, as done in Chou et al. (1998). The process of determining the coefficients $c_i^\alpha$ (i = 0, 1, ..., 400) is actually a process of finding the minimum of $Q_\alpha$, and hence a process of solving the following set of linear algebraic equations
$$\frac{\partial Q_\alpha}{\partial c_i^\alpha} = 0 \qquad (i = 0, 1, \ldots, 400) \tag{8}$$
Actually, the procedure adopted here is essentially the least squares solution to the multiple regression problem. It can be shown that eqn 8 usually has a unique solution if N, the number of proteins in the training dataset, is equal to or greater than 401 (see Appendix A); singular value decomposition may also be used to obtain the least squares solution. Accordingly, all the coefficients $c_i^\alpha$ (i = 0, 1, ..., 400) in eqn 7 can be derived. Substituting them into eqn 6, we immediately obtain the desired equation for predicting the content of α-helices in a query protein.
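The least-squares step can be sketched as follows. This is a minimal pure-Python illustration that solves the normal equations by Gaussian elimination; it is not the authors' implementation. The function name `fit_least_squares` and the toy data (two components per protein instead of 400, four proteins) are hypothetical.

```python
def fit_least_squares(Y, d):
    """Return [c0, c1, ..., cm] minimising sum_k (d_k - c0 - sum_i c_i * Y[k][i])**2."""
    X = [[1.0] + list(row) for row in Y]  # prepend the dummy component y_k0 = 1
    m = len(X[0])
    # Normal equations (X^T X) c = X^T d, stored as an augmented m x (m + 1) matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(m)]
         + [sum(r[i] * dk for r, dk in zip(X, d))]
         for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: too few independent training proteins")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for j in range(col, m + 1):
                A[r][j] -= f * A[col][j]
    # Back substitution.
    c = [0.0] * m
    for i in reversed(range(m)):
        c[i] = (A[i][m] - sum(A[i][j] * c[j] for j in range(i + 1, m))) / A[i][i]
    return c


# Toy training set: observed contents generated from c = [0.05, 1.0, 0.5].
Y_train = [[0.1, 0.3], [0.2, 0.1], [0.4, 0.2], [0.3, 0.4]]
d_train = [0.05 + 1.0 * y1 + 0.5 * y2 for y1, y2 in Y_train]
coeffs = fit_least_squares(Y_train, d_train)
```

With 400 components per protein, an SVD-based least-squares routine from a numerical library would generally be a more robust choice than explicit normal equations.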
Following a similar procedure, we can also predict the content of β-sheets, their parallel and antiparallel fractions, as well as the content of β-bridges, 3₁₀-helices, π-helices, H-bonded turns, bends and random coils for a given protein. Accordingly, in parallel to eqn 6, a general formulation for predicting all the secondary structure elements can be written as
$$\xi = c_0^\xi + \sum_{i=1}^{400} c_i^\xi y_i \tag{9}$$
where ξ is a general symbol for all the secondary structure elements, and $c_i^\xi$ (i = 0, 1, 2, ..., 400) are also called the 1st-order coupled 'rule-parameters' for predicting the content of the secondary structural element ξ. When ξ = 'α', eqn 9 will yield the content of α-helices; when ξ = 'β', the content of β-sheets; when ξ = 'parallel', the content of parallel β-sheets; when ξ = 'antiparallel', the content of antiparallel β-sheets; when ξ = 'bridge', the content of β-bridges; when ξ = '310', the content of 3₁₀-helices; when ξ = 'π', the content of π-helices; when ξ = 'H-bond', the content of H-bonded turns; when ξ = 'bend', the content of bends; and when ξ = 'coil', the content of random coils. Note that by definition the secondary structure content must be within the range 0 to 1 (see eqn 4). Therefore, if it is found that ξ > 1 or ξ < 0, the value of ξ should be set to 1 or 0, respectively. However, such cases occurred very rarely.
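The linear prediction and the clipping rule just described can be sketched as a single function. The name `predict_content` and the example coefficients are hypothetical; a real call would pass 401 rule parameters and 400 coupled components.

```python
def predict_content(coeffs, y):
    """Linear prediction c0 + sum_i c_i * y_i, clipped to the range [0, 1].

    coeffs = [c0, c1, ..., cm] are the rule parameters for one element,
    y = [y1, ..., ym] are the composition components of the query protein.
    """
    if len(coeffs) != len(y) + 1:
        raise ValueError("expected one more coefficient than components")
    raw = coeffs[0] + sum(c * yi for c, yi in zip(coeffs[1:], y))
    # Content is a fraction of residues, so out-of-range values are set to 0 or 1.
    return min(1.0, max(0.0, raw))


inside = predict_content([0.05, 1.0, 0.5], [0.1, 0.3])  # raw value lies inside [0, 1]
too_high = predict_content([0.9, 1.0], [0.5])           # raw value above 1 is set to 1
too_low = predict_content([-0.2, 0.1], [0.5])           # raw value below 0 is set to 0
```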
In order to facilitate comparison, we also give the corresponding equations based on the conventional amino-acid-composition (eqn 1). By following procedures parallel to the above derivation, these equations can be easily obtained as follows:
$$\xi = b_0^\xi + \sum_{i=1}^{20} b_i^\xi x_i \tag{10}$$
where $x_i$ is actually the proportion of amino acid Xi in a protein whose secondary structure contents are to be predicted (see eqn 1), and $b_i^\xi$ (i = 0, 1, 2, ..., 20) are the 0th-order coupled 'rule-parameters' for predicting the content of the secondary structural element ξ, as can be derived by the following equations:
$$Q_\xi = \sum_{k=1}^{N}\left(d_k^\xi - b_0^\xi - \sum_{i=1}^{20} b_i^\xi x_{k,i}\right)^2 \tag{11}$$
$$\frac{\partial Q_\xi}{\partial b_i^\xi} = 0 \qquad (i = 0, 1, \ldots, 20) \tag{12}$$
where xk,1, xk,2, ..., xk,20 are the 20 0th-order coupled components (see eqn 1) as usually defined for the amino-acid-composition of the kth protein in the training dataset, and $d_k^\xi$ is a general symbol for the observed content of the secondary structure element ξ in the kth protein. When ξ = 'α', it becomes $d_k^\alpha$ of eqn 7, which is none other than the observed content of α-helices in the kth protein. As mentioned above, the observed value of $d_k^\xi$ (ξ = 'α', 'β', 'bridge', '310', 'π' or any other secondary structural element) can be derived from the DSSP file (Kabsch and Sander, 1983) of the kth protein in a given training dataset.
A comparison of eqns 10–12 with eqns 7–9 indicates that the sequence-coupled effects are no longer counted in the result predicted by eqn 10. This is because all the conditional probability terms, which were originally associated with the 1st-order coupled rule-parameters in eqn 9, degenerate into the independent amino-acid-composition terms (see eqns 10 and 11).
| Results and discussion |
|---|
As mentioned in the Algorithm section and shown in Appendix A, in order to find the unique solution of $c_i^\xi$ in eqn 9 (i = 0, 1, 2, ..., 400), the number of proteins in a training dataset must be greater than or equal to 401. In the current study, 628 proteins of known structure were selected for the training dataset, where the similarity between any two sequences is no more than 25%. Listed in Table I
(i = 0, 1, 2, ..., 400) for predicting the secondary structure content.
|
The results were examined through a self-consistency test and independent-dataset test. The following three errors were introduced to evaluate the prediction quality:
the average absolute error for each secondary structure element
$$\Delta_\xi = \frac{1}{N}\sum_{k=1}^{N}\left|\xi_k - d_k^\xi\right| \tag{13}$$
the standard deviation for each secondary structure element 
|
|
and the overall average error ⟨Δ⟩
$$\langle\Delta\rangle = \frac{1}{\Omega}\sum_{\xi}\Delta_\xi \tag{15}$$
where ξ = 'α', 'β', ..., or 'coil', ξk is the predicted content for the secondary structure element ξ in the kth protein, while $d_k^\xi$ is the corresponding observed content, and Ω is the total number of the secondary structure elements considered; that is, 10 for the current study.
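Two of the error measures described above can be sketched as follows, assuming the average absolute error is the mean of |predicted − observed| over the proteins in a dataset and the overall average error is the mean of those values over the secondary structure elements considered. The standard deviation is left out of this sketch. Function names and the toy numbers are hypothetical.

```python
def average_absolute_error(predicted, observed):
    """Mean of |predicted_k - observed_k| over the proteins of a dataset."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def overall_average_error(pred_by_element, obs_by_element):
    """Mean of the per-element average absolute errors."""
    errors = [average_absolute_error(pred_by_element[e], obs_by_element[e])
              for e in obs_by_element]
    return sum(errors) / len(errors)


# Toy data: two elements, two proteins each (fractions, not percentages).
pred = {"alpha": [0.40, 0.10], "beta": [0.00, 0.50]}
obs = {"alpha": [0.60, 0.10], "beta": [0.05, 0.40]}

err_alpha = average_absolute_error(pred["alpha"], obs["alpha"])
err_beta = average_absolute_error(pred["beta"], obs["beta"])
err_overall = overall_average_error(pred, obs)
```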
Self-consistency test
In this test, the rule parameters derived from the 628 proteins in Table I by eqns 7–8 were used to predict the secondary structure content of the same proteins by eqn 9. The 10 sets of 1st-order coupled rule parameters (each containing 401 coefficients) thus found for predicting the content of α-helices, β-sheets, their parallel and antiparallel proportions, β-bridges, 3₁₀-helices, π-helices, H-bonded turns, bends and random coils, respectively, are given in Appendix B. The results of the self-consistency test for the 628 proteins in Table I are given in Table II, from which we can see that the average absolute errors for the prediction of α-helices and β-sheets are 0.056 and 0.046 with standard deviations of 0.008 and 0.005, respectively. For the other secondary structure elements, except for the proportions of parallel and antiparallel β-strands, the average errors were all
0.020 with a standard deviation of
0.001. The average absolute errors for the prediction of the parallel and antiparallel β-strand portions are relatively large. However, even though the overall average error for all the 10 secondary structure elements is 0.062, excluding these two from consideration reduces the overall average error to 0.028, indicating excellent self-consistency of the 1st-order coupled composition regression algorithm. To show the prediction quality, the calculated and observed contents of α-helices and β-sheets in each of the 628 proteins are shown in Figure 1a and b, respectively.
|
|
To provide a comparison, the self-consistency test, using the same protein dataset, was also performed for the prediction algorithm based on the conventional amino-acid-composition (eqns 10–12), and the corresponding results are also listed in Table II.
Although prediction errors reported above are very small, it should be pointed out here that they are merely the results obtained by the self-consistency test based on a limited number of proteins. Using the self-consistency test, the secondary structure content of each protein from a training dataset is predicted using the coefficients derived from the same dataset. In other words, the rule parameters derived from the training dataset include information about a protein later tested. This will certainly give an overly optimistic error estimate because of the memorization effect. Nevertheless, the self-consistency test is absolutely necessary because it reflects the consistency of a prediction method, especially for its algorithm part. A prediction algorithm certainly cannot be deemed a good one if it is non-consistent. In other words, the self-consistency test is necessary but not sufficient for evaluating a prediction method. As a complement, a cross-validation examination based on an independent testing dataset is needed as given below.
Independent-dataset test
Testing on a set of proteins not present in the training dataset is important because it can reflect the effectiveness of a prediction method, especially in checking the validity of a training dataset: whether it contains sufficient information to reflect all the important features concerned so as to yield high prediction quality in application. For cross-validation, an independent testing dataset was constructed. It consisted of 52 proteins with known structures (Table III). The sequence similarity between any two proteins in this dataset, or between a protein in this dataset and any one in the training dataset (Table I), is no more than 35%. The secondary structure contents of these proteins were calculated in terms of the rule parameters derived from the proteins of the training dataset by the 0th- and 1st-order coupled algorithms, respectively. The results thus obtained for the content of α-helices and β-sheets, together with the corresponding observed values, are listed in Table III. As we can see there, for each of the 52 proteins the contents predicted by the 1st-order coupled algorithm for both α-helices and β-sheets are much closer to the observed values than those predicted by the 0th-order coupled algorithm.
|
It is intriguing to note that the improvement in prediction quality obtained by taking into account the 1st-order coupled effect may provide new structural or even biological insight by correcting the errors produced by the 0th-order coupled algorithm. For example, the hydrophobic protein from soybean (1hyp.pdb) is a typical α protein (Figure 2a
α-helical content is 47% and that of its β-sheets is 0%. But according to the results predicted by the 0th-order coupled algorithm, its overall structure would be incorrectly classified as the α/β or α + β class because the contents of its α-helices and β-sheets thus obtained were 24% and 16%, respectively (Table III
α-helices and β-sheets for the same protein were 45% and 7% (Table III
protein (Chou et al., 1998
α-helices and β-sheets are 0 and 63%. But the corresponding contents predicted by the 0th-order coupled algorithm were 24 and 23% (Table III
α-helices and β-sheets were 3 and 63%, and hence the protein would be correctly assigned as a β-protein (Chou et al., 1998
α/β (Figure 2c
α + β class (Figure 2d
|
Much evidence indicates that there is some correlation between the overall structural features of a protein and its biological function. For example, the low-frequency motion (such as an accordion-like or breathing-like motion) of α-helices, β-sheets or β-barrels in some proteins is vitally important for their biological function (Chou, 1988
α-helices and β-sheets as well as their arrangement in the protein. Also, many enzymes have the overall structure of type α/β (Farber and Petsko, 1990
α-helices and β-sheets in a protein might provide useful insights into not only its overall structural features but also its biological function. It should be pointed out that although in principle the algorithm formulated here can be used to predict the percentage of parallel and antiparallel β-sheets in a protein, the results are considerably poorer than those for the other secondary structure elements. To improve this situation, the incorporation of some special effect into the algorithm might be necessary.
| Conclusion |
|---|
The conventional amino-acid-composition as defined in eqn 1 is a 0th-order-coupled composition. In comparison, using the 1st-order-coupled composition as formulated by eqn 2 can improve the prediction quality of protein secondary structure content. For example, the average absolute errors for predicting the contents of α-helices and β-sheets were 0.056 and 0.046 in the self-consistency test (Table III
| Appendix A |
|---|
For the reader's convenience, here let us show that eqn 8 usually has a unique solution if N, the number of proteins in the training dataset, is equal to or greater than 401. Substituting eqn 7 into eqn 8, we obtain
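$$\sum_{k=1}^{N}\left(d_k^\alpha - \sum_{j=0}^{400} c_j^\alpha y_{k,j}\right) y_{k,i} = 0 \qquad (i = 0, 1, \ldots, 400)$$
where yk,0 = 1 is a dummy symbol. The above equation can be written as
$$X^{T}\left(D_\alpha - X C_\alpha\right) = 0$$
where
$$X = \begin{bmatrix} y_{1,0} & y_{1,1} & \cdots & y_{1,400} \\ y_{2,0} & y_{2,1} & \cdots & y_{2,400} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N,0} & y_{N,1} & \cdots & y_{N,400} \end{bmatrix}$$
T is the transposition operator, and
$$D_\alpha = \left[d_1^\alpha, d_2^\alpha, \ldots, d_N^\alpha\right]^{T}, \qquad C_\alpha = \left[c_0^\alpha, c_1^\alpha, \ldots, c_{400}^\alpha\right]^{T}$$
Accordingly, we have
$$X^{T}X\,C_\alpha = X^{T}D_\alpha$$
If $X^{T}X$ is invertible, $C_\alpha$ has a unique solution
$$C_\alpha = \left(X^{T}X\right)^{-1}X^{T}D_\alpha$$
The condition that $X^{T}X$ is invertible requires N ≥ 401. When N ≥ 401 and when the N proteins selected for the training dataset are not homologous to one another, $X^{T}X$ is usually invertible.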
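The requirement on N follows from a standard rank argument, sketched here for illustration. It takes X to be the N × 401 matrix whose kth row holds the dummy component and the 400 coupled components of the kth training protein; this layout is an assumption of the sketch.

```latex
\operatorname{rank}\left(X^{T}X\right) = \operatorname{rank}(X) \le \min(N,\ 401)
\quad\Longrightarrow\quad
X^{T}X \in \mathbb{R}^{401 \times 401} \text{ can be invertible only if } N \ge 401 .
```

The condition is necessary but not sufficient: the 401 columns of X must also be linearly independent across the training proteins.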
| Appendix B |
|---|
|
| Acknowledgments |
|---|
This work was supported in part by a grant from the National Science Foundation (NSF DUE-9555408). We would also like to thank the two anonymous referees, whose constructive comments were very helpful for improving the paper.
| Notes |
|---|
2 To whom correspondence should be addressed
| References |
|---|
Bahar,I., Atilgan,A.R., Jernigan,R.L. and Erman,B. (1997) Proteins, 29, 172–185.
Bode,W., Papamokos,E. and Musil,D. (1987) Eur. J. Biochem., 166, 673–692.
Bussian,B.M. and Sander,C. (1989) Biochemistry, 28, 4271–4277.
Chou,K.C. (1988) Biophys. Chem., 30, 3–48.
Chou,K.C. (1995) Proteins Struct. Funct. Genet., 21, 319–344.
Chou,K.C. (1997a) J. Peptide Res., 49, 120–144.
Chou,K.C. (1997b) Biopolymers, 42, 837–853.
Chou,K.C. and Blinn,J.R. (1997) J. Protein Chem., 16, 575–595.
Chou,K.C. and Elrod,D.W. (1999) Protein Engng, 12, 107–118.
Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,K.C., Liu,W., Maggiora,G.M. and Zhang,C.T. (1998) Proteins Struct. Funct. Genet., 31, 97–103.
Chou,P.Y. (1980) Amino Acid Composition of Four Classes of Proteins. In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Chou,P.Y. (1989) Prediction of Protein Structural Classes from Amino Acid Composition. In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 549–586.
Chou,P.Y. and Fasman,G.D. (1978) Adv. Enzymol. Relat. Subj. Biochem., 47, 45–148.
Dubchak,I., Holbrook,S.R. and Kim,S.-H. (1993) Proteins, 16, 79–91.
Farber,G.K. and Petsko,G.A. (1990) Trends Biochem. Sci., 15, 228–234.
Fasman,G.D. (1989) The Development of the Prediction of Protein Structure. In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 317–358.
Folmer,R.H., Nilges,M., Konings,R.N. and Hilbers,C.W. (1995) EMBO J., 14, 4132–4142.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Krigbaum,W.R. and Knutton,S.P. (1973) Proc. Natl Acad. Sci. USA, 70, 2809–2813.
Lehmann,M.S., Pebay-Peyroula,E., Cohen-Addad,C. and Odani,S. (1989) J. Mol. Biol., 210, 235–236.
Liu,W. and Chou,K.C. (1997) J. Protein Chem., 17, 209–217.
Liu,W. and Chou,K.C. (1998) Protein Sci., 7, 2324–2330.
Muskal,S.M. and Kim,S.-H. (1992) J. Mol. Biol., 225, 713–727.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.
Pastore,A., Saudek,V., Ramponi,G. and Williams,R.J.P. (1992) J. Mol. Biol., 224, 427–440.
Sreerama,N. and Woody,R.W. (1994) J. Mol. Biol., 242, 497–507.
Zhang,C.T., Zhang,Z. and He,Z. (1996) J. Protein Chem., 15, 775–786.
Zhang,C.T., Zhang,Z. and He,Z. (1998) J. Protein Chem., 17, 261–272.
Received March 9, 1999; revised July 24, 1999; accepted August 5, 1999.