Protein Engineering, Vol. 12, No. 10, 807-810,
October 1999
© 1999 Oxford University Press
Short Communications |
Skewed distribution of protein secondary structure contents over the conformational triangle
1 Department of Physics, Tianjin University, Tianjin 300072 and 3 Department of Epidemiology and Biostatistics, Tianjin Cancer Institute and Hospital, Tianjin 300060, China
| Abstract |
|---|
A conformational triangle method is presented to analyze the secondary structure contents of 1028 structurally known proteins in the non-redundant data set of the recent 25% PDB_SELECT. The secondary structure contents of each protein are mapped on to a point in the triangle. It was found that the distribution of the 1028 points is strongly skewed in the triangle and about 42% of the whole area is empty; this empty region is called the forbidden area. The detailed border between the allowable and forbidden areas was calculated. A possible explanation of the skewed distribution is discussed. The distributions of the mapping points for enzymes and non-enzymes in this non-redundant data set are compared. It was found that a necessary rather than a sufficient condition for an enzyme molecule is that its coil content must be ≥0.223. It is hoped that the skewed distribution observed here could be used to test the secondary structure and threading predictions.
Keywords: coil content/conformational triangle/enzymes/flexibility/forbidden area/helix content/non-enzymes/strand content
| Introduction |
|---|
Biological data are increasing exponentially, and we are faced with the question of what they mean. Analyzing these data is a severe challenge. The great accumulation of biological data, including detailed structural data for more than 7000 proteins, provides a chance to discover new knowledge. Knowledge discovery and data mining (KDDM) are the main subject of bioinformatics today. This paper represents typical KDDM work: it looks for empirical rules describing the relationships among helix, strand and coil compositions, based on more than 1000 proteins of a non-redundant data set whose three-dimensional structures are currently available in the PDB.
The helix, strand and coil compositions of a protein are the fractions of residues in the conformations of α-helix, β-strand and coil, respectively, where turns are treated as coil. There are different ways to assign one of the above three secondary structure types to each residue in a protein, based on its three-dimensional structural data (Kabsch and Sander, 1983; Richards and Kundrot, 1988; Sklenar et al., 1989). In this paper, the method of Kabsch and Sander is used, i.e. the DSSP program is used to compute the secondary structures of proteins (Kabsch and Sander, 1983). The conformations H, G and I in the output file of the DSSP program are treated as helix, E and B as strand and all the remainder as coil. Hence coils defined here include turns. This is a simplified treatment only.
| Database and method |
|---|
The PDB is a very biased and highly redundant database. To obtain a reliable result, a non-redundant database should be used. Here the recent 25% PDB_SELECT protein database of Hobohm et al. (1992) is used, in which the pairwise sequence identity is <25%. The version used here is the December 1998 release, which contains 1028 proteins. We obtained these data via the web site ftp://ftp.embl-heidelberg.de/pub/databases/pdb_select.
Since three real numbers representing the contents of helix, strand and coil are associated with each protein, we have to analyze 3 × 1028 = 3084 values, which would occupy several printed pages. Finding something useful in such a large amount of data is a difficult task. The strategy used is to visualize these data by a graphic technique. To this end, the concept of a conformational triangle is introduced, by which the secondary structure composition of a protein (corresponding to three numbers) is mapped on to a point on a two-dimensional plane. Accordingly, a great amount of data can be studied in a perceivable form.
For convenience, the contents of α-helix, β-strand and coil in a protein are denoted by α, β and c, respectively. Obviously, α + β + c = 1. This means that among the three real numbers only two are independent. This provides a method to map the secondary structure composition of a protein into a regular triangle. Consider the regular triangle ABC with its height equal to 1, as shown in Figure 1. It is well known that the sum of the distances of any point within this triangle to the three sides is exactly equal to 1. Let the distances of a point P to the sides BC, AC and AB be equal to α, β and c, respectively. The point P constitutes a mapping of the secondary structure composition of the protein studied. This mapping is a one-to-one correspondence. A Cartesian coordinate system is set up, in which the origin O is at the center of the triangle with the x-axis parallel to the side AB. The coordinates of the point P(x, y) may be expressed in terms of α and β as follows:
$$x = \frac{\beta - \alpha}{\sqrt{3}}, \qquad y = \frac{2}{3} - \alpha - \beta \qquad (1)$$
where α and β are the contents of α-helix and β-strand for the protein studied. In this way, the points representing the secondary structure contents of the proteins studied are distributed within the triangle ABC, which is called the conformational triangle hereafter. Consequently, some relationships among helix, strand and coil contents are found by studying the distribution of the mapping points.
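As an illustration of the mapping, the following Python sketch converts (α, β) to a point (x, y) and back. It assumes x = (β − α)/√3 and y = 2/3 − α − β, i.e. the pure-helix vertex A at the left end of the base AB, the pure-strand vertex B at the right end and the pure-coil vertex C at the top. This orientation is an assumption inferred from the geometric definitions in the text (height 1, origin at the center, c = y + 1/3); the function names are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)


def to_triangle(alpha, beta):
    """Map helix and strand contents to the point (x, y) in the triangle.

    The coil content is implied by c = 1 - alpha - beta.
    """
    if alpha < 0.0 or beta < 0.0 or alpha + beta > 1.0 + 1e-12:
        raise ValueError("contents must be non-negative and sum to at most 1")
    x = (beta - alpha) / SQRT3
    y = 2.0 / 3.0 - alpha - beta  # equivalently c - 1/3
    return x, y


def from_triangle(x, y):
    """Inverse mapping: distances of (x, y) to the sides BC, AC and AB."""
    alpha = (2.0 / 3.0 - y - SQRT3 * x) / 2.0  # distance to side BC
    beta = (2.0 / 3.0 - y + SQRT3 * x) / 2.0   # distance to side AC
    c = y + 1.0 / 3.0                          # distance to side AB
    return alpha, beta, c


# The three vertices correspond to pure helix, pure strand and pure coil.
vertex_a = to_triangle(1.0, 0.0)  # (-1/sqrt(3), -1/3)
vertex_b = to_triangle(0.0, 1.0)  # (+1/sqrt(3), -1/3)
vertex_c = to_triangle(0.0, 0.0)  # (0, 2/3)
```

Because the inverse exists for every point of the triangle, the mapping is one-to-one, and x ranges over [−1/√3, 1/√3], roughly [−0.577, 0.577].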
| Results and discussion |
|---|
The distribution of the 1028 mapping points representing the secondary structure contents of the 1028 proteins in the conformational triangle is shown in Figure 1.

[...] α* and β*, respectively. We calculate x* and y* by using Equation 1. If y* ≥ y(x*), the mapping point is within the allowable area; otherwise, if y* < y(x*), the mapping point is within the forbidden area. As an example, consider the prion protein. It is well known that the prion protein has two possible structures, PrPC and PrPSc (Prusiner, 1982). [...] protein with α* = 0.40 and β* ≈ 0, whereas PrPSc is an αβ protein with α* = 0.30 and β* = 0.43, according to the experimental report using FTIR and CD techniques (Pan et al., 1997). [...] y(x*). Hence their mapping points are all situated in the allowable area (see Figure 2).
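The allowable-area test (y* ≥ y(x*)) can be sketched as follows. The borderline function of Equation 2 is not reproduced here, so the sketch takes the border as a caller-supplied function; the flat border used in the demonstration is a placeholder built from the lowest coil cut-off quoted in this paper (0.059) and is NOT the paper's borderline. The mapping assumes x = (β − α)/√3 and y = 2/3 − α − β, and all function names are hypothetical.

```python
import math


def mapping_point(alpha, beta):
    """Mapping point (x*, y*), assuming x = (beta - alpha)/sqrt(3), y = 2/3 - alpha - beta."""
    return (beta - alpha) / math.sqrt(3.0), 2.0 / 3.0 - alpha - beta


def in_allowable_area(alpha, beta, border):
    """Return True if y* >= y(x*), where `border` is a callable x -> y(x)."""
    x_star, y_star = mapping_point(alpha, beta)
    return y_star >= border(x_star)


def placeholder_border(x):
    """PLACEHOLDER, not Equation 2: a flat border at coil content 0.059 (y = c - 1/3)."""
    return 0.059 - 1.0 / 3.0


# Helix and strand contents of the two prion conformers quoted in the text.
prion_forms = {"PrPC": (0.40, 0.0), "PrPSc": (0.30, 0.43)}
results = {
    name: in_allowable_area(a, b, placeholder_border)
    for name, (a, b) in prion_forms.items()
}

# A made-up composition with almost no coil (c = 0.02) fails the same test.
low_coil_ok = in_allowable_area(0.60, 0.38, placeholder_border)
```

To test a predicted composition against the actual border, `placeholder_border` would be replaced by an implementation of the borderline function y(x).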
The fact that 42.35% of the whole area of the conformational triangle belongs to the forbidden area is worthy of study. One possible explanation is as follows. The condition for the allowable area is y* ≥ y(x*), as mentioned above. Using Equation 1, we transform this condition to c* ≥ y(x*) + 1/3 ≡ c₀(x*), where c₀(x*) is the cut-off for the coil content. The fact that c₀(x*) > 0 for any x* ∈ [−0.577, 0.577] (the whole interval of x) indicates that proteins are not allowed to have no coils (turns). Although this is trivial, it reflects the basic fact that coils (turns) are absolutely necessary for protein folding, whereas helices and strands are not always necessary. Observing Figure 1 [...] c₀(x*) [denoted by min c₀(x*)] for different protein classes are different. Using the borderline function y(x) in Equation 2, we find that for the all-α proteins, min c₀(x*) = 0.059; for the all-β proteins, min c₀(x*) = 0.174; and for the αβ (including α/β and α + β) proteins, min c₀(x*) = 0.217. Here the definition of structural classes proposed by Nakashima et al. (1986) is taken into account. These cut-off values reflect the different intrinsic structural characteristics of helix and strand. It is well known that hydrogen bonds are important for protein folding. For the α-helix, hydrogen bonds are formed between different residues within the α-helix itself, subject to the three-dimensional constraints. In contrast, for the β-sheet, hydrogen bonds are formed between adjoining β-strands, subject to the two-dimensional constraints (Chothia et al., 1997). [...] α-helix. This is probably one of the reasons why min c₀(x*) for the all-β or αβ proteins is greater than min c₀(x*) for the all-α proteins. This reasoning is also in agreement with the following observation. Fitting the 1028 mapping points by a straight line using a least-squares technique, we find the fitting line

[Equation 3]

[Equation 4]

where α and β are the contents of helix and strand, respectively, associated with the point on the fitting line. The fact that the slope of the line in Equation 3 or 4 is greater than zero indicates that, on average, the content of β-strand is positively correlated with that of coil, whereas the content of α-helix is negatively correlated with that of coil. In other words, overall, the more strands, the more coils there are, and the more helices, the fewer coils there are. It is well known that of the three secondary structural elements the helix is generally the least flexible and the coil is the most flexible, with the strand in between (Schulz and Schirmer, 1979). [...] α-helices (Chothia et al., 1997). [...] c₀(x*), indicates that flexibility of coils is very necessary for the stable folding of proteins. Summarizing, the appearance of the forbidden area in the conformational triangle seems to be relevant to the flexibility of protein structures.
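The link between a positive slope and the stated correlations can be made explicit with a short derivation. It is a sketch under two assumptions: the fitted line is written generically as y = kx + b, where k and b are placeholders rather than the fitted values, and the mapping is x = (β − α)/√3, y = 2/3 − α − β, so that c = y + 1/3.

```latex
\begin{align*}
y &= kx + b \\
c - \tfrac{1}{3} &= \frac{k}{\sqrt{3}}\,(\beta - \alpha) + b \\
c &= \tfrac{1}{3} + b + \frac{k}{\sqrt{3}}\,\beta - \frac{k}{\sqrt{3}}\,\alpha \\
\intertext{and, eliminating $c$ with $c = 1 - \alpha - \beta$,}
\tfrac{2}{3} - b &= \Bigl(1 - \frac{k}{\sqrt{3}}\Bigr)\alpha + \Bigl(1 + \frac{k}{\sqrt{3}}\Bigr)\beta .
\end{align*}
```

For k > 0 the third line shows that, along the fitted line, the coil content rises as β − α grows: it increases with the strand content and decreases with the helix content, which is the trend described in the text.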
Proteins are thought to be structurally and dynamically flexible molecules. The flexibility of some (not all) proteins is necessary to their functions, especially for enzymatic functions (Tsou, 1986). To illustrate this, the mapping points of the 441 enzymes in the recent 25% PDB_SELECT protein database of Hobohm et al. (1992) are shown in Figure 2. Comparing the distribution in Figure 2 with that in Figure 1, we find that some mapping points with lower coil contents are 'filtered' out. There exists a threshold of coil content (denoted by cₑ) for enzymes. Based on the data in Figure 2, we find that cₑ = 0.223. A necessary rather than a sufficient condition for an enzyme molecule is that its coil content must be ≥0.223. The value of cₑ may change as the protein database is enlarged; however, a substantial deviation from 0.223 in the future is unlikely. In other words, at least about one quarter of the residues of the enzyme molecule must assume the coil (including turn) conformation. As mentioned above, coils are probably more flexible than helices and strands. Therefore, the existence of a larger threshold cₑ indicates that the functions of enzymes need more flexible conformational elements such as coils. Nevertheless, a protein with a higher coil content may not be an enzyme. We performed a statistical test to see whether the distributions of the coil contents of enzymes and non-enzymes are significantly different. The average contents of helix, strand and coil and their variances for the 441 enzymes and 587 (1028 − 441) non-enzymes were calculated and are listed in Table I. Based on these data, a t-test was performed and it was found that the two coil content distributions (enzymes compared with non-enzymes) are not significantly different at a significance level of 0.05. Summarizing, the constraint that the coil content of an enzyme must be greater than or equal to a threshold cₑ is only a necessary condition for a protein to be an enzyme, but by no means a sufficient one.
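A two-sample t-test of this kind can be computed directly from group means, variances and sizes. The sketch below uses Welch's unequal-variance form, which is an assumption: the text does not say which variant of the t-test was applied. The means and variances are made-up placeholders, not the values of Table I; only the group sizes (441 and 587) come from the text.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# PLACEHOLDER coil-content statistics (hypothetical, not from Table I).
t_stat, dof = welch_t(0.45, 0.012, 441, 0.46, 0.020, 587)

# With roughly a thousand degrees of freedom, the two-sided 0.05 critical
# value is close to the normal value of 1.96.
significant = abs(t_stat) > 1.96
```

With the actual means and variances substituted for the placeholders, the same comparison of |t| with the critical value gives the decision at the 0.05 level.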
In conclusion, the skewed distribution of the secondary structure contents over the conformational triangle reported in this paper is a worthwhile finding. The existence of a large forbidden area in the secondary structure composition space seems to be related to the flexibility of protein structures. However, an exact explanation for the forbidden area is still not available. Any future successful protein folding theory should give this phenomenon a satisfactory explanation. At present, the skewed distribution observed here could be used to test the secondary structure and threading predictions.
| Acknowledgments |
|---|
This work was supported in part by the Pandeng Project of China and a grant from the State Education Commission of China.
| Notes |
|---|
2 To whom correspondence should be addressed. E-mail: ctzhang{at}tju.edu.cn
| References |
|---|
Aguzzi,A. and Weissmann,C. (1997) Nature, 389, 795–798.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1997) Curr. Opin. Struct. Biol., 7, 369–376.
Chothia,C., Hubbard,T., Brenner,S., Barns,H. and Murzin,A. (1997) Annu. Rev. Biophys. Biomol. Struct., 26, 597–627.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.
Pan,K.M., Baldwin,M., Nguyen,J., Casset,M., Serban,A., Groth,D., Mehlhorn,I., Huang,Z., Oliva,B., Bates,P.A., Querol,E., Aviles,F.X. and Sternberg,M.J.E. (1997) J. Mol. Biol., 266, 814–830.
Prusiner,S.B. (1982) Science, 216, 136–144.
Richards,F.M. and Kundrot,C.E. (1988) Proteins, 3, 71–84.
Schulz,G.E. and Schirmer,R.H. (1979) Principles of Protein Structure. Springer, New York.
Sklenar,H., Etchebest,C. and Lavery,R. (1989) Proteins, 6, 46–60.
Tsou,C.L. (1986) Trends Biochem. Sci., 11, 427–429.
Received April 14, 1999; revised July 8, 1999; accepted July 8, 1999.