Protein Engineering, Vol. 14, No. 2, 75-79,
February 2001
© 2001 Oxford University Press
COMMUNICATION |
Using subsite coupling to predict signal peptides
Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, MI 49007-4940, USA. E-mail: kuo-chen.chou{at}am.pnu.com
| Abstract |
|---|
|
|
|---|
Given a nascent protein sequence, how can one predict its signal peptide or `Zipcode' sequence? This is a first important problem for scientists to use signal peptides as a vehicle to find new drugs or to reprogram cells for gene therapy. Based on a model that takes into account the coupling effect among some key subsites, the so-called {3, 1, +1} coupling model, a new prediction algorithm is developed. The overall rate of correct prediction for 1939 secretory proteins and 1440 non-secretary proteins was over 92%. It has not escaped our attention that the new method may also serve as a useful tool for helping investigate further many unclear details regarding the molecular mechanism of the ZIP code protein-sorting system in cells.
Keywords: {3, 1, +1} coupling/non-secretory proteins/secretory proteins/`Zipcode' sequence
| Introduction |
|---|
|
|
|---|
The knowledge of protein signals can be used to reprogram cells in a specific way for future cell and gene therapy. Protein signals have become a crucial tool for researchers to construct new drugs that are targeted to a particular organelle to correct a specific defect. For example, by adding a specific tag to the desired proteins, one can tag them for excretion, making them much easier to harvest (Hagmann, 1999
The existing methods in this area are based mostly on the use of neural networks (Claros et al., 1997
; Nielsen et al., 1999
; Nakai, 2000
). They are actually the application of machine learning techniques. As pointed out by King (1996), the advantages of neural network prediction methods are that they are `readily available' and `often successful in practice'. He also pointed out that the disadvantages are that `there is little use of chemical or physical theory', the methods have `very poor explanatory powera Hinton diagram means nothing to a protein chemist' and `they are statistically rather poorly characterized'. Besides, although the computational costs for training the networks were considerably higher, the prediction accuracy thus obtained was not higher (and sometimes even lower) than the analytical methods. The current study was initiated in an attempt to develop an automated method based on the sub-site coupling principle that can be used to identify signal peptides faster and more accurately.
| Materials and methods |
|---|
|
|
|---|
Signal peptides comprise the N-terminal part of the secretory protein chain. They control the entry of virtually all proteins to the secretory pathway, in both eukaryotes and prokaryotes (Gierasch, 1989
1, ..., 3, 2, 1, +1, +2, ...,
2 (Figure 3
1, +
2]. When sliding the scaled window [
1, +
2] along a sequence of n residues, one can consecutively highlight n (
1 +
2) + 1 different sequences. Note that for the current study the identification of cleavage site is very important because it is directly correlated with a correct prediction of the signal peptide. For example, instead of the site (-1, +1), if the cleavage site is identified at (2, 1) or (+1, +2), then the corresponding signal peptide thus derived will be one residue shorter or longer than the actual one (Figure 1
1, +
2] along a protein sequence of n residues, one can generate one, and only one, secretion-cleavable segment and n (
1 +
2) non-secretion-cleavable segments if the protein is secretory, but n (
1 +
2) + 1 non-secretion-cleavable segments if it is non-secretory. All the secretion-cleavable segments form a cleavable or positive set denoted by S+ and all the non-secretion-cleavable segments form a non-cleavable or negative set S.
|
|
|
Segments generated by sliding the scaled window [
1, +
2] along protein sequences can be generally expressed as
![]() | (1) |
where R
1 represents the residue at the scale
1, R1 the residue at the scale 1, R+1 the residue at the scale +1 and so forth.
If the amino acid residue at each of the segment subsites (Equation 1
) can be treated as an independent element, i.e. there is no coupling at all among these subsites, then its attribute to the cleavable set S+ and that to the non-cleavable set S can be formulated, respectively, as
![]() | (2) |
![]() |
![]() |
![]() |
![]() |
![]() |
where Pi+(Ri) is the probability of amino acid Ri occurring at the subsite i (=
1, ..., 3, 2, 1, +1, +2, ..., +
2) for the secretion-cleavable segments and Pi(Ri) the corresponding probability for the non-secretion-cleavable segments. The values of the former can be derived from a positive training data set S0+ consisting of only secretion-cleavable segments and the values of the latter can be derived from a negative training data set S+0 consisting of only non-cleavable segments. The subscript 0 of
indicates that the attribute function is formed by independent probabilities in which no coupling effect between subsites is included, as shown by the right-hand side of Equation 2
. However, in reality the protein subsites are often coupled with one another. Therefore, it is instructive to conduct a statistical analysis for the 1939 secretory protein sequences retrieved from Nielsen et al. (1997). The result thus obtained is illustrated in Figure 4
, from which we can see that the amino acid residues at the subsites 3, 1 and +1 are mostly occupied by Ala. Furthermore, according to the detailed numbers generated through the statistical analysis, of the 1939 protein sequences, the occurrence frequencies of Ala at the subsites 3, 1 and +1 are 667, 1084 and 397, respectively, while the occurrence frequencies of the other 19 amino acids at these subsites are relatively much lower. Besides, all these three subsites are very close to the cleavage site (Figure 1
). This suggests that a highly special match between the signal peptidase and the secretory protein at the subsites 3, 1 and +1 is required during the cleavage process. Accordingly, to establish a powerful method for predicting the signal peptides, the coupling among these three key subsites, i.e. the {3, 1, +1} coupling, must be taken into account. Thus, Equations 2a and 2b
should be modified to
|
![]() | (3) |
![]() |
![]() |
![]() |
![]() |
![]() |
respectively, where Pi+(Ri) and Pi(Ri) are the same as those in Equation 2
. P+1(R1|R3) is the probability of amino acid R-1 occurring at the subsite 1, given that R-3 has occurred at the subsite 3; P++1(R+1|R1) is the probability of amino acid R+1 occurring at the subsite +1, given that R1 has occurred at the subsite 1. Their values can be derived from a positive training data set S+0 consisting of only secretion-cleavable peptides. Also, P1(R1|R3) and P+1(R+1|R1) have the same meaning as P+1(R1|R3) and P++1(R+1|R1) except that they are derived from a negative training data set S0 consisting of only non-cleavable peptides.
Thus, for a given peptide sequence as defined in Equation 1
, if its attribute function to the positive training set S0+ is greater than that to the negative training set S0, i.e.
+ >
, then the sequence is predicted to be secretion-cleavable; otherwise, it is predicted to be non-secretion-cleavable. We define a discriminant function
, given by
![]() | (4) |
![]() |
![]() |
where w+ and w are the weight factors for the attribute functions derived from the positive training data set S+0 and negative training data set S0, respectively. If there is no special reason, they are generally set to be one i.e. w+ = w = 1. Thus, the criterion of predicting the seretion-cleavability for a given peptide sequence can be formulated as follows:
![]() |
![]() |
During the training process, the parameters
1 and
2 can be changed so as to find the optimal prediction quality. Once a secretion-cleavable peptide is predicted, the corresponding cleavage site and signal peptide are automatically obtained as described above (cf. Figures 1 and 3a![]()
).
| Results and discussion |
|---|
|
|
|---|
To show the power of the key-subsites-coupled algorithm, the following two criteria should be followed: (1) using a good data set that is accessible to the public and (2) comparison with the best result reported in the literature. The data set investigated by Nielsen et al. (1997) satisfies the first criterion; it can be retrieved from an FTP server at ftp://virus.cbs.dtu.dk/pub/signalp. They consist of 1939 secretory proteins and 1440 non-secretory proteins. The former contains 416 human, 1011 eukaryote, 105 Escherichia coli, 266 Gram negative and 141 Gram positive proteins, and the latter 251 human, 820 eukaryote, 119 E.coli, 186 Gram negative and 64 Gram positive proteins. Redundant sequences were removed to guarantee that no pairs of homologous sequences exist in the data set. As treated by Nielsen et al. (1997), for the secretory proteins, the sequence of the signal peptide and the first 30 amino acids of the mature protein were included in the data set, whereas for the non-secretory proteins, the first 70 amino acids of each sequence were included. According to their report, the average rate of correct prediction for the cleavage site location by the neural network method was 71.54%. This is the highest success rate so far reported for such a large data set available to the public. Therefore, the result reported by Nielsen et al. (1997) also satisfies the second criterion. To compare the prediction quality at an equivalent condition, we used the same data set as used by Nielsen et al. (1997).
The rate of correct prediction for the signal peptide set and non-signal peptide set are given by
![]() | (6) |
![]() |
where N + represents the total number of signal peptides and m+ is the number of signal peptides missed in prediction; N is the total number of non-signal peptides and m is the number of non-signal peptides incorrectly predicted as signal peptide. The overall rate of correct prediction concerned is given by
![]() | (7) |
The prediction quality was examined by the standard testing procedure in statistics (Mardia et al., 1979
), that is, a combination of the self-consistency and jackknife tests. In the former, the signal peptide of each protein in a given data set was predicted using the parameters derived from the same data set, the so-called training data set, whereas in the latter, each protein in the training data set was singled out in turn as a `test protein' and all the rule-parameters were derived from the remaining proteins. Compared with the independent data set test and sub-sampling test often adopted in biology, the jackknife test is considered to be the most effective method for cross-validation in statistics (Mardia et al., 1979
). This is because in the independent data set test, the selection of a testing data set is arbitrary and the accuracy thus obtained lacks an objective criterion unless the testing data set is sufficiently large (Chou and Zhang, 1995
). As for the sub-sampling test in which a given data set is divided into several subsets, the problem is that the number of possible divisions might be too large to be handled. For example, in the treatment by Nielsen et al. (1977), each data set was divided into five approximately equal size parts and then every network run was carried out with one part as test data and the other four parts as training data. The performance measures were then calculated as an average over the five different data set divisions. Thus, even for the data of only secretory proteins, the number of possible combinations would be
=
1x
2x
3x
4x
5, where
1 = 416!/(83!83!83!-83!84!),
2 = 1011!/(202!202!202!202!202!),
3 = 105!/(21!21!21!21!21!),
4 = 266!/(53!53!53!53!54!) and
5 = 141!/(28!28!28!28!29!). Of
1,
2,
3,
4 and
5, the smallest is
3
3.1x1069, implying
would be > >15.5x10345. It is impossible for any existing computer to handle such a huge number of combinations. In fact in any practical sub-sampling tests as performed by Nielsen et al. (1997), only a very small fraction of the possible combinations were investigated and the results thus obtained could not avoid a considerable arbitrariness. Accordingly, the testing procedure adopted here is much more objective and rigorous.
Prediction was performed by selecting different parameters for the scaled window [
1, +
2]. Preliminary tests indicated that for a given
1 the optimal result for
+ was obtained when
2 = 2. The predicted results by both self-consistency and jackknife tests with different values of
1 are given in Table I
, from which we can see that the overall success rate
is improved with increase in
1. However, if
1 is too large, many short signal peptides will be excluded. For example, two signal peptides were excluded when
1 = 10, five when
1 = 12, six when
1 = 13, eight when
1 = 14, 13 when
1 = 15, 52 when
1 = 16 and 186 when
1 = 18. Each of these excluded signal peptides was counted as an unsuccessful prediction event, contributing to the reduction of the success rate for the prediction of signal peptides. As a consequence,
+ was gradually reduced when
1
16 (Table I
). As a compromise, we select
1 = 13, 14 or 15 and
2 = 2 as the optimal parameters for the scaled window [
1, +
2]. When
1 and
2 are within these values, the success rates
+ (Equation 6
) for the signal peptide set are over 93 and 89% by self-consistency and jackknife tests, respectively, while the corresponding success rates
(Equation 6
) for the non-signal peptide set are both over 92%. Also, the overall success rates (Equation 7
) for the cleavage site location by both self-consistency and jackknife tests are over 92%.
|
Besides the neural network (NN) method proposed by Nielsen et al. (1997), there are some other methods, such as the simple weight matrix method (von Heijne, 1986
| Conclusion |
|---|
|
|
|---|
Since the present model has explicitly incorporated the coupling among the subsites 3, 1 and +1 and all these subsites are very close to the cleavage site, it can be directly used for investigating the protein secretion-cleaved mechanism by signal peptidase. The present model can also serve as a useful vehicle for helping further investigate many unclear details regarding the molecular mechanism of the ZIP code protein-sorting system in cells. Furthermore, since signal peptides are the key in determining the subcellular location of proteins, the {3, 1, +1} model might have some impact in improving the prediction quality of protein subcellular location (Cedano et al., 1997
| Acknowledgments |
|---|
Illuminative discussions with Dr A.P.Elhammer and Dr Jinhe Li are gratefully acknowledged. The author also thanks Raymond B.Moeller, Cynthia A.Ludlow and Wendy Vanderheide for their help in drawing the figures.
| References |
|---|
|
|
|---|
Baldi,P. and Brunak,S. (1998) Bioinformatics: the Machine Learning Approach. MIT Press, Cambridge, MA.
Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594600.[Web of Science][Medline]
Chou,K.C. (2000) Curr. Protein Pept. Sci., 1, 171208.[Medline]
Chou,K.C. and Elrod,D.W. (1998) Biochem. Biophys. Res. Commun., 252, 6368.[Web of Science][Medline]
Chou,K.C. and Elrod,D.W. (1999a) Protein Eng., 12, 107118.
Chou,K.C. and Elrod,D.W. (1999b) Proteins: Struct. Funct. Genet., 34, 137153.[Web of Science][Medline]
Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275349.[Web of Science][Medline]
Claros,M.G., Brunak,S. and von Heijne,G. (1997) Curr. Opin. Struct. Biol., 7, 394398.[Web of Science][Medline]
Gierasch,L.M. (1989) Biochemistry, 28, 923930.[Medline]
Hagmann,M. (1999) Science, 286, 666666.
King,R.D. (1996) In Sternberg,M.J.E. (ed.), Protein Structure Prediction: a Practical Approach. IRL Press, Oxford, pp. 7997.
Ladunga,I. (1999) Bioinformatics, 15, 10281038.
Mardia,K.V., Kent,J.T. and Bibby,J.M. Multivariate Analysis. Academic Press, London, 1979, pp. 322 and 381.
Nakai,K. (2000) Adv. Protein Chem., 54, 277344.[Web of Science][Medline]
Nielsen,H., Engelbrecht,J., Brunak S. and von Heijne,G. (1997) Protein Eng., 10, 16.
Nielsen,H., Brunak S. and von Heijne,G. (1999) Protein Eng., 12, 39.
Rapoport,T.A. (1992) Science, 258, 931936.
Reinhardt,A. and Hubbard,T. (1998) Nucleic Acids Res., 26, 22302236.
von Heijne,G. (1986) Nucleic Acids Res., 14, 46834690.
Received October 13, 2000; revised November 29, 2000; accepted December 8, 2000.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
C. S. Kim, B. G. Hunter, J. Kraft, R. S. Boston, S. Yans, R. Jung, and B. A. Larkins A Defective Signal Peptide in a 19-kD {alpha}-Zein Protein Causes the Unfolded Protein Response and an Opaque Endosperm Phenotype in the Maize De*-B30 Mutant Plant Physiology, January 1, 2004; 134(1): 380 - 387. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Liang, C. VanValkenburgh, X. Chen, C. Mullins, L. Van Kaer, N. Green, and H. Fang Genetic Complementation in Yeast Reveals Functional Similarities between the Catalytic Subunits of Mammalian Signal Peptidase Complex J. Biol. Chem., December 19, 2003; 278(51): 50932 - 50939. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Fan, C. Y. Wu, C. W. Chen, T. W. Chang, and C. Lim Preparing a human membrane and secreted protein-enriched cDNA library using PCR primers derived from a genomic database Nucleic Acids Res., November 15, 2001; 29(22): e114 - e114. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



























