Protein Engineering, Vol. 15, No. 12, 951-953,
December 2002
© 2002 Oxford University Press
COMMUNICATION
Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks
Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy
| Abstract |
|---|
A hybrid system (hidden neural network) based on a hidden Markov model (HMM) and neural networks (NN) was trained to predict the bonding states of cysteines in proteins starting from the residue sequence. Training was performed using 4136 cysteine-containing segments extracted from 969 non-homologous proteins of well-resolved 3D structure and without chain breaks. In a 20-fold cross-validation procedure, the prediction accuracy is as high as 80% using neural networks based on evolutionary information. When the whole protein is taken into account by means of an HMM, a hybrid system is generated, whose emission probabilities are computed using the NN output (hidden neural networks). In this case, the predictor accuracy increases up to 88%. Further, when tested on a protein basis, the hybrid system can correctly predict 84% of the chains in the data set, with a gain of at least 27% over the NN predictor.
Keywords: cysteine bonding state/disulfide bridges/hidden Markov models/hidden neural networks/neural networks
| Introduction |
|---|
The bonding state of cysteines plays an important role in stabilizing the tertiary folds of proteins and in defining protein functions. Among the amino acid residues, cysteines are unique, since they can form covalent bonds between two non-contiguous residues in the protein chain. Moreover, reduction of disulfide bridges triggers functionally relevant conformational changes (Creighton, 1996).
The contribution of the disulfide bridge to the thermodynamic stability of proteins has been described as being due to a reduction in the conformational entropy of the unfolded polypeptide chain, causing a destabilization of the unfolded state relative to the native state (for a review, see Betz, 1993), and it can be estimated both experimentally (Privalov and Gill, 1988; Freire, 1993) and theoretically (Casadio et al., 1995). Several analyses of the characteristics of disulfide bonds in proteins have been performed, including structural and sequence features and classification of connectivity (Harrison and Sternberg, 1994). These analyses strengthen the view that disulfide bonds increase the conformational stability of the protein mainly by constraining the unfolded conformation, as many experimental and theoretical studies suggest (Harrison and Sternberg, 1994; for a review, see Wedemeyer et al., 2000).
Moreover, the disposition of cysteine residues relative to each other and relative to protein secondary structure is important in the classification of the structure of small disulfide-rich irregular proteins (Harrison and Sternberg, 1996).
In protein folding prediction, knowing the location of disulfide bridges can strongly reduce the search in conformational space (Skolnick et al., 1997; Huang et al., 1999). Therefore, the correct prediction of the disulfide connectivity starting from the protein residue sequence may also help in predicting its 3D structure.
A few studies have addressed the important problem of predicting the bonding state of cysteine in a protein chain. The correct prediction of this state can help in predicting ab initio the 3D structure of proteins by adding structural constraints and also in predicting the correct connectivity of disulfide bridges in the protein (Fariselli and Casadio, 2001). The relevance of the flanking residues in predicting a cysteine bonding state has been demonstrated using statistical methods (Fiser et al., 1992), neural networks (Muskal et al., 1990; Fariselli et al., 1999) and methods that combine local context and global information about protein sequences (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002).
In this paper, we present an approach based on hidden neural networks (HNN) that combines neural networks and hidden Markov models and outperforms all the existing methods.
| Materials and methods |
|---|
The database
A total of 4136 segments containing cysteines [free and disulfide bonded (half-cystines)] were taken from the crystallographic data of the Brookhaven Protein Data Bank. Disulfide bond assignment was based on the Define Secondary Structure of Proteins (DSSP) program (Kabsch and Sander, 1983).
Non-homologous proteins (with an identity value <25% and without chain breaks) were selected using the PAPIA system (Noguchi et al., 2001). Segments whose cysteines are inter-chain disulfide bonded are included as free cysteines in the database (34 segments from 27 monomeric chains). After this filtering procedure, the total number of proteins was 969, with 4136 cysteine-containing segments, 1446 of which were in the disulfide-bonded state and 2690 in the non-disulfide-bonded state. For each protein in our database, a profile based on a multiple sequence alignment was created using the BLAST program on the non-redundant dataset of sequences. The profiles obtained are used to create the neural network input.
During the training/testing phase, the database was split into 20 subsets (almost equally sized and distributed) in order to perform a 20-fold cross-validation. Moreover, in order to highlight the method accuracy better, the performance was evaluated using (i) the whole dataset of proteins (WD) and (ii) a reduced set (RD) in which chains containing only one cysteine are excluded.
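The 20-fold split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the fixed random seed and the placeholder protein identifiers are assumptions, and the sketch balances only the subset sizes, not the distribution of bonded and free cysteines across subsets.

```python
import random

def make_folds(protein_ids, n_folds=20, seed=0):
    """Partition protein identifiers into n_folds nearly equal subsets."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)  # fixed seed (assumption) for reproducibility
    # Striding through the shuffled list gives subset sizes differing by at most 1.
    return [ids[i::n_folds] for i in range(n_folds)]

def cross_validation_rounds(folds):
    """Yield (training_ids, test_ids), holding out each subset exactly once."""
    for i, test_ids in enumerate(folds):
        train_ids = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train_ids, test_ids

# Hypothetical identifiers standing in for the 969 chains of the database.
folds = make_folds([f"prot{i:04d}" for i in range(969)])
rounds = list(cross_validation_rounds(folds))
```

Splitting by protein rather than by segment keeps all cysteines of a chain in the same subset, which is what a per-protein evaluation requires.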
The PDB codes of the proteins whose cysteine-containing segments are included in the database, the 20-fold cross-validation lists and the training profiles are available at http://www.biocomp.unibo.it/piero/cyspred/cysdataset.tgz.
Measures of performance
The efficiency of the predictors is scored using the statistical indices defined as follows.
[Equation (1), not reproduced here]
The correlation coefficient C is defined as
[Equation (2), not reproduced here]
The accuracy for each discriminated structure s is evaluated as
[Equation (3), not reproduced here]
The probability of correct predictions P(s) is computed as
[Equation (4), not reproduced here]
Finally, the accuracy per protein is
[Equation (5), not reproduced here]
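Since the formulas for these indices are not available in this copy, the sketch below uses the conventional definitions of such indices, which are an assumption here and may differ in detail from the authors' equations: overall accuracy as the fraction of correct predictions, the Matthews-type correlation coefficient C(s), the per-state accuracy Q(s) (coverage), the probability of correct prediction P(s) (precision) and the per-protein accuracy as the fraction of chains with every cysteine correctly predicted. The function names and the labels "B" (bonded) and "F" (free) are hypothetical.

```python
import math

def state_metrics(observed, predicted, state="B"):
    """Per-cysteine indices for one state, from aligned label sequences."""
    pairs = list(zip(observed, predicted))
    p = sum(o == state and q == state for o, q in pairs)    # correct predictions of the state
    n = sum(o != state and q != state for o, q in pairs)    # correct rejections
    u = sum(o == state and q != state for o, q in pairs)    # under-predictions
    ov = sum(o != state and q == state for o, q in pairs)   # over-predictions
    denom = math.sqrt((p + u) * (p + ov) * (n + u) * (n + ov))
    return {
        "Q2": (p + n) / len(pairs),                    # overall accuracy
        "C": (p * n - u * ov) / denom if denom else 0.0,  # correlation coefficient
        "Q": p / (p + u) if p + u else 0.0,            # accuracy for the state
        "P": p / (p + ov) if p + ov else 0.0,          # probability of correct prediction
    }

def per_protein_accuracy(proteins):
    """Fraction of proteins whose cysteines are all predicted correctly.

    proteins: list of (observed_labels, predicted_labels) pairs, one per chain.
    """
    correct = sum(obs == pred for obs, pred in proteins)
    return correct / len(proteins)

m = state_metrics("BBFF", "BFFF")
q2prot = per_protein_accuracy([("BB", "BB"), ("BF", "BB")])
```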
Neural networks
Standard feed-forward neural networks are implemented with a back-propagation algorithm as learning procedure. The network architecture is similar to that used previously (Fariselli et al., 1999) and consists of a two-layer perceptron with two hidden neurons, two output nodes (discriminating the disulfide and free cysteine propensities, respectively) and an input layer that consists of 540 neurons (27 residue-long input window, with 20 profile values per residue). Owing to the limited number of examples currently available, an early learning stopping procedure was used to train the networks (Fariselli et al., 1999).
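As a rough illustration of how a 27-residue window over a sequence profile yields 540 inputs, the sketch below encodes a window centred on a cysteine and runs one forward pass through a network with two hidden and two output units. It is not the authors' implementation: zero padding beyond the chain ends, the absence of bias terms, the sigmoid activations and the random weights are all assumptions made for the example.

```python
import math
import random

WINDOW = 27     # residues per input window (from the text)
ALPHABET = 20   # profile values per residue, so 27 * 20 = 540 inputs

def encode_window(profile, center):
    """Flatten the profile rows of a window centred on position `center`."""
    half = WINDOW // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * ALPHABET)  # assumed zero padding off the chain ends
    return features

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, w_hidden, w_out):
    """Forward pass: inputs -> hidden units -> output units (no biases, for brevity)."""
    hidden = [sigmoid(sum(w * f for w, f in zip(row, features))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]

rng = random.Random(0)
profile = [[0.05] * ALPHABET for _ in range(30)]   # toy uniform profile, 30 residues
x = encode_window(profile, center=0)               # window centred on the first residue
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(len(x))] for _ in range(2)]
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
outputs = forward(x, w_hidden, w_out)              # [bonded score, free score]
```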
Hidden neural network
A vector-based HMM that can handle emission probability vectors is used on top of the neural networks described above. The hybrid system is defined as a hidden neural network, following Krogh and Riis (Krogh and Riis, 1999). A vector-based HMM, similar to that used in this paper, was recently developed and applied to the prediction of transmembrane β-barrel proteins (Martelli et al., 2002).
Briefly, if L is the number of cysteines in the protein and A is the size of the alphabet over which vectors are built (that is, A = 2, bonding and non-bonding/free cysteine states), we refer to this sequence vector with the notation
[Equation (6), not reproduced here]
The HMM for the specific problem at hand is composed of a Markov model with N states connected by means of the transition probabilities $a_{ij}$ (Figure 1). The probability density function for the emission of a vector from each state is determined by a number A of parameters that are specific to each state k and are indicated with the symbols $e_k(c)$ (with c = 1, 2, ..., A):
[Equation (7), not reproduced here]
Here $\pi_t$ is the tth state in the path and Z is the normalizing factor, with $\sum_c e_k(c) = 1$ [for further details, see Martelli et al. (Martelli et al., 2002)].
The vector $s_t$ is obtained directly from the neural network outputs as
[Equation (8), not reproduced here]
Training the HMM parameters is accomplished by using a modified expectation-maximization algorithm (Martelli et al., 2002). In order to keep the constraints imposed by the selected HMM model (Figure 1), the prediction of each cysteine is made using one protein at a time and by means of Viterbi decoding (Durbin et al., 1998).
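To illustrate how decoding one protein at a time can impose a global rule on per-cysteine network outputs, the sketch below runs a Viterbi-style dynamic program under a hypothetical grammar. The actual topology is the one in Figure 1, which is not reproduced here; the constraint assumed for this example is that the number of bonded cysteines in a chain must be even (inter-chain bonded cysteines being treated as free). The emission score of each label is taken to be the corresponding network output and all transitions are weighted equally; both are simplifying assumptions, as are the function name and the labels.

```python
import math

def decode_even_bonded(nn_outputs):
    """Best labelling with an even number of bonded ('B') cysteines.

    nn_outputs: list of (bonded_score, free_score) pairs, one per cysteine.
    Returns a list of 'B'/'F' labels maximizing the product of the chosen scores.
    """
    neg_inf = float("-inf")
    # best[parity] = (log score, labels) of the best path whose count of 'B' has that parity
    best = {0: (0.0, []), 1: (neg_inf, [])}
    for bonded, free in nn_outputs:
        log_b = math.log(max(bonded, 1e-12))  # floor avoids log(0)
        log_f = math.log(max(free, 1e-12))
        updated = {}
        for parity in (0, 1):
            # 'F' keeps the parity; 'B' arrives from the opposite parity.
            stay = (best[parity][0] + log_f, best[parity][1] + ["F"])
            flip = (best[1 - parity][0] + log_b, best[1 - parity][1] + ["B"])
            updated[parity] = max(stay, flip, key=lambda cand: cand[0])
        best = updated
    return best[0][1]  # only paths ending with an even count are allowed

labels = decode_even_bonded([(0.9, 0.1), (0.6, 0.4), (0.55, 0.45)])
single = decode_even_bonded([(0.9, 0.1)])
```

In the three-cysteine example, labelling each residue independently would call all three bonded; under the assumed even-count rule the least confident call is flipped to free instead.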
| Results and discussion |
|---|
The NN-based predictor is to be considered as the basic component of the hybrid system. Its accuracy compares well with that previously obtained with a similar method (Fariselli et al., 1999).
When the NN is integrated with the HMM and the HNN method is tested, the results are indeed improved (see Table II).
The improvement obtained with the HNN method compared with NN is seemingly due to the introduction of global rules defined by the regular grammar implemented in the HMM (Figure 1).
Remarkably, the accuracy obtained on a per-protein basis increases to 80.2% for the difficult set (RD) and to 84.0% for the entire database (Q2prot in Table II).
Even though it is difficult to compare methods tested on different databases, it can be claimed that the accuracy obtained with HNN is greater than that previously obtained with other methods that also incorporate global protein rules (Fiser and Simon, 2000; Mucchielli-Giorgi et al., 2002). The method implemented by Fiser and Simon is based on a simple majority rule and reaches an accuracy of 82% when predicting the disulfide bonding state of cysteines on a small set of proteins comprising 81 chains; that of Mucchielli-Giorgi et al. makes use of global protein descriptors and scores as high as 84% for the same task on 869 chains. The higher accuracy (88%) obtained with HNN on 969 chains is probably due to the higher flexibility of our system in capturing features of the sequences essential for the prediction of the cysteine bonding state.
In conclusion, it has been shown that a hybrid system combining local with global information outperforms previously developed methods to solve the same task, confirming that for the problem at hand a crucial step forward can be made only when global features of the protein chains are taken into consideration.
| Notes |
|---|
1 To whom correspondence should be addressed. E-mail: casadio{at}alma.unibo.it
| Acknowledgments |
|---|
This work was partially supported by a grant from the Ministero della Università e della Ricerca Scientifica e Tecnologica (MURST) for the project Hydrolases from Thermophiles: Structure, Function and Homologous and Heterologous Expression, a grant for a target project in Biotechnology and a project on Molecular Genetics, both from the Italian Centro Nazionale delle Ricerche (CNR), to R.C. R.C. also acknowledges an EC grant, Biowulf IST 1999-20232, for supporting the development of DNCBLAST, a parallelized version of PSI-BLAST for PC nets. P.L.M. is the recipient of a fellowship from the Italian National Institute of Biostructures and Biosystems (INBB).
| References |
|---|
Betz,S.F. (1993) Protein Sci., 2, 1551-1558.
Casadio,R., Compiani,M., Fariselli,P. and Vivarelli,F. (1995) Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 81-88, and references therein.
Creighton,T. (1996) Proteins: Structures and Molecular Properties. Freeman, San Francisco.
Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Fariselli,P. and Casadio,R. (2001) Bioinformatics, 17, 957-964.
Fariselli,P., Riccobelli,P. and Casadio,R. (1999) Proteins, 36, 340-346.
Fiser,A. and Simon,I. (2000) Bioinformatics, 16, 251-256.
Fiser,A., Cserzo,M., Tudos,E. and Simon,I. (1992) FEBS Lett., 302, 117-120.
Freire,E. (1993) Arch. Biochem. Biophys., 303, 181-184.
Harrison,P.M. and Sternberg,M.J.E. (1994) J. Mol. Biol., 244, 448-463, and references therein.
Harrison,P.M. and Sternberg,M.J.E. (1996) J. Mol. Biol., 264, 603-623.
Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) J. Mol. Biol., 290, 267-281.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577-2637.
Krogh,A. and Riis,S.K. (1999) Neural Comput., 11, 541-563.
Martelli,P.L., Fariselli,P., Krogh,A. and Casadio,R. (2002) Bioinformatics, 18, S1, 46-53.
Mucchielli-Giorgi,M.H., Hazout,S. and Tuffery,P. (2002) Proteins, 46, 243-249.
Muskal,S.M., Holbrook,S.R. and Kim,S.H. (1990) Protein Eng., 3, 667-672.
Noguchi,T., Matsuda,T.H. and Akiyama,Y. (2001) Nucleic Acids Res., 29, 219-220.
Privalov,P.L. and Gill,S.J. (1988) Adv. Protein Chem., 39, 191-234.
Skolnick,J., Kolinski,A. and Ortiz,A.R. (1997) J. Mol. Biol., 265, 217-241.
Wedemeyer,W.J., Welker,E., Narayan,M. and Scheraga,H.A. (2000) Biochemistry, 39, 4207-4216.
Received May 28, 2002; accepted October 10, 2002.