Protein Engineering, Vol. 12, No. 1, 15-21,
January 1999
© 1999 Oxford University Press
A neural network based predictor of residue contacts in proteins
1 Biocomputing Group (Centro Interdipartimentale per le Ricerche Biotecnologiche) and 2 Laboratory of Biophysics, Department of Biology, University of Bologna, Via Irnerio, 42, 40126 Bologna, Italy
| Abstract |
|---|
|
|
|---|
We describe a method based on neural networks for predicting contact maps of proteins using as input chemico-physical and evolutionary information. Neural networks are trained on a data set comprising the contact maps of 200 non-homologous proteins of well resolved three-dimensional structures. The systems learn the association rules between the covalent structure of each protein and its correspondent contact map by means of a standard back propagation algorithm. Validation of the predictor on the training set and on 408 proteins of known structure which are not homologous to those contained in the training set indicate that this method scores higher than statistical approaches previously described and based on correlated mutations and sequence information.
Keywords: neural networks/protein structure prediction/contact maps/multiple sequence information
| Introduction |
|---|
|
|
|---|
Contacts between protein residues constrain protein folding and characterize different protein structures. Therefore prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein folding recognition and de novo design. Provided that residue contacts are known for a protein sequence, the major features of its three-dimensional (3D) structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. The prediction of contacts between residues requires the study of the protein inter-residue distances related to the specific type of amino acid pair. The distribution of pair residue distances in proteins of known 3D structure was first computed to derive mean force potentials with the aim of tackling the folding and the inverse folding problem (Miyazawa and Jernigan, 1985; Sippl, 1990
In this paper we address the question whether a neural network system is capable of learning the correlation between the residue covalent structure of a protein and its contact map as it can be computed from its known 3D structure. It has been shown that neural networks are efficient tools in solving several kind of problems, including predictions of secondary structures (Rost and Sander, 1995
), transmembrane helices (Rost et al., 1996
) and folding initiation sites (Compiani et al., 1998
). Neural networks have been used in the pioneer work by Bohr et al. (1990) to predict distances between amino acids of homologous protein sequences and also to represent knowledge based potentials (Grossman et al., 1995
). Recently neural networks have been successfully applied to predict whether distances between pairs of amino acids are above or below a given variable threshold (Lund et al., 1997
).
Our neural network system is trained to learn the correlation between protein sequences and contact maps of a set of non-homologous proteins of known structure including 200 chains. With a cross-validation procedure on the training set and blind tests on another 408 proteins never seen before by the network, our results indicate that the neural network-based predictor achieves on average an improvement over a random predictor of a factor of 6, which is higher than that previously obtained by other methods developed to predict residue contacts (Thomas et al., 1996
; Olmea and Valencia, 1997
). Although the predictive accuracy obtained with neural networks is still far from being successful, it is however rather independent of the protein size, different from that previously obtained by combining correlated mutations and sequence conservation (Olmea and Valencia 1997
). Our results are obtained by introducing a new coding procedure based on `coupled residue' representation.
| Materials and methods |
|---|
|
|
|---|
Database
Neural networks were trained and tested on a database of proteins selected from the Protein Data Bank (PDB; Bernstein et al., 1977
) using the PDB_select algorithm (Hobohm et al., 1992
). For the training phase proteins with an identity value <25% were extracted from the PDB_select_oct_1996 file (http://www.embl-heidelberg.de). This set was then reduced by excluding those chains whose backbone is interrupted. Moreover in order to avoid false contacts due to the presence of hetero-atoms, only those proteins without ligands in the PDB files were kept in the training set (set LRN in Table I
). A full cross-validation procedure was routinely applied on the training set (LRN), by grouping 25 proteins (selected at random) in eight subsets. One subset at a time was used as a testing set and the network was trained on the remaining seven. Furthermore, in order to test the generalization capabilities of our predictors two more sets of sequences (whose sequence identity is <25%) were also used as testing sets. One set included those proteins listed in the PDB_select oct_1996 file and not included in the training set (namely 225 proteins containing ligands in the PDB files; COF in Table I
). The other testing set comprised 184 chains, selected from PDB_ select_oct_1997 and not present in the previous release (TS97 in Table I
).
|
Contact map generation and contact analysis
Given a protein of Lp residues, its contact map S is evaluated as a symmetric matrix consisting of LpxLp elements whose values are set to 1 if there is a contact between two residues and to 0 otherwise. In order to capture the physical properties of the contacts, the contact maps of all the sequences were generated taking into account the whole set of coordinates of the heavy atoms of the protein (all atoms except hydrogen). In this way a contact is assigned if the minimal distance between heavy atoms belonging to the side chain or to the backbone of two residues is <4.5 Å. This threshold value is chosen in order to guarantee that there is neither a third residue nor a water molecule between two amino acids in contact and it is equal to that previously used in other studies (Hinds and Levitt, 1992
; Mirny and Domany, 1996
). In order to prevent spurious short range contacts (Thomas et al., 1996
), our procedure does not include contacts between residues whose sequence separation is less than four residues. The training and testing sets of protein contact maps comprise 608 examples, evaluated from our database. These maps contain on average a number of contacts (Nc) much lower than the number of non-contacts (NNC) (Nc/NNC equals about 1/60). This is due to the different scaling factor of Nc and NNC. Also, as pointed out by others (Vendruscolo et al., 1997
), NC increases almost linearly with protein sequence length (data not shown). For this reason NNC increases with the square of the protein length.
Another important point to shed light on is the distribution of residue contacts. This is also necessary for determining the most appropriate input coding to the neural network system. By plotting the frequency of contacts in our protein database as a function of the residue sequence separation, it appears that the contact distribution pattern is far from uniform (Figure 1
), and that contacts cluster predominantly in the short and middle range of the residue sequence separation.
|
Neural network architecture
Our predictor is a classical feed-forward neural network trained with a standard back-propagation algorithm (Rumelhart et al., 1986
) to associate protein single sequences to their corresponding contact maps. We implemented a network topology which was a reasonable trade off between the size of the data set (about 6x106 examples of contacts and non contacts for 200 proteins) and the contact representation among residues. After preliminary experiments (results not shown), we found that the best architecture in terms of computational expedience was that characterized by one output neuron representing the contact propensity, one hidden layer containing two neurons and one input layer with different number of neurons depending on the amount of information encoded. We implemented five different networks of increasing input complexity. All networks used contain two input neurons, representing respectively the normalized residue gap [the length of the sequence separation divided by 1000 (maximum of protein length in residues)] and the normalized sequence separation (the sequence length divided by 1000) between the couple of residues whose contact propensity is under examination. This was done in order to take into account the properties of each contact map (see the previous paragraph and Figure 1
). When this information was not encoded, a dramatic drop of the network performance was noticed, especially for large proteins (data not shown). Major characteristics of the different networks based on the above detailed architecture are listed below.
Net1 contains 210 input neurons representing all the possible ordered couples of residues, considering that each residue couple and its symmetric are coded in the same way. This representation reduces the number of weight junctions from 20x20 to 20x(20+1)/2 input neurons. The input neuron representing the ordered couple of amino acidic residues at positions i and j is set to 1, while the remaining 209 are set to 0. The total number of input neurons of Net 1 is then 212, including also the two other neurons associated respectively to the protein length and sequence separation.
Net2 as compared with Net1 contains two more input neurons in order to take also into account the hydrophobic neighbourhood of the residue couples. Hydrophobicity values are computed according to a hydrophobicity scale (Rose et al., 1985
), by averaging over a window of seven contiguous residues. Net2 is provided with 214 input neurons.
Net3 is implemented to encode evolutionary information, by considering the alignment from the corresponding HSSP files (Sander and Schneider, 1991
). As depicted in Figure 2
, all the possible couples generated by residues in positions i and j of the different aligned sequences are counted. After normalization to the number of sequences, the frequencies of occurrence in the alignment of each couple is used in the corresponding position of the 210 element input vector representing all the possible ordered couples. By this, the 210 element vector may have more than one component activated. Two more neurons are then added to take into account the conservation weights of the positions i and j in the alignment. As in the case of Net2, Net3 also codes for the hydrophobic neighbourhood of the residues i and j. Thus, the final number of input neurons is 216.
|
With Net4, the sequence context of each residue is weighted by taking a window centred in the residue under investigation, with a length of three amino acids. Both parallel and antiparallel pairings of the two segments centred in i and j is considered. This leads to the coding of the couples formed by the residues in positions {i1, j1},{i, j},{i+1, j+1} (parallel pairing) and {i1, j+1},{i, j},{i+1, j1} (antiparallel pairing). The five possible combinations ({i1, j1},{i, j},{i+1, j+1},{i1, j+1}, {i+1, j1}) of the ordered couples require 1050 (210x5) input neurons. Two more neurons were added for the hydrophobic neighbourhood for a total of 1054 input neurons.
Finally with Net5 we add to Net4 the evolutionary information with the same procedure described for Net3. In particular, all five couples are coded according to the evolutionary information, and as in Net3 two more neurons code for the conservation weights associated with the alignment positions i and j. In this case the final number of input neurons is 1056.
In order to compensate for the disproportion between contacts and non-contacts during the training phase, learning was accomplished by means of a procedure including a balancing probability factor to reduce the number of back-propagation cycles for the most abundant class (NNC) (Fariselli et al., 1993
).
Evaluation of the predictor accuracy
We are actually interested in the network capability of predicting residue contacts. Therefore accuracy (A) of the network performance is scored as
A = Ncp*/Ncp (1)
where Ncp* and Ncp are the number of correctly assigned contacts and that of total predicted contacts, respectively. However, the evaluation of accuracy (Eq. 1) requires an estimation of Ncp and this is described in the following. The network system uses a single output neuron whose outcomes are real numbers in the interval [0, 1], depending on their activation level. Therefore, for discriminating contacts from non-contacts a possible strategy is to order network outputs according to their distance from the maximum level of activation (1). Then, based on the observation that a protein is characterized by Nc contacts, only the top Nc outputs are labelled as contacts predicted by the network. By this, the number of predicted contacts is restricted to that actually present in the protein (Ncp = Nc). This procedure is similar to that previously adopted by authors using a statistical method to predict contact maps (Thomas et al., 1996
) and it has been chosen in order to compare our results to those already published. In our case, selecting only the Nc top outputs of the network is almost equivalent to setting the decision threshold value for the accepted contacts to >0.8 for each protein.
For comparison, accuracy of a random predictor (Ar) can be evaluated as
Ar = Nc/Np (2)
where Nc is the number of real contacts in the protein of length Lp, and Np are all the possible outcomes. Since the contact map S is symmetric and residues whose sequence gap is <4 are not included, Np is computed to be equal to (Lp 4)x(Lp 3)/2.
| Results |
|---|
|
|
|---|
The performances of Net1 is tested on different sets using a full cross-validation procedure. It was discussed before (Thomas et al., 1996
|
In order to test if the performance of the networks are robust with respect to the change of the testing sets two other sets comprising proteins not homologous to the previous ones are used: one contains only proteins with bonded hetero-atoms (COF) and the other includes proteins recently added to the protein data bank (TS97) (see Table I
Upon increasing the information content provided to the networks, the accuracy level also increases, as shown in Table III
for each subset of proteins of different length. In this table the performance of the different networks used are listed as a single average value over the whole database (Table I
). This is done averaging the values obtained after cross-validation on LRN (as described above) with the average value obtained from the eight training sets (from LRN) on the proteins comprised in the COF and TS97 test sets. The monotonic trend of the average accuracy of the different networks suggests that all the added information are useful for contact predictions (see the rightmost column of Table III
). Actually, the performance of Net2 increases from 0.144 to 0.147, when the hydrophobic propensity of the residues is taken into account; moreover a further improvement is noticeable when multiple sequence information is added. In this case the performance of Net3 increases up to 0.15. With Net4, the sequence context information is introduced, by taking into account explicitly a three residue-long window centred on the residue whose contact is to be predicted. The final average value is not very different from that obtained with Net3 (0.151 with respect to 0.150). The results indicate that the evolutionary information and the sequence context information similarly contribute to the increase in the predictive performance of the method. This is also evident by considering that for proteins of small size (protein whose length is shorter than 170 residues) Net3 performs better than Net4 (Table III
). For larger proteins, however, Net4 outdoes Net3. In conclusion, it appears that the information is complementary, and that the combination of both evolutionary and sequence context information further increases the accuracy, as demonstrated by the fact that the accuracy of Net5 reaches 0.160 on the whole set of proteins.
|
The accuracy obtained using Net5 for the individual proteins is shown in Figure 3
170) than for large size proteins (Lp > 170) and that the accuracy average value is rather independent of the chain length.
|
An example of the output of the Net5 predictor is shown in Figure 4
|
In Figure 5
|
| Discussion |
|---|
|
|
|---|
Feed-forward neural networks trained with the back propagation algorithm are ideally suited for investigation of the correlation between input and output data, if one exists. In this paper we show that it is possible to predict contact maps of proteins starting from the residue sequence with an accuracy which is comparable with that of other statistical methods. However the advantage of neural networks with respect to the statistical methods previously used stems from the high degree of flexibility of this system, which encodes in the connection weights the complex rules of association between sequence and structure depending on the different input coding procedures. Indeed in this paper much effort is devoted to determine the most appropriate input coding for the problem at hand. We introduce a new coding procedure in that residues are coded as couples in the input vectors. By this residue pairs are presented to the networks and the corresponding connection weights are updated only when the very same couple is submitted to the system. This procedure partially avoids self averaging of the residue contacts and prevents the loss of couple specificity. Furthermore, the input code also includes two additional nodes, coding for the protein sequence length and the sequence separation between residues in contact. Apparently, this additional information, which accounts for the characteristics of each contact map, makes contact prediction rather independent of the protein size. This is noticeable when the results are grouped according to protein size and it explains why neural networks are better for large proteins than previous methods.
With neural networks it is also easy to change and increase the amount of information in the coding input. Variables such as hydrophobicity of the environment, sequence context and evolutionary information are easily fed to the networks and their effect can be investigated. One interesting result is that we show that all these variables when taken into consideration separately improve the accuracy of the predictor in a stepwise manner. When the performance is analysed by grouping the proteins by chain length it appears that evolutionary information is more relevant for predicting proteins whose chain length is <170 residues. For longer chains the sequence context seems to play a role. Indeed when all the variables are taken into consideration simultaneously the best predictor is obtained.
The generalization capability of the neural network-based predictor can be compared with that of other methods previously described and based on approaches different from ours. The best performances described so far were obtained with a statistical and a joint method, respectively (Thomas et al., 1996
; Olmea and Valencia, 1997
). The first approach achieves an accuracy level of 0.13 (five times higher than a random predictor) in predicting contact maps generated from computing CßCß distances with a threshold value for the contacts set to 8.0 Å. However, the accuracy of the method drops below 0.1 when contact maps generated taking into account all-atom representation (unfortunately with an unspecified contact threshold) are predicted. Even though our contact maps computed from all atom interactions cannot be directly compared with those previously described (due to the unspecified contact threshold) it is however evident from our results that Net5 achieves an accuracy level of at least 6 percentage points higher (<A> = 0.160 and 6 times higher than that of a random predictor, Table III
).
The second approach focuses on the prediction on the CßCß contacts and the accuracy is improved in different ways (Olmea and Valencia, 1997
). The most successful one adds to the correlated mutation information, the statistical propensities derived from the different number of contacts that the different residues occupy in the context of the three most representative secondary structure types (
-helices, ß-strands and random coils). In this way a filter procedure is generated, which takes into account the secondary structure labels for the different residues from the DSSP files (Kabsch and Sander, 1983
); moreover, for each residue predicted contacts are accepted provided that their number is less than the average number of contacts per residue in a given structural type. The result is to lower the final number of the predicted contacts by 25%, increasing remarkably the performance accuracy. In order to compare our predictions with that obtained with this method, we consider therefore only the first 75% of the real contacts (Figure 5
). The final results are reported in Table IV
, where the occupancy method (Occ; Olmea and Valencia, 1997
) is compared with the filtered Net5 (Net5_75 in Table IV
). From the data, it is evident that the occupancy method performs similarly to our predictor for small proteins. For proteins of large size (>100 residues), neural networks perform with an accuracy which is better than other methods and comparable only to that obtained with Occ. However, it should be remembered that in our case the prediction includes the all atom representation of the protein and is therefore more detailed.
|
In conclusion, this paper demonstrates that neural networks can predict protein contacts with an accuracy of 0.16 starting from the residue sequence and not including correlated mutations. With this approach, prediction is improved as compared with that of a random predictor by a factor of 6, which is higher than that previously evaluated with other methods. This level of accuracy is still not sufficient for protein prediction. However, in terms of information content our approach is `orthogonal' (not including correlated mutations) to those already reported, and it can be used in combination with them in order to increase the predictive performance of protein contacts.
| Acknowledgments |
|---|
This work was supported partially by a grant for a target project in Biotechnology from the Italian Centro Nazionale per le Ricerche (C.N.R.) and by a grant from Ministero della Università e della Ricerca Scientifica e Tecnologica (MURST) delivered to the project `Biocatalisi e Bioconversioni'.
| Notes |
|---|
3 To whom correspondence should be addressed
| References |
|---|
|
|
|---|
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535542.[Web of Science][Medline]
Bohr,H., Bohr,J., Brunak,S., Cotterill,R., Fredholm,H., Lautrup,B. and Petersen,S.B. (1990) FEBS Lett., 261, 4346.
Compiani,M., Fariselli,P., Martelli,P. and Casadio,R. (1998) Proc. Natl Acad. Sci. USA, 95, 92909294.
Fariselli,P., Compiani,M. and Casadio,R. (1993) Eur. Biophys J., 22, 4151.[Web of Science][Medline]
Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309317.[Web of Science][Medline]
Grossman,T., Farber,R. and Lapedes,A. (1995) ISMB, 3 154161.
Hinds,D.A and Levitt,M. (1992) Proc. Natl Acad. Sci. USA, 89, 25362540.
Hobohm,U., Scharf,M., Schneider,P. and Sander,C. (1992) Protein Sci., 1, 409417.[Web of Science][Medline]
Huang,E.S., Subbiah,S. and Levitt,M. (1985) J. Mol. Biol., 252, 709720.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 25772637.[Web of Science][Medline]
Lund,O., Frimand,K., Gorodkin,J., Bohr,H., Bohr,J., Hansen,J. and Brunak,S. (1997) Protein Engng, 10, 12411248.
Mairov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876888.[Web of Science][Medline]
Miyazawa,S. and Jerningam,R.L. (1985) Macromolecules, 18, 534552.[Web of Science]
Mirny,L.A. and Shaknovich,E.I. (1996) J. Mol. Biol., 264, 11641179.[Web of Science][Medline]
Mirny,L. and Domany,E. (1996) Proteins, 26, 391410.[Web of Science][Medline]
Olmea,O. and Valencia,A. (1997) Fold. Des., 2, S25-32.[Web of Science][Medline]
Rose,G.D., Gesolowittz,A.R., Lesser,G.J., Lee,R.H. and Zehfus,M.H. (1985) Science, 229, 834838.
Rost,B. and Sander,C. (1995) Proteins, 23, 295300.[Web of Science][Medline]
Rost,B. Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 17041718.[Web of Science][Medline]
Rumelhart,D.E., Hinton,G.E., Williams,R.J. (1986) Nature, 323, 533536.[Web of Science]
Sander,C. and Schneider R. (1991) Proteins, 9, 5668.[Web of Science][Medline]
Selbig,J. and Argos,P. (1998) Proteins, 31, 172185.[Web of Science][Medline]
Shindyalov,I.,N., Kolchanov,N.A. and Sander,C. (1994) Protein Engng, 7, 349358.
Sippl,M.J. (1990) J. Mol. Biol., 213, 859883.[Web of Science][Medline]
Thomas,D.J., Casari,G. and Sander,C. (1996) Protein Engng, 9, 941948.
Taylor,W.R. and Hatrick,K. (1994) Protein Engng, 7, 341348.
Vendruscolo,M., Kussel,E. and Domany,E. (1997) Fold. Des., 2, 295306.[Web of Science][Medline]
Received July 14, 1998; revised September 29, 1998; accepted October 7, 1998.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
A. N. Tegge, Z. Wang, J. Eickholt, and J. Cheng NNcon: improved protein contact map prediction using 2D-recursive neural networks Nucleic Acids Res., July 1, 2009; 37(suppl_2): W515 - W518. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. K. Ho, M. J. Kuiper, and R. Kotagiri PConPy--a Python module for generating 2D protein maps Bioinformatics, December 15, 2008; 24(24): 2934 - 2935. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Wu and Y. Zhang A comprehensive assessment of sequence-based and template-based methods for protein contact prediction Bioinformatics, April 1, 2008; 24(7): 924 - 931. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Punta and B. Rost PROFcon: novel prediction of long-range contacts Bioinformatics, July 1, 2005; 21(13): 2960 - 2968. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. A. Afonnikov and N. A. Kolchanov CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences Nucleic Acids Res., July 1, 2004; 32(suppl_2): W64 - W68. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. C. Saraf, G. L. Moore, and C. D. Maranas Using multiple sequence correlation analysis to characterize functionally important protein regions Protein Eng. Des. Sel., June 1, 2003; 16(6): 397 - 406. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Fariselli, O. Olmea, A. Valencia, and R. Casadio Prediction of contact maps with neural networks and correlated mutations Protein Eng. Des. Sel., November 1, 2001; 14(11): 835 - 843. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




= 0.06, for Lp > 170 <A> = 0.15 and 



