PEDS Advance Access originally published online on August 19, 2004
Protein Engineering Design and Selection 2004 17(7):565-570; doi:10.1093/protein/gzh065
Filtering remote homologues using predicted structural information
1Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192 and 3CCSE, Japan Atomic Energy Research Institute, 81, Umemidai, Kizu-cho, Souraku, Kyoto 619-0215, Japan
2 To whom correspondence should be addressed. E-mail: takawaba{at}is.naist.jp
| Abstract |
|---|
Finding homologues for a given protein plays a major role in predicting the protein's structure and function. However, it is still difficult to find remote homologues with low sequence similarity, even with advanced sequence search methods. We propose a simple filtering method that uses predicted structural information, pertaining to secondary structures and solvent accessibilities. It filters the more promising homologues from the many candidate proteins obtained by PSI-BLAST with a less stringent threshold E-value. The final decision is made by a simple linear discrimination method, considering the E-value of PSI-BLAST and the statistical significance scores of structural matches. An in-house neural network program is used for the prediction of secondary structures and solvent accessibilities for both the query and library proteins. The performance of our filtering method was evaluated by the cross-validation method, using the SCOP superfamily relationship as the correct standard. Coverage-reliability plots show that our filtering method clearly improves the performance of PSI-BLAST. The secondary structure improves PSI-BLAST more than the solvent accessibilities do, but the combination of these two features with PSI-BLAST leads to the best result. The advantage of our method is its easy implementation with fewer parameters to be tuned and faster computation. We also discuss its performance with predicted and observed secondary structures.
Keywords: PSI-BLAST/remote homologue detection/secondary structure prediction/solvent accessibility prediction
| Introduction |
|---|
The detection of homologues for a given protein sequence is an essential step for predicting its tertiary structure and function. The demand for homologue detection has increased as a result of the huge number of protein sequences generated from genome sequencing projects. Since the 1990s, great efforts have been made to develop sensitive methods to detect increasingly remote homologues, based on ideas such as profile (Gribskov et al., 1987
In this study, following Geourjon et al.'s work, we developed a similar filtering method to choose more likely homologues from the many candidates obtained by PSI-BLAST, using predicted structural information. Compared with Geourjon et al.'s original work, we introduced a few refinements to the method. First, we employed a statistical significance score (Z-score) for structure matching. Second, in addition to the secondary structure predictions, we also used the solvent accessibility predictions. Third, we used a simple linear discrimination method, which combines the E-value and the structural matching score.
| Materials and methods |
|---|
Dataset
To evaluate the performance of our methods, we used 3605 representative protein domains in the SCOP database (version 1.63) (Murzin et al., 1995) with sequence identities of 30% or less. The domains of classes 6 (membrane and cell surface proteins and peptides), 8 (coiled coil proteins), 9 (low-resolution protein structures), 10 (peptides), 11 (designed proteins) and those with length <40 residues were removed from the representative list, because of their specific nature of evolution. The family and superfamily relationships defined in SCOP were considered to be the correct homologous relationships.
Overview of the method
Figure 1 shows an overview of our method. The secondary structures and solvent accessibilities for the query and library sequences are predicted. PSI-BLAST (Altschul et al., 1997) is performed for the query protein against the SCOP representative sequence database and it outputs the homologue candidates with large threshold E-values and their alignments with the query sequence. Matching scores between the query and library predicted secondary structures/solvent accessibilities are calculated on the alignments. The final decision is made by considering the PSI-BLAST E-values and the structural matching scores.
The procedure for performing PSI-BLAST actually consists of two steps. In the first step, homologues of the query proteins are collected in the NR database downloaded from NCBI, with the maximum number of iterations = 5 and the threshold E-value = 0.001. After convergence, the final position-specific score matrices (PSSM) are saved. In the second step, using the PSSM matrix as the query, the homologue candidates in the representative SCOP database are searched with threshold E-values = 100.
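The two-step search can be sketched as a pair of command lines. This is only an illustration: it assumes the NCBI BLAST+ `psiblast` executable rather than whichever PSI-BLAST release was used in this study, it reads the first-step threshold of 0.001 as the profile inclusion threshold, and the function name and all file and database names are hypothetical.

```python
from typing import List, Tuple


def psiblast_commands(query_fasta: str, nr_db: str, scop_db: str,
                      pssm_file: str, hits_file: str) -> Tuple[List[str], List[str]]:
    """Build argument lists for a two-step PSI-BLAST search (illustrative only)."""
    # Step 1: iterate against the large database and save the resulting PSSM.
    build_pssm = [
        "psiblast", "-query", query_fasta, "-db", nr_db,
        "-num_iterations", "5",           # maximum number of iterations
        "-inclusion_ethresh", "0.001",    # assumed meaning of "threshold E-value"
        "-out_pssm", pssm_file,
    ]
    # Step 2: restart from the saved PSSM and report hits up to a loose E-value.
    search_library = [
        "psiblast", "-in_pssm", pssm_file, "-db", scop_db,
        "-evalue", "100",
        "-out", hits_file,
    ]
    return build_pssm, search_library


# Hypothetical file names; pass each list to subprocess.run() to execute it.
step1, step2 = psiblast_commands("query.fa", "nr", "scop_rep", "query.pssm", "hits.txt")
```

Keeping the loose E-value cut-off only in the second step means the profile itself is built from confident hits, while the candidate list handed to the filter stays deliberately permissive.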
Secondary structure prediction
An in-house program for secondary structure prediction was developed, using the standard neural network algorithm (Rumelhart et al., 1986; Qian and Sejnowski, 1988; Rost and Sander, 1993; Jones, 1999). The output secondary structure included three states [helix (H), strand (E) and others (C)]. The correct secondary structures were defined by the DSSP program (Kabsch and Sander, 1983). We employed the network architecture proposed by Jones (1999), which is composed of two three-layered networks (cascaded network). The first network used PSI-BLAST PSSM as the input and generated preliminary predictions. The second network used the prediction of the first network as the inputs and yielded the final prediction. The input size was 13 residues and the number of hidden units was 30, for both the first and second networks. The program's performance was evaluated by the 7-fold cross-validation (Rost and Sander, 1993), using the SCOP representative dataset. The prediction accuracy of our method was Q3 = 76.18%, which was the percentage of residues with correctly predicted three-state secondary structures, against all of the residues.
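As a small illustration of the accuracy measure, the sketch below computes the percentage of positions whose predicted state agrees with the observed state. The function name and the two example strings are hypothetical and are not taken from the dataset used here.

```python
def q_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    if not predicted:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Toy three-state strings (H = helix, E = strand, C = other); one mismatch in 13.
predicted_states = "CHHHHHCCEEEEC"
observed_states = "CHHHHCCCEEEEC"
q3 = q_score(predicted_states, observed_states)
```

The same function gives Q2 when the strings use the two accessibility states instead of the three secondary structure states.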
Solvent accessibility prediction
A neural network for solvent accessibility prediction was also developed, which outputs two states of accessibility: exposed (e) or buried (b). The correct answer was defined using the accessible surface area calculated by the DSSP program (Kabsch and Sander, 1983). If the value of the accessible surface area was greater than 15% of the value for the standard extended conformation, then its accessibility was defined as exposed; otherwise, buried. After trials of various kinds of neural network architectures, we found that the second network was not effective for accessibility prediction. Finally, we employed the network with 13-residue PSSM inputs, with no hidden layer or second network. The accuracy of our network was Q2 = 73.74%, evaluated by the 7-fold cross-validation.
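The two-state assignment can be sketched as a single comparison. The function name is hypothetical, and the reference area of 200 used in the example is a made-up placeholder, not a value from this study; real use would need a per-residue-type table of extended-conformation areas.

```python
def accessibility_state(asa: float, reference_asa: float,
                        threshold: float = 0.15) -> str:
    """Return 'e' (exposed) if the relative accessibility exceeds the threshold, else 'b'."""
    if reference_asa <= 0:
        raise ValueError("reference accessible surface area must be positive")
    return "e" if asa / reference_asa > threshold else "b"


# Placeholder reference area of 200 (arbitrary units) for illustration only.
exposed_example = accessibility_state(40.0, 200.0)   # relative accessibility 0.20
buried_example = accessibility_state(10.0, 200.0)    # relative accessibility 0.05
```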
Z-score for structure matching
Based on the PSI-BLAST alignment, the degree of structural matching was measured for the three-state secondary structures and the two-state solvent accessibilities. Figure 2 shows an example of the secondary structure correspondence on the PSI-BLAST sequence alignment. We assumed that homologous protein pairs have more structural matches than non-homologous pairs. Measuring structural matches for this purpose is not a trivial problem. The structural identity Q-value (Q3 or Q2) may be the simplest way of measuring structural matches. The Q-value is defined as the number M of residue pairs with the same structure divided by the number N of compared residues in the alignment between the query and the subject protein. However, such a Q-value tends to be high for a short alignment, even for non-homologous pairs. To solve this problem, Geourjon et al. (2001) excluded the sequence pairs with <100 aligned residues from their datasets and evaluated the structural matching by the Sov score (Rost et al., 1994; Zemla et al., 1999). Instead of simply excluding short alignments, we introduced the following Z-score for evaluating the statistical significance of matching structures, against random matches given by the binomial distribution:
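$$Z = \frac{M - Np}{\sqrt{Np(1 - p)}} \qquad (1)$$
where M is the number of matching residue pairs, N is the number of compared residues and p is the probability of a random match.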
![]() | (2) |
and p for accessibility is
. For the pairs with the same Q-value, the Z-score is proportional to the square root of the compared residues, N. This property of the Z-score is helpful for excluding non-homologous pairs with a short alignment.
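The square-root behaviour can be checked numerically. The sketch below assumes the usual binomial form Z = (M - Np)/sqrt(Np(1 - p)); the function name is hypothetical and the random-match probability of 0.4 is an arbitrary illustrative value, not one derived from the data.

```python
import math


def structure_match_z(matches: int, compared: int, p_random: float) -> float:
    """Z-score of the observed match count against a binomial random model."""
    if compared <= 0:
        raise ValueError("number of compared residues must be positive")
    if not 0.0 < p_random < 1.0:
        raise ValueError("random match probability must lie strictly between 0 and 1")
    expected = compared * p_random
    std_dev = math.sqrt(compared * p_random * (1.0 - p_random))
    return (matches - expected) / std_dev


# Two alignments with the same Q-value (60% matches) but different lengths.
z_short = structure_match_z(15, 25, 0.4)     # N = 25
z_long = structure_match_z(60, 100, 0.4)     # N = 100
```

With the alignment four times longer, the Z-score doubles, so a short alignment needs a much higher fraction of matches to reach the same significance.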
Linear discrimination using centroid vectors
A simple linear discrimination method using centroid vectors was introduced to make a final decision by considering several features, such as the E-value of PSI-BLAST and the Z-score from secondary structure/accessibility prediction. The final score of the linear discrimination method is the inner product S, between the input feature vector x and the projection vector w:
$$S = \mathbf{w} \cdot \mathbf{x} \qquad (3)$$
![]() | (4) |
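A minimal sketch of centroid-based linear discrimination follows. It assumes that w is the plain difference between the mean feature vector of homologous pairs and that of non-homologous pairs, with no scaling or normalization; the feature choice (minus log10 of the E-value, then a Z-score), the function names and the toy numbers are all illustrative assumptions.

```python
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length feature vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def projection_vector(homologous: Sequence[Sequence[float]],
                      non_homologous: Sequence[Sequence[float]]) -> List[float]:
    """Vector pointing from the non-homologous centroid to the homologous centroid."""
    mean_h = centroid(homologous)
    mean_n = centroid(non_homologous)
    return [h - n for h, n in zip(mean_h, mean_n)]


def discrimination_score(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product of the projection vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy training pairs: features are (-log10 E-value, Z-score); values are made up.
homologous_pairs = [[5.0, 6.0], [3.0, 4.0]]
non_homologous_pairs = [[0.0, 1.0], [-2.0, 1.0]]
w = projection_vector(homologous_pairs, non_homologous_pairs)
score_like_homologue = discrimination_score(w, [4.0, 5.0])
score_like_non_homologue = discrimination_score(w, [-1.0, 1.0])
```

Because only two centroids have to be estimated, the discriminator has no tunable parameters beyond the final score threshold.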
Coverage-reliability plot
To evaluate the abilities of the various detection methods, coverage-reliability plots were generated (Kawabata and Nishikawa, 2000). Coverage and reliability are defined as follows:
![]() | (5) |
![]() | (6) |
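The way such a plot is traced out can be sketched as follows. The sketch assumes the common definitions, which may differ in detail from those used here: coverage is the number of detected true homologous pairs divided by the number of all true homologous pairs, and reliability is the number of detected true pairs divided by the number of all detected pairs. The function name and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def coverage_reliability_curve(scored_pairs: Sequence[Tuple[float, bool]],
                               total_true: int) -> List[Tuple[float, float]]:
    """One (coverage, reliability) point per threshold, sweeping from strict to loose.

    scored_pairs holds (score, is_homologous); a higher score means more confident.
    """
    if total_true <= 0:
        raise ValueError("total number of true pairs must be positive")
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    curve = []
    true_found = 0
    for detected, (_, is_true) in enumerate(ranked, start=1):
        if is_true:
            true_found += 1
        curve.append((true_found / total_true, true_found / detected))
    return curve


# Toy ranking: three of the four true pairs appear among five candidates.
toy_pairs = [(9.0, True), (7.5, True), (6.0, False), (4.0, True), (1.0, False)]
curve = coverage_reliability_curve(toy_pairs, total_true=4)
```

True pairs that never appear among the candidates still count in the denominator of coverage, which is why a filter applied to PSI-BLAST output cannot exceed the coverage of the candidate list itself.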
Availability of software
Our software is available through a Web server (http://biunit.naist.jp/psisec/). It calculates a PSSM for a given target sequence using the PSI-BLAST, predicts its secondary structure from the PSSM by our neural network program, searches its homologues in the current PDB sequences using the PSSM and shows a combined result with the predicted secondary structure.
| Results |
|---|
Predicted structural information improves PSI-BLAST performance
Scatter plots of E-values and Z-scores for secondary structure prediction are shown in Figure 3. Basically, the homologous protein pairs (red dots) were distributed more in the lower E-value and higher Z-score region, as compared with the non-homologous pairs (green dots). The vector w connecting two means is also shown. The two features were combined into one by calculating an inner product with the vector w.
Coverage-reliability plots for the various methods are shown in Figure 4, which clearly indicates that our filtering program, which considers matching of predicted secondary structures, has a better ability to recognize homologues than the original PSI-BLAST program. The Z-score for the solvent accessibilities improved the performance, but was less effective than that for the secondary structure. Combining the three features, E-value, Z-score for secondary structure and Z-score for solvent accessibility, yielded a slightly better result than that for E-value and Z-score for secondary structure.
Combination with two-way PSI-BLAST
It is well known that the E-values of PSI-BLAST are not symmetric: the E-value E(A, B) of protein A in a library, using protein B as a query, is often different from the E-value E(B, A) of protein B in a library, using protein A as a query. Using this asymmetry, the two-way PSI-BLAST method was proposed, which is reportedly more sensitive than the standard one-way PSI-BLAST (Teichmann et al., 1999; Kawabata et al., 2000). In the two-way PSI-BLAST method, the E-value for a pair of proteins A and B is evaluated by considering two PSI-BLAST searches:
![]() | (7) |
We examined the performance of our filtering method against the two-way PSI-BLAST method, by using a symmetrical Z-score for structural matching, defined as follows:
![]() | (8) |
The column labeled two-way in Figure 5 is the performance of our improved method based on the two-way PSI-BLAST method. As reported previously, the two-way PSI-BLAST coverage is larger than the one-way. Combination with the Z-score for solvent accessibility and secondary structure also improved the performance of the two-way PSI-BLAST method.
| Discussion |
|---|
Homologue detection using prediction with single sequence inputs
For predicting secondary structures and solvent accessibilities, neural networks with PSI-BLAST profile inputs were used in this study. To elucidate the relationship between the performance of homologue detection and the accuracy of secondary structure/solvent accessibility prediction, we employed less accurate methods, neural networks with single sequence inputs. For this purpose, we developed an in-house neural network program with the same architecture as used by Qian and Sejnowski (1988). We trained it using the SCOP representative datasets and evaluated it by the cross-validation method. The prediction accuracies were lower than those with profile inputs. For secondary structure prediction, Q3 for the network with single sequence inputs was 68.59%, whereas Q3 for that with profile inputs was 76.18%. For solvent accessibility prediction, Q2 for the network with single sequence inputs was 68.24%, whereas Q2 for that with profile inputs was 73.74%. The performance of homologue detection using these prediction methods is summarized in Figure 6. The results show that the prediction methods using a single sequence improved PSI-BLAST, but not as much as the methods using profile inputs. This suggests that the prediction accuracy for secondary structure/solvent accessibility is crucial for the performance of our filtering method.
Performance of observed-observed and predicted-observed secondary structures
Basically, we used the predicted secondary structure/solvent accessibility for both the query and library proteins. However, when remote homologue detection is used for a structure prediction, the structures of the library proteins are already known, whereas that of the query protein needs to be predicted. In order to determine the effect of using an observed structure, we examined the performance of various combinations of observed and predicted structures. Figure 7 shows the performance of three combinations of secondary structures: predicted structures for both query and library proteins (Pre-Pre), predicted structures for query proteins and observed structures for library proteins (Pre-Obs) and observed structures for both query and library proteins (Obs-Obs). It is reasonable that the performance of Obs-Obs is the best among the three. However, the performance of the predicted versus observed structures (Pre-Obs) is not much better than that of the predicted versus predicted structures (Pre-Pre). In other words, our filtering method using a predicted structure as a query worked equally well for a structure-unknown library and a structure-known library. Although a similar result was reported by Geourjon et al. (2001), it is still not clear why introducing observed structures did not improve on the performance of Pre-Pre. Coincident prediction errors for the query and library proteins may explain the high performance of Pre-Pre.
Limitations of the method and possible improvements
Compared with the other remote homologue detection methods, the advantage of our method is its easy implementation and fast computation. In addition, our evaluation of the performance is more reliable than those of previous, similar studies using ready-made secondary structure prediction programs, such as PHD (Rost and Sander, 1993) and PSI-PRED (Jones, 1999). This is because our in-house prediction programs can be trained by ourselves and we applied the cross-validation evaluation to them. However, we are aware of the limitations of our strategy. First, homologous pairs with large PSI-BLAST E-values cannot be found by our filtering method, because our method completely depends on PSI-BLAST to provide homologue candidates. Second, our method just filters the PSI-BLAST results; it cannot improve the alignments. The aligned sequences of the homologous pairs detected by our filtering method were often too short (data not shown). This was simply because PSI-BLAST alignments with larger E-values tend to be shorter. We now plan to introduce a dynamic programming program to realign only the homologue candidates found by PSI-BLAST, using predicted secondary structures. This may enhance the sensitivity and provide better alignments, without introducing large computational costs.
| Acknowledgments |
|---|
This work was supported by the Special Coordination Funds Promoting Science and Technology and a Grant-in-Aid for Scientific Research on Priority Area (C), Genome Information Science, from MEXT (Ministry of Education, Culture, Sports, Science and Technology, Japan).
| References |
|---|
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,H., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389-3402.
Bindewald,E.B., Cestaro,A., Hesser,J., Heiler,M. and Tosatto,S.C.E. (2003) Protein Eng., 16, 785-789.
Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) Science, 253, 164-170.
De la Cruz,X. and Thornton,J.M. (1999) Protein Sci., 8, 750-759.
Di Francesco,V., Munson,P.J. and Garnier,J. (1999) Bioinformatics, 15, 131-140.
Eddy,S. (1998) Bioinformatics, 14, 755-763.
Fischel-Goldsian,F., Mathiowitz,G. and Smith,T.F. (1990) Protein Eng., 3, 577-581.
Fischer,D. and Eisenberg,D. (1996) Protein Sci., 5, 947-955.
Geetha,V., Di Francesco,V., Garnier,J. and Munson,P.J. (1999) Protein Eng., 12, 527-534.
Geourjon,C., Combet,C., Blanchet,C. and Deleage,G. (2001) Protein Sci., 10, 788-797.
Ginalski,K., Pas,J., Wyrwicz,L.S., von Grotthuss,M., Bujnicki,J.M. and Rychlewski,L. (2003) Nucleic Acids Res., 31, 3804-3807.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 4355-4358.
Hargbo,J. and Elofsson,A. (1999) Proteins, 36, 68-76.
Jones,D.T. (1999) J. Mol. Biol., 292, 195-202.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86-89.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577-2637.
Kawabata,T. and Nishikawa,K. (2000) Proteins, 41, 108-122.
Kawabata,T., Arisaka,F. and Nishikawa,K. (2000) Gene, 259, 223-233.
Kelly,L.A., MacCallum,R.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 499-520.
McGuffin,L.J. and Jones,D.T. (2003) Bioinformatics, 19, 874-881.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536-540.
Qian,N. and Sejnowski,T.J. (1988) J. Mol. Biol., 202, 865-884.
Rice,D.W. and Eisenberg,D. (1997) J. Mol. Biol., 267, 1026-1038.
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584-599.
Rost,B., Sander,C. and Schneider,R. (1994) J. Mol. Biol., 235, 13-26.
Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 417-480.
Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA, pp. 318-362.
Russel,B.R., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349-365.
Shan,Y., Wang,G. and Zhou,H.-X. (2001) Proteins, 42, 23-37.
Teichmann,S.A., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390-399.
Wallqvist,A., Fukunishi,Y., Murphy,L.R., Fadel,A. and Levy,R.M. (2000) Bioinformatics, 16, 988-1002.
Zemla,A., Venclovas,C., Fidelis,K. and Rost,B. (1999) Proteins, 34, 220-223.
Received February 20, 2004; revised August 1, 2004; accepted August 3, 2004.
Edited by Fred Cohen














