Protein Engineering, Vol. 14, No. 7, 455-458,
July 2001
© 2001 Oxford University Press
COMMUNICATION |
Sequence-based detection of distantly related proteins with the same fold
Department of Chemistry and E. O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA 1 Present address: Sugen Inc., 230 East Grand Avenue, South San Francisco, CA 94080, USA
| Introduction |
|---|
|
|
|---|
Because proteins that have diverged beyond significant sequence similarity still retain the three-dimensional (3D) fold of their ancestor (Chothia and Lesk, 1986
First, it has long been recognized that the use of multiply-aligned sequences from a protein family improves the sensitivity of homology detection. This idea is used by many recent computational procedures that exploit evolutionary information to uncover subtle sequence similarity. Examples of such procedures include sequence profiles (Gribskov et al., 1987
), consensus templates or motifs (Taylor, 1986
; Bairoch, 1991
; Tatusov et al., 1994
; Yi and Lander, 1994
), position-specific scoring matrices (PSSMs) (Henikoff and Henikoff, 1997
), profile hidden Markov models (Eddy, 1998
), and intermediate sequence methods (Holm and Sander, 1997
; Neuwald et al., 1997
; Park et al., 1997
). PSI-BLAST (Altschul et al., 1997
), one of the most widely used of these procedures, employs an iterative profile search strategy that combines the advantages of both PSSM and intermediate sequence methods. This program has been used effectively by several groups to assign 3D folds to predicted genome products (Teichmann et al., 1999
).
Second, proteins having the same fold also by definition have very similar secondary structures. In the light of the improved accuracy of secondary structure prediction (Rost and Sander, 1993
), several groups have attempted to use sequence-derived predictions to improve the sensitivity of fold recognition (Fischer and Eisenberg, 1996
; Russel et al., 1996
; Di Francesco et al., 1997
; Rice and Eisenberg, 1997
; Rost et al., 1997
). These methods usually represent each protein in a template library by a one-dimensional (1D) string of symbols (profiles) each representing a distinctive 3D structural state, and then use dynamic programming (Needleman and Wunsch, 1970
) to align the predicted structural profiles of the query sequence against the observed profiles in the template library. In some tests, these prediction-based 1D threading methods have given results comparable to the more complex and computationally expensive 3D threading methods. This approach can be extended to include cases where structural information is not available for either sequence. A simple procedure that is based solely on predicted secondary structure has also produced some interesting results (Aurora and Rose, 1998
).
A third source of information, which has received less attention in the context of sequence comparison, is the local environments of residues. The local structural state of a residue in a protein is heavily influenced by the local interactions between amino acids surrounding the residue. The idea that, to a large extent, local sequence codes for local structure is behind virtually all modern secondary structure prediction algorithms (Cuff and Barton, 1999
) as well as a new approach to tertiary structure prediction (Simons et al., 1997
; Bystroff and Baker, 1998
). Such a `local strategy' was introduced into sequence alignment only recently (Grigoriev and Kim, 1999
). This study suggested that, for distantly related proteins, local sequence segments do contain more information about structural relatedness than individual residues.
In this work, we combine the three sources of sequence information into a multicomponent alignment procedure, and assess its performance in the detection of distantly related proteins with the same fold. By comparing with a common reference, a basic pair-wise sequence alignment method that uses the generic amino acid substitution matrix BLOSUM62 (Henikoff and Henikoff, 1992
), we also examine the contributions made by individual components and various combinations of these components to improved sensitivity of the method. Our analysis shows that evolutionary information presented by multiple sequence alignment is the major source of improvement, and combining evolutionary profiles for both sequences being compared provides better results than using the profile of only one sequence. Moreover, a combination of predicted secondary structure and local sequence information leads to a further gain in the sensitivity.
| Methods |
|---|
|
|
|---|
The benchmark
To test the performance of the sequence-based methods for the detection of fold relationship, we used a manually derived protein structural classification database, SCOP (version 1.48) (Murzin et al., 1995
), as the standard. We started with the protein domain set provided by the authors of SCOP, which includes one domain per protein family, and compiled the sequences of these 1128 domains into a fold library. All domains with fewer than 50 residues were removed; so were the domains representing the sole member of a fold. The remaining 786 protein domains belong to 183 folds, and each domain has at least one other domain in the data set that shares the same fold (the complete data set is available from the authors upon request). On average, domains with the same fold are 23% identical at the amino acid sequence level (standard deviation 5%), falling below the twilight zone for reliable recognition of homology. We used these 786 protein domains as the query sequences to search the fold library for remote homologs.
Compatibility functions
The key component of a sequence alignment algorithm is a score function, fij , that describes the compatibility between position i of the first sequence (i.e. a query sequence) and position j of the second sequence (i.e. a template sequence from the library). Here we use a simple equation:
![]() | (1) |
The first definition of gij uses the generic substitution matrix BLOSUM62 (Henikoff and Henikoff, 1992
),
![]() | (2) |
The second definition of gij replaces BLOSUM in Equation (2)
with the PSSM produced by a PSI-BLAST search of the query sequence against a non-redundant sequence data set NRDB90 (Holm and Sander, 1998
):
![]() | (3) |
The third definition of gij uses PSSM for both sequence i and sequence j:
![]() | (4) |
For convenience, we call Equation (3)
the single PSSM function and Equation (4)
the dual PSSM function.
The effect of neighboring residues on the compatibility between residue i and residue j was taken into account by a simple local averaging strategy. Depending on whether BLOSUM62, single PSSM, or dual PSSM is used, three different forms of gij are defined,
![]() | (5) |
![]() | (6) |
![]() | (7) |
Now we turn to the hij term in Equation (1)
. An alignment that relies solely on sequence similarity (i.e. without using predicted secondary structure) corresponds to
![]() | (8) |
When sequence-derived secondary structure predictions are included into the alignment, hij takes the form of
![]() | (9) |
(·) is a Dirichlet function that takes value 1 only if the predicted secondary structural states (
-helix, ß-strand and coil), ssi and ssj, match. The predicted secondary structures were obtained by running the PSIPRED program (Jones, 1999Alignment and scoring
The `global' alignment algorithm of Needleman and Wunsch (1970), with no penalties for terminal gaps, was used. Each of the 786 query sequences was aligned to every template sequence in the library. Compatibility functions with all combinations of the gij and hij functions defined in the last section were tested, with gap penalties optimized independently for each case. We employed a heuristic scheme (Grigoriev and Kim, 1999
) to determine whether a template had the same fold as the query. In this scheme, the alignment scores of the query with all templates in the library were used to calculate the average score (
) and standard deviation (
S ). The original alignment scores were then converted into Z scores by
![]() | (10) |
Alignments with Z scores above zero were kept for further analysis. These Z scores were first ranked from the highest to the lowest, and then the gaps between consecutive Z scores were computed. Starting from Z = 0 towards the highest Z score, the first gap that exceeded a pre-defined cutoff
was used to separate real homologs (with Z scores greater than the upper bound of the gap) from false positives (with Z scores smaller than the lower bound of the gap). Our previous study (Grigoriev and Kim, 1999
) showed that this heuristic scheme gave better results than the commonly used Z score hard cutoff. By varying
, we can examine the tradeoff between the completeness (or coverage) of the prediction (i.e. how many true remote homologs are predicted) and the accuracy of the prediction (i.e. how many predicted homologs are true homologs).
| Results |
|---|
|
|
|---|
We tested the 12 different compatibility functions for the detection of fold relationship between proteins with low sequence similarity. Common to each test were the list of query sequences and the template sequence library. For each compatibility function, we computed the total number of matches between proteins that truly belong to the same fold (i.e. true matches) at a given error rate (i.e. the total number of false matches). The results are presented in Figure 1
), all other compatibility functions offer better sensitivity, although the degree of improvement varies from function to function.
|
Of the three sources of information, whether or not including evolutionary information has the most significant impact, as the basic sequence-based method, single and dual PSSM-based methods are clearly separated on the coverageerror rate plot (Figure 1
In most biology settings that employ computational predictions, limited errors often can be tolerated. For a prediction to be useful, the allowed error rate may vary from application to application. For the purpose of discussion, we select an error rate that corresponds to a point where the number of true matches to be included is just about to be exceeded by the number of false matches that would also be included if the threshold used to separate the true homologs from the false positives were to be decreased further. Although this definition is not rigorous mathematically, it mimics the criterion that most biologists use to decide how much further to go down a list returned by homology search software. For most methods tested here, this error rate corresponds to 13 false matches. At such an error rate, the basic pair-wise alignment method correctly predicts 48 protein pairs with the same fold. Although the use of predicted secondary structure or average sequence similarity individually provides only moderate gain in sensitivity (55 and 49, respectively), the combination of the two identifies 38 more pairs than the basic method (78% improvement, Figure 1a
).
The basic PSSM method (
) detects 130 protein pairs, at an error rate of 13. Including predicted secondary structures raises the number of true matches to 148. A further gain of 10 pairs is attained by the additional use of average sequence similarity. The average sequence similarity by itself has a similar effect to the predicted secondary structure (150 pairs).
The basic dual PSSM method (
) recognizes 153 true pairs at the expense of 13 false positives. Forty-six (30%) more pairs were detected when both predicted secondary structure and average sequence similarity were employed. The improvement from using only one of them is lower, 166 pairs for predicted secondary structure and 177 pairs for average similarity, respectively.
Within all three groups, the compatibility functions that incorporate both predicted secondary structure and average sequence similarity (i.e.
and
) outperform the basic sequence alignment methods as well as those incorporating one of the components. Overall, if a small error rate is allowed, the best method developed here (i.e.
) can detect at least one protein with the same fold for ~30% of the query proteins being considered.
| Discussion |
|---|
|
|
|---|
In this study, we developed a multicomponent sequence alignment procedure for the detection of proteins with the same fold that combines information derived from different methods. We showed that this procedure enhances substantially the sensitivity of the conventional sequence alignment method, and detects 250% more remote homologs with 100% specificity. When a small error rate is allowed, the method can find proteins with the same fold in one-third of the tested cases.
Among all sources of information evaluated, profiles derived from multiple sequence alignments offer the largest improvement to sensitivity, especially when profiles from both the query and the template are used. A recent comparison of several fold recognition algorithms also showed that evolutionary information contributed most to the prediction accuracy of methods that combine 1D threading with sequence alignment (Jaroszewski et al., 1998
). In previous studies, PSSMs were used either for query alone (Altschul et al., 1997
) or template (Schaffer et al., 1999
) alone, or together but in two separated alignment runs (Kelley et al., 2000
). Here we show that the PSSMs of query and template can be integrated into a single sequence alignment run to provide better performance.
Another interesting finding of this study is that using segments of protein sequences instead of single residues enhances the performance of sequence comparison for detecting proteins with the same fold. The average similarities between protein sequence segments work better than similarity between individual residues. Consistent with the early findings (Fischer and Eisenberg, 1996
; Rost et al., 1997
; De La Cruz and Thornton, 1999
; Geetha et al., 1999
), our results showed that the use of predicted secondary structure improves the performance of the sequence alignment methods. However, it is the combination of predicted secondary structure and average sequence similarity that provides the best performance in fold recognition. Since secondary structural states of individual residues are also dependent on local sequence context, this result suggests that similarities between sequence segments contain valuable information on ancestral relations.
The practical value of the proposed multicomponent procedure comes from the fact that it outperforms the basic profile-based method at low error rates (Figure 1
). In contrast, most current methods that claim to detect more remote homologs than the sequence comparison method do so only at an impractically high error rate (for example, see Rice and Eisenberg, 1997
). Note that, as a proof of principle, we have used a very simple strategy to combine secondary structure and local sequence information with evolutionary information, and used the PSSM output of PSI-BLAST directly with no further adjustment. Rychlewski and co-workers have recently shown that the details of profile calculation has a strong influence on its sensitivity in recognizing distant homologies (Rychlewski et al., 2000
). We expect that, by experimenting with different combining schemes and profile calculations, the performance of the multicomponent method would be further improved.
| Notes |
|---|
2 To whom correspondence should be addressed at: 220 Melvin Calvin Laboratory, University of California, Berkeley, CA 94720-5230, USA.E-mail: shkim{at}lbl.gov
| Acknowledgments |
|---|
This work was supported by grants from the Department of Energy (DE-AC03-76SF00098) and the NSF (97-23352). The research used the resources of the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
| References |
|---|
|
|
|---|
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 33893402.
Aurora,R. and Rose,G.D. (1998) Proc. Natl Acad. Sci. USA, 95, 28182823.
Bairoch,A. (1991) Nucleic Acids Res., 19, 22412245.
Bystroff,C. and Baker,D. (1998) J. Mol. Biol., 281, 565577.[Web of Science][Medline]
Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823826.[Web of Science][Medline]
Cuff,J.A. and Barton,G.J. (1999) Proteins: Struct. Funct. Genet., 34, 508519.[Web of Science][Medline]
De La Cruz,X. and Thornton,J.M. (1999) Protein Sci., 8, 750759.[Web of Science][Medline]
Di Francesco,V., Garnier,J. and Munson,P.J. (1997) J. Mol. Biol., 267, 446463.[Web of Science][Medline]
Eddy,S.R. (1998) Bioinformatics, 14, 755763.
Fischer,D. and Eisenberg,D. (1996) Protein Sci., 5, 947955.[Web of Science][Medline]
Geetha,V., Di Francesso,V., Garnier,J. and Munson,P.J. (1999) Protein Eng., 12, 527534.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 43554358.
Grigoriev,I.V. and Kim,S.-H. (1999) Proc. Natl Acad. Sci. USA, 96, 1431814323.
Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 1091510919.
Henikoff,S. and Henikoff,J.G. (1997) Protein Sci., 6, 698705.[Web of Science][Medline]
Holm,L. and Sander,C. (1997) Proteins: Struct. Funct. Genet., 28, 7282.[Web of Science][Medline]
Holm,L. and Sander,C. (1998) Bioinformatics, 14, 423429.
Jaroszewski,L., Rychlewski,L., Zhang,B. and Godzik,A. (1998) Protein Sci., 7, 14311440.[Web of Science][Medline]
Jones,D.T. (1999) J. Mol. Biol., 292, 195202.[Web of Science][Medline]
Kelley,L.A., MacCallum,R.M. and Sternberg,M.J. (2000) J. Mol. Biol., 299, 499520.[Web of Science][Medline]
Murzin,A.G. (1998) Curr. Opin. Struct. Biol., 8, 380387.[Web of Science][Medline]
Murzin,A., Brenner,S.E., Hubband,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[Web of Science][Medline]
Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443453.[Web of Science][Medline]
Neuwald,A.F., Liu,J.S., Lipman,D.J. and Lawrence,C.E. (1997) Nucleic Acids Res., 25, 16651677.
Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273, 349354.[Web of Science][Medline]
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284, 12011210.[Web of Science][Medline]
Rice,D.W. and Eisenberg,D. (1997) J. Mol. Biol., 267, 10261038.[Web of Science][Medline]
Rost,B. (1997) Fold. Des., 2, S19S24.[Web of Science][Medline]
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584599.[Web of Science][Medline]
Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 471480.[Web of Science][Medline]
Russel,R.B., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349365.[Web of Science][Medline]
Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Protein Sci., 9, 232241.[Web of Science][Medline]
Schaffer,A.A., Wolf,Y.I., Ponting,C.P., Koonin,E.V., Aravind,L. and Altschul,S.F. (1999) Bioinformatics, 15, 10001011.
Simons,K.T., Kooperberg,C., Huang,E. and Baker,D. (1997) J. Mol. Biol., 268, 209225.[Web of Science][Medline]
Tatusov,R.L., Altschul,S.F. and Koonin,E.V. (1994) Proc. Natl Acad. Sci. USA, 91, 1209112095.
Taylor,W.R. (1986) J. Mol. Biol., 188, 233258.[Web of Science][Medline]
Teichmann,S.A., Chothia,C. and Gerstain,M. (1999) Curr. Opin. Struct. Biol., 9, 390399.[Web of Science][Medline]
Yi,T.M. and Lander,E.S. (1994) Protein Sci., 3, 13151328.[Web of Science][Medline]
Received August 2, 2000; revised March 19, 2001; accepted May 8, 2001.
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












