Skip Navigation

This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Grigoriev, I. V.
Right arrow Articles by Kim, S.-H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Grigoriev, I. V.
Right arrow Articles by Kim, S.-H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Protein Engineering, Vol. 14, No. 7, 455-458, July 2001
© 2001 Oxford University Press


COMMUNICATION

Sequence-based detection of distantly related proteins with the same fold

Igor V. Grigoriev1, Chao Zhang and Sung-Hou Kim,2

Department of Chemistry and E. O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA 1 Present address: Sugen Inc., 230 East Grand Avenue, South San Francisco, CA 94080, USA


    Introduction
 Top
 Introduction
 Methods
 Results
 Discussion
 References
 
Because proteins that have diverged beyond significant sequence similarity still retain the three-dimensional (3D) fold of their ancestor (Chothia and Lesk, 1986Go; Rost, 1997Go), the recognition of structural similarity between proteins provides powerful clues to ancestry. In fact, a large number of distant homology relationships were identified only after the structures of the proteins had been solved (Murzin, 1998Go). However, structures are being determined only for a small fraction of the proteins. There is a pressing need for improvement in the performance of sequence-based methods for the detection of proteins with the same fold but scant sequence similarity. Here, we examine how to achieve this goal by combining three kinds of information from a protein sequence.

First, it has long been recognized that the use of multiply-aligned sequences from a protein family improves the sensitivity of homology detection. This idea is used by many recent computational procedures that exploit evolutionary information to uncover subtle sequence similarity. Examples of such procedures include sequence profiles (Gribskov et al., 1987Go), consensus templates or motifs (Taylor, 1986Go; Bairoch, 1991Go; Tatusov et al., 1994Go; Yi and Lander, 1994Go), position-specific scoring matrices (PSSMs) (Henikoff and Henikoff, 1997Go), profile hidden Markov models (Eddy, 1998Go), and intermediate sequence methods (Holm and Sander, 1997Go; Neuwald et al., 1997Go; Park et al., 1997Go). PSI-BLAST (Altschul et al., 1997Go), one of the most widely used of these procedures, employs an iterative profile search strategy that combines the advantages of both PSSM and intermediate sequence methods. This program has been used effectively by several groups to assign 3D folds to predicted genome products (Teichmann et al., 1999Go).

Second, proteins having the same fold also by definition have very similar secondary structures. In the light of the improved accuracy of secondary structure prediction (Rost and Sander, 1993Go), several groups have attempted to use sequence-derived predictions to improve the sensitivity of fold recognition (Fischer and Eisenberg, 1996Go; Russel et al., 1996Go; Di Francesco et al., 1997Go; Rice and Eisenberg, 1997Go; Rost et al., 1997Go). These methods usually represent each protein in a template library by a one-dimensional (1D) string of symbols (profiles) each representing a distinctive 3D structural state, and then use dynamic programming (Needleman and Wunsch, 1970Go) to align the predicted structural profiles of the query sequence against the observed profiles in the template library. In some tests, these prediction-based 1D threading methods have given results comparable to the more complex and computationally expensive 3D threading methods. This approach can be extended to include cases where structural information is not available for either sequence. A simple procedure that is based solely on predicted secondary structure has also produced some interesting results (Aurora and Rose, 1998Go).

A third source of information, which has received less attention in the context of sequence comparison, is the local environments of residues. The local structural state of a residue in a protein is heavily influenced by the local interactions between amino acids surrounding the residue. The idea that, to a large extent, local sequence codes for local structure is behind virtually all modern secondary structure prediction algorithms (Cuff and Barton, 1999Go) as well as a new approach to tertiary structure prediction (Simons et al., 1997Go; Bystroff and Baker, 1998Go). Such a `local strategy' was introduced into sequence alignment only recently (Grigoriev and Kim, 1999Go). This study suggested that, for distantly related proteins, local sequence segments do contain more information about structural relatedness than individual residues.

In this work, we combine the three sources of sequence information into a multicomponent alignment procedure, and assess its performance in the detection of distantly related proteins with the same fold. By comparing with a common reference, a basic pair-wise sequence alignment method that uses the generic amino acid substitution matrix BLOSUM62 (Henikoff and Henikoff, 1992Go), we also examine the contributions made by individual components and various combinations of these components to improved sensitivity of the method. Our analysis shows that evolutionary information presented by multiple sequence alignment is the major source of improvement, and combining evolutionary profiles for both sequences being compared provides better results than using the profile of only one sequence. Moreover, a combination of predicted secondary structure and local sequence information leads to a further gain in the sensitivity.


    Methods
 Top
 Introduction
 Methods
 Results
 Discussion
 References
 
The benchmark

To test the performance of the sequence-based methods for the detection of fold relationship, we used a manually derived protein structural classification database, SCOP (version 1.48) (Murzin et al., 1995Go), as the standard. We started with the protein domain set provided by the authors of SCOP, which includes one domain per protein family, and compiled the sequences of these 1128 domains into a fold library. All domains with fewer than 50 residues were removed; so were the domains representing the sole member of a fold. The remaining 786 protein domains belong to 183 folds, and each domain has at least one other domain in the data set that shares the same fold (the complete data set is available from the authors upon request). On average, domains with the same fold are 23% identical at the amino acid sequence level (standard deviation 5%), falling below the twilight zone for reliable recognition of homology. We used these 786 protein domains as the query sequences to search the fold library for remote homologs.

Compatibility functions

The key component of a sequence alignment algorithm is a score function, fij , that describes the compatibility between position i of the first sequence (i.e. a query sequence) and position j of the second sequence (i.e. a template sequence from the library). Here we use a simple equation:

(1)
to combine the compatibility measure for the two sequences (gij ) and the compatibility measure for the respective predicted secondary structures (hij). Six different definitions of gij and two definitions of hij are considered.

The first definition of gij uses the generic substitution matrix BLOSUM62 (Henikoff and Henikoff, 1992Go),

(2)
where aai and aaj denote the amino acids at position i of the query and position j of the template, respectively.

The second definition of gij replaces BLOSUM in Equation (2)Go with the PSSM produced by a PSI-BLAST search of the query sequence against a non-redundant sequence data set NRDB90 (Holm and Sander, 1998Go):

(3)
where PSSMi denotes the PSSM at position i. E-value cutoff (0.0001) was used for the PSI-BLAST search, and the maximum number of iterations was set to seven.

The third definition of gij uses PSSM for both sequence i and sequence j:

(4)

For convenience, we call Equation (3)Go the single PSSM function and Equation (4)Go the dual PSSM function.

The effect of neighboring residues on the compatibility between residue i and residue j was taken into account by a simple local averaging strategy. Depending on whether BLOSUM62, single PSSM, or dual PSSM is used, three different forms of gij are defined,

(5)

(6)
and

(7)
where 2L + 1 defines the size of the window that encloses the center residue.

Now we turn to the hij term in Equation (1)Go. An alignment that relies solely on sequence similarity (i.e. without using predicted secondary structure) corresponds to

(8)
which gives rise to the first definition of hij.

When sequence-derived secondary structure predictions are included into the alignment, hij takes the form of

(9)
where {delta}(·) is a Dirichlet function that takes value 1 only if the predicted secondary structural states ({alpha}-helix, ß-strand and coil), ssi and ssj, match. The predicted secondary structures were obtained by running the PSIPRED program (Jones, 1999Go). Because the score for a secondary structure mismatch (0) is higher than the average score of BLOSUM62 or PSSM, the above definition would bias the alignment in favor of secondary structure mismatches over weak alignment at the matched secondary structure regions. To correct this bias, we shifted the BLOSUM62 and PSSM such that the average score of the element in the matrix equals 0.

Alignment and scoring

The `global' alignment algorithm of Needleman and Wunsch (1970), with no penalties for terminal gaps, was used. Each of the 786 query sequences was aligned to every template sequence in the library. Compatibility functions with all combinations of the gij and hij functions defined in the last section were tested, with gap penalties optimized independently for each case. We employed a heuristic scheme (Grigoriev and Kim, 1999Go) to determine whether a template had the same fold as the query. In this scheme, the alignment scores of the query with all templates in the library were used to calculate the average score () and standard deviation ({sigma}S ). The original alignment scores were then converted into Z scores by

(10)

Alignments with Z scores above zero were kept for further analysis. These Z scores were first ranked from the highest to the lowest, and then the gaps between consecutive Z scores were computed. Starting from Z = 0 towards the highest Z score, the first gap that exceeded a pre-defined cutoff {varepsilon} was used to separate real homologs (with Z scores greater than the upper bound of the gap) from false positives (with Z scores smaller than the lower bound of the gap). Our previous study (Grigoriev and Kim, 1999Go) showed that this heuristic scheme gave better results than the commonly used Z score hard cutoff. By varying {varepsilon}, we can examine the tradeoff between the completeness (or coverage) of the prediction (i.e. how many true remote homologs are predicted) and the accuracy of the prediction (i.e. how many predicted homologs are true homologs).


    Results
 Top
 Introduction
 Methods
 Results
 Discussion
 References
 
We tested the 12 different compatibility functions for the detection of fold relationship between proteins with low sequence similarity. Common to each test were the list of query sequences and the template sequence library. For each compatibility function, we computed the total number of matches between proteins that truly belong to the same fold (i.e. true matches) at a given error rate (i.e. the total number of false matches). The results are presented in Figure 1Go. Compared with the reference, the basic pair-wise sequence alignment method (i.e. ), all other compatibility functions offer better sensitivity, although the degree of improvement varies from function to function.





View larger version (57K):
[in this window]
[in a new window]
 
Fig. 1. . Coverage versus error rate plot. For each compatibility function, we calculated the number of true homologous pairs detected for a given error rate (i.e. number of false matches, ranging from 0 to 50). The coverage–error rate curves for four BLOSUM62-based functions, four single PSSM-based functions and four dual PSSM-based functions are separated into four groups and are shown in (a), (b) and (c), respectively. Dotted lines and crosses represent the basic sequence alignment methods; dashed lines and circles represent methods that include predicted secondary structure; dash–dot lines and triangles represent methods that use average sequence similarity; and solid lines and diamonds represent methods that use both predicted secondary structure and average sequence similarity.

 
Of the three sources of information, whether or not including evolutionary information has the most significant impact, as the basic sequence-based method, single and dual PSSM-based methods are clearly separated on the coverage–error rate plot (Figure 1Go). The combined use of the profiles for both query and template sequences performs better than the use of a single profile. At zero error rate (i.e. 100% specificity), where the effect of the predicted secondary structure and averaged sequence information is not very significant, the basic pair-wise alignment method recognizes 38 protein pairs with the same fold. Introducing the profile for the query sequence finds 56 more protein pairs. If the profiles for both query and template sequences are used, a total of 132 pairs can be recognized (corresponding to a 250% improvement).

In most biology settings that employ computational predictions, limited errors often can be tolerated. For a prediction to be useful, the allowed error rate may vary from application to application. For the purpose of discussion, we select an error rate that corresponds to a point where the number of true matches to be included is just about to be exceeded by the number of false matches that would also be included if the threshold used to separate the true homologs from the false positives were to be decreased further. Although this definition is not rigorous mathematically, it mimics the criterion that most biologists use to decide how much further to go down a list returned by homology search software. For most methods tested here, this error rate corresponds to 13 false matches. At such an error rate, the basic pair-wise alignment method correctly predicts 48 protein pairs with the same fold. Although the use of predicted secondary structure or average sequence similarity individually provides only moderate gain in sensitivity (55 and 49, respectively), the combination of the two identifies 38 more pairs than the basic method (78% improvement, Figure 1aGo).

The basic PSSM method ( ) detects 130 protein pairs, at an error rate of 13. Including predicted secondary structures raises the number of true matches to 148. A further gain of 10 pairs is attained by the additional use of average sequence similarity. The average sequence similarity by itself has a similar effect to the predicted secondary structure (150 pairs).

The basic dual PSSM method () recognizes 153 true pairs at the expense of 13 false positives. Forty-six (30%) more pairs were detected when both predicted secondary structure and average sequence similarity were employed. The improvement from using only one of them is lower, 166 pairs for predicted secondary structure and 177 pairs for average similarity, respectively.

Within all three groups, the compatibility functions that incorporate both predicted secondary structure and average sequence similarity (i.e. and ) outperform the basic sequence alignment methods as well as those incorporating one of the components. Overall, if a small error rate is allowed, the best method developed here (i.e. ) can detect at least one protein with the same fold for ~30% of the query proteins being considered.


    Discussion
 Top
 Introduction
 Methods
 Results
 Discussion
 References
 
In this study, we developed a multicomponent sequence alignment procedure for the detection of proteins with the same fold that combines information derived from different methods. We showed that this procedure enhances substantially the sensitivity of the conventional sequence alignment method, and detects 250% more remote homologs with 100% specificity. When a small error rate is allowed, the method can find proteins with the same fold in one-third of the tested cases.

Among all sources of information evaluated, profiles derived from multiple sequence alignments offer the largest improvement to sensitivity, especially when profiles from both the query and the template are used. A recent comparison of several fold recognition algorithms also showed that evolutionary information contributed most to the prediction accuracy of methods that combine 1D threading with sequence alignment (Jaroszewski et al., 1998Go). In previous studies, PSSMs were used either for query alone (Altschul et al., 1997Go) or template (Schaffer et al., 1999Go) alone, or together but in two separated alignment runs (Kelley et al., 2000Go). Here we show that the PSSMs of query and template can be integrated into a single sequence alignment run to provide better performance.

Another interesting finding of this study is that using segments of protein sequences instead of single residues enhances the performance of sequence comparison for detecting proteins with the same fold. The average similarities between protein sequence segments work better than similarity between individual residues. Consistent with the early findings (Fischer and Eisenberg, 1996Go; Rost et al., 1997Go; De La Cruz and Thornton, 1999Go; Geetha et al., 1999Go), our results showed that the use of predicted secondary structure improves the performance of the sequence alignment methods. However, it is the combination of predicted secondary structure and average sequence similarity that provides the best performance in fold recognition. Since secondary structural states of individual residues are also dependent on local sequence context, this result suggests that similarities between sequence segments contain valuable information on ancestral relations.

The practical value of the proposed multicomponent procedure comes from the fact that it outperforms the basic profile-based method at low error rates (Figure 1Go). In contrast, most current methods that claim to detect more remote homologs than the sequence comparison method do so only at an impractically high error rate (for example, see Rice and Eisenberg, 1997Go). Note that, as a proof of principle, we have used a very simple strategy to combine secondary structure and local sequence information with evolutionary information, and used the PSSM output of PSI-BLAST directly with no further adjustment. Rychlewski and co-workers have recently shown that the details of profile calculation has a strong influence on its sensitivity in recognizing distant homologies (Rychlewski et al., 2000Go). We expect that, by experimenting with different combining schemes and profile calculations, the performance of the multicomponent method would be further improved.


    Notes
 
2 To whom correspondence should be addressed at: 220 Melvin Calvin Laboratory, University of California, Berkeley, CA 94720-5230, USA.E-mail: shkim{at}lbl.gov Back


    Acknowledgments
 
This work was supported by grants from the Department of Energy (DE-AC03-76SF00098) and the NSF (97-23352). The research used the resources of the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory, Berkeley, CA, USA.


    References
 Top
 Introduction
 Methods
 Results
 Discussion
 References
 
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.[Abstract/Free Full Text]

Aurora,R. and Rose,G.D. (1998) Proc. Natl Acad. Sci. USA, 95, 2818–2823.[Abstract/Free Full Text]

Bairoch,A. (1991) Nucleic Acids Res., 19, 2241–2245.

Bystroff,C. and Baker,D. (1998) J. Mol. Biol., 281, 565–577.[Web of Science][Medline]

Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823–826.[Web of Science][Medline]

Cuff,J.A. and Barton,G.J. (1999) Proteins: Struct. Funct. Genet., 34, 508–519.[Web of Science][Medline]

De La Cruz,X. and Thornton,J.M. (1999) Protein Sci., 8, 750–759.[Web of Science][Medline]

Di Francesco,V., Garnier,J. and Munson,P.J. (1997) J. Mol. Biol., 267, 446–463.[Web of Science][Medline]

Eddy,S.R. (1998) Bioinformatics, 14, 755–763.[Abstract/Free Full Text]

Fischer,D. and Eisenberg,D. (1996) Protein Sci., 5, 947–955.[Web of Science][Medline]

Geetha,V., Di Francesso,V., Garnier,J. and Munson,P.J. (1999) Protein Eng., 12, 527–534.[Abstract/Free Full Text]

Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 4355–4358.[Abstract/Free Full Text]

Grigoriev,I.V. and Kim,S.-H. (1999) Proc. Natl Acad. Sci. USA, 96, 14318–14323.[Abstract/Free Full Text]

Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 10915–10919.[Abstract/Free Full Text]

Henikoff,S. and Henikoff,J.G. (1997) Protein Sci., 6, 698–705.[Web of Science][Medline]

Holm,L. and Sander,C. (1997) Proteins: Struct. Funct. Genet., 28, 72–82.[Web of Science][Medline]

Holm,L. and Sander,C. (1998) Bioinformatics, 14, 423–429.[Abstract/Free Full Text]

Jaroszewski,L., Rychlewski,L., Zhang,B. and Godzik,A. (1998) Protein Sci., 7, 1431–1440.[Web of Science][Medline]

Jones,D.T. (1999) J. Mol. Biol., 292, 195–202.[Web of Science][Medline]

Kelley,L.A., MacCallum,R.M. and Sternberg,M.J. (2000) J. Mol. Biol., 299, 499–520.[Web of Science][Medline]

Murzin,A.G. (1998) Curr. Opin. Struct. Biol., 8, 380–387.[Web of Science][Medline]

Murzin,A., Brenner,S.E., Hubband,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.[Web of Science][Medline]

Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443–453.[Web of Science][Medline]

Neuwald,A.F., Liu,J.S., Lipman,D.J. and Lawrence,C.E. (1997) Nucleic Acids Res., 25, 1665–1677.[Abstract/Free Full Text]

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273, 349–354.[Web of Science][Medline]

Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284, 1201–1210.[Web of Science][Medline]

Rice,D.W. and Eisenberg,D. (1997) J. Mol. Biol., 267, 1026–1038.[Web of Science][Medline]

Rost,B. (1997) Fold. Des., 2, S19–S24.[Web of Science][Medline]

Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.[Web of Science][Medline]

Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 471–480.[Web of Science][Medline]

Russel,R.B., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349–365.[Web of Science][Medline]

Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Protein Sci., 9, 232–241.[Web of Science][Medline]

Schaffer,A.A., Wolf,Y.I., Ponting,C.P., Koonin,E.V., Aravind,L. and Altschul,S.F. (1999) Bioinformatics, 15, 1000–1011.[Abstract/Free Full Text]

Simons,K.T., Kooperberg,C., Huang,E. and Baker,D. (1997) J. Mol. Biol., 268, 209–225.[Web of Science][Medline]

Tatusov,R.L., Altschul,S.F. and Koonin,E.V. (1994) Proc. Natl Acad. Sci. USA, 91, 12091–12095.[Abstract/Free Full Text]

Taylor,W.R. (1986) J. Mol. Biol., 188, 233–258.[Web of Science][Medline]

Teichmann,S.A., Chothia,C. and Gerstain,M. (1999) Curr. Opin. Struct. Biol., 9, 390–399.[Web of Science][Medline]

Yi,T.M. and Lander,E.S. (1994) Protein Sci., 3, 1315–1328.[Web of Science][Medline]

Received August 2, 2000; revised March 19, 2001; accepted May 8, 2001.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Grigoriev, I. V.
Right arrow Articles by Kim, S.-H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Grigoriev, I. V.
Right arrow Articles by Kim, S.-H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?