PEDS Advance Access originally published online on March 13, 2006
Protein Engineering Design and Selection 2006 19(5):187-193; doi:10.1093/protein/gzj018

© The Author 2006. Published by Oxford University Press. All rights reserved.

A knowledge-based scoring function based on residue triplets for protein structure prediction

Shing-Chung Ngan, Michael T. Inouye and Ram Samudrala1

Computational Genomics Group, Department of Microbiology, University of Washington School of Medicine, Seattle, WA 98195, USA

1 To whom correspondence should be addressed. E-mail: ram{at}compbio.washington.edu


    Abstract
 
One of the general paradigms for ab initio protein structure prediction involves sampling the conformational space such that a large set of decoy (candidate) structures are generated and then selecting native-like conformations from those decoys using various scoring functions. In this study, based on a physical/geometric approach first suggested by Banavar and colleagues, we formulate a knowledge-based scoring function, which uses the radii of curvature formed among triplets of residues in a protein conformation. By analyzing its performance on various decoy sets, we determine a good set of parameters—the distance cutoff and the number of distance bins—to use for configuring such a function. Furthermore, we investigate the effect of using various approaches for compiling the prior distribution on the performance of the knowledge-based function. Possible extensions to the current form of the residue triplet scoring function are discussed.

Keywords: ab initio prediction/Bayesian/protein structure


    Introduction
 
In protein structure prediction, a given sequence with one or more known homologs whose conformations have been experimentally determined can be modeled with comparative modeling techniques (Blundell et al., 1987; Bajorath et al., 1994; Johnson et al., 1994; Sali, 1995; Sanchez and Sali, 1997). On the other hand, a sequence with no obvious homologs is often modeled using ab initio methods (Friesner and Gunn, 1996; Jones, 1997; Levitt et al., 1999). One of the general paradigms for ab initio structure prediction involves sampling the conformational space such that a large set of ‘decoy’ structures is generated and then selecting native-like conformations from those decoys using various scoring functions (Samudrala et al., 1999; Samudrala and Levitt, 2002). Since the first papers on protein structure prediction appeared some 30 years ago, both conformational space sampling and scoring function design have remained major challenges in ab initio structure prediction (Moult et al., 1997, 1999, 2001, 2003).

There are two broad categories of scoring functions. Functions in the first category are largely based on aspects of the known physics of molecular interaction, such as the van der Waals force, electrostatics, and the bending and torsional forces, to determine the energy of a particular conformation (Brooks et al., 1983; Weiner et al., 1986; Jorgensen and Tirado-Rives, 1988; Nemethy et al., 1992; Cornell et al., 1995; MacKerell et al., 1998). Functions in the second category are knowledge-based. Each of these knowledge-based functions tries to capture some aspect of native protein conformations, such as the tendency of a certain amino acid to be exposed to or buried from the solvent, or to be part of a helix, strand or coil local structure. These knowledge-based functions are compiled from the statistics of a database of experimentally determined protein structures (Wodak and Rooman, 1993; Sippl, 1995; DeBolt and Skolnick, 1996; Gilis and Rooman, 1996; Jernigan and Bahar, 1996; Zhang et al., 1997; Samudrala and Moult, 1998). Interaction between these two categories of functions has resulted in a fertile ground for the experimentation and construction of new scoring functions. In this study, based on a physical/geometric approach first suggested by Banavar and colleagues (Maritan et al., 2000; Banavar et al., 2002, 2003a, b), we formulate and analyze an analogous knowledge-based scoring function (denoted as the residue triplet scoring function), which involves the radii of curvature formed among triplets of residues in a protein conformation. We also investigate the effect of using various approaches for compiling the prior distribution on the performance of the knowledge-based function.

The paper is organized as follows. We first briefly review the physical/geometric approach of Banavar and colleagues. We then describe the construction of a knowledge-based scoring function which incorporates some key features from that of Banavar et al. The performance of the knowledge-based function in structure prediction is evaluated through its application to 41 decoy sets of various quality. Finally, we propose some possible extensions to the current form of the scoring function.


    Theoretical background and methods
 
The three-body potential of Banavar et al.

In Maritan et al. (2000) and Banavar et al. (2002, 2003a, b), Banavar and colleagues viewed a protein chain as a system of discrete particles and considered interactions among any three particles through a three-body potential. By drawing a circle through any given three particles, the radius of curvature could be determined and was used as the input variable to the potential function. In their Monte-Carlo simulation of protein chain folding, a Lennard-Jones type function was chosen as the potential. It was demonstrated that protein-like structures, such as short segments of helices with a special pitch-to-radius ratio, sheets and hairpins, were naturally obtained as ground states in their simulations.
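
The radius of curvature of a triplet is the radius of the circle that passes through the three points. The following is a minimal sketch of that geometric step, using the standard relation R = abc/(4K) between the side lengths a, b, c of a triangle and its area K. The function name `radius_of_curvature`, the collinearity tolerance and the use of a single representative point per residue are assumptions made here for illustration. They are not details given in the text.

```python
import math


def radius_of_curvature(p1, p2, p3):
    """Radius of the circle through three 3D points (the circumradius).

    Returns math.inf when the points are (numerically) collinear.
    """
    # Side lengths of the triangle formed by the three points.
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # Squared area from Heron's formula.
    s = 0.5 * (a + b + c)
    area_sq = s * (s - a) * (s - b) * (s - c)
    if area_sq <= 1e-12:  # assumed tolerance for degenerate triangles
        return math.inf
    # Circumradius: R = abc / (4 * area).
    return a * b * c / (4.0 * math.sqrt(area_sq))


if __name__ == "__main__":
    # A 3-4-5 right triangle: the hypotenuse is a diameter of the circle.
    print(radius_of_curvature((0, 0, 0), (3, 0, 0), (0, 4, 0)))
```

Collinear points have no finite circle through them, so the sketch returns infinity, which any finite distance cutoff would then exclude.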

A knowledge-based formulation of the three-body potential

Our formulation of the knowledge-based residue triplet potential is analogous to the standard pairwise residue distance-dependent scoring function, with two main modifications. First, the two-body potential in the pairwise case is replaced by a three-body potential. Second, the pairwise residue distances, which form inputs to the score calculation for a given conformation, are replaced by the radii of curvature of residue triplets. It should be noted that a residue triplet does not necessarily consist of three residues consecutive in sequence, just as a residue pair does not necessarily correspond to a pair of neighboring residues in the two-body potential. Precisely, in terms of the Bayesian statistics formalism as described in Samudrala and Moult (1998), we view a given set of conformations for a protein sequence as comprising two subsets: a subset of correct conformations {C} and a subset of incorrect conformations {I}. For a given conformation, we calculate the probability that it belongs to the subset of correct structures {C}, given some properties of the conformation. In our present case, these properties are the set of distances $\{r_{abc}^{ijk}\}$, where $r_{abc}^{ijk}$ is the radius of curvature formed by residues i, j and k of residue types a, b and c. The probability is denoted as $P(C \mid \{r_{abc}^{ijk}\})$. Using Bayes' theorem, one obtains

$$P\left(C \mid \{r_{abc}^{ijk}\}\right) = P(C)\,\frac{P\left(\{r_{abc}^{ijk}\} \mid C\right)}{P\left(\{r_{abc}^{ijk}\}\right)} \tag{1}$$
where $P(\{r_{abc}^{ijk}\} \mid C)$ is the (posterior) probability of observing the set of radii of curvature $\{r_{abc}^{ijk}\}$ in a correct structure, $P(\{r_{abc}^{ijk}\})$ is the (prior) probability of observing such a set of radii in any correct or incorrect structure and P(C) is the probability that any structure picked at random is a member of the correct set. To ensure computational feasibility, we make a simplifying assumption that the radii are independent of one another:

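$$P\left(\{r_{abc}^{ijk}\} \mid C\right) = \prod_{ijk} P\left(r_{abc}^{ijk} \mid C\right), \qquad P\left(\{r_{abc}^{ijk}\}\right) = \prod_{ijk} P\left(r_{abc}^{ijk}\right) \tag{2}$$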

Combining Equations (1) and (2) gives

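$$P\left(C \mid \{r_{abc}^{ijk}\}\right) = P(C) \prod_{ijk} \frac{P\left(r_{abc}^{ijk} \mid C\right)}{P\left(r_{abc}^{ijk}\right)} \tag{3}$$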
Equation (3) suggests a scoring function S, which is proportional to the negative log conditional probability that the given structure is correct, given a set of radii of curvature:

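$$S\left(\{r_{abc}^{ijk}\}\right) = -\sum_{ijk} \ln \frac{P\left(r_{abc}^{ijk} \mid C\right)}{P\left(r_{abc}^{ijk}\right)} \;\propto\; -\ln P\left(C \mid \{r_{abc}^{ijk}\}\right) \tag{4}$$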
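
Under the independence assumption, the score of a conformation reduces to a sum over residue triplets of −ln[P(r|C)/P(r)], with each radius first assigned to a distance bin. The sketch below illustrates one way such a sum could be evaluated. It is not the authors' implementation. The names `triplet_score`, `bin_index`, `p_correct` and `p_prior` are hypothetical, and the sketch assumes one representative point per residue, an unordered triplet type, and that triplets whose radius falls beyond the last bin edge (the cutoff) are skipped.

```python
import math
from itertools import combinations


def circumradius(p, q, s):
    """Radius of the circle through three points; inf if collinear."""
    a, b, c = math.dist(q, s), math.dist(p, s), math.dist(p, q)
    h = 0.5 * (a + b + c)
    area_sq = h * (h - a) * (h - b) * (h - c)  # Heron's formula
    if area_sq <= 1e-12:
        return math.inf
    return a * b * c / (4.0 * math.sqrt(area_sq))


def bin_index(r, edges):
    """Index of the first bin whose upper edge exceeds r, else None."""
    for i, upper in enumerate(edges):
        if r < upper:
            return i
    return None  # beyond the distance cutoff (edges[-1])


def triplet_score(coords, types, edges, p_correct, p_prior):
    """Sum of -ln[P(r|C) / P(r)] over all residue triplets within the cutoff.

    p_correct and p_prior map a sorted triplet type to per-bin probabilities.
    Lower scores indicate more native-like conformations.
    """
    score = 0.0
    for i, j, k in combinations(range(len(coords)), 3):
        r = circumradius(coords[i], coords[j], coords[k])
        b = bin_index(r, edges)
        if b is None:
            continue
        key = tuple(sorted((types[i], types[j], types[k])))
        score -= math.log(p_correct[key][b] / p_prior[key][b])
    return score
```

The loop visits every triplet, so the cost grows with the cube of the chain length. This is consistent with the length restriction the authors describe as being primarily for computational efficiency.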

Before one can use Equation (4) as a scoring function, the statistics for the posterior probability $P(r_{abc} \mid C)$ and the prior probability $P(r_{abc})$ need to be compiled. Specifically, to compute the statistics for $P(r_{abc} \mid C)$, we tabulate the radii of curvature generated by residue triplets in a set of experimentally determined conformations available from the Protein Data Bank (PDB) (Westbrook et al., 2003; Bourne et al., 2004). This set of conformations was created by first selecting all proteins that appear in the e-value filtered ASTRAL SCOP genetic domain sequence subset list with the threshold e-value set at $10^{-4}$ (Chandonia et al., 2004). Subsequently, we retained proteins whose lengths are less than 300 residues (primarily for computational efficiency) and removed proteins whose PSI-BLAST e-values are less than 2 with respect to a set of 41 protein sequences we later use for test decoy set generation and scoring function testing. This results in a total of 3150 structures (hereafter denoted as the database of solved protein structures). We then evaluate the quantity

$$P(r_{abc} \mid C) = \frac{N(r_{abc})}{\sum_{r} N(r_{abc})} \tag{5}$$
where $N(r_{abc})$ is the number of occurrences of triplets with residue types a, b and c whose radius of curvature is in the distance bin r. For compilation of the statistics of $P(r_{abc})$, we attempt three approaches in this study. In the first approach, for each protein sequence in the database of the solved protein structures, we use an ab initio conformational space sampling protocol to generate 10 decoy structures, as a result yielding a total of 3150 × 10 = 31 500 decoy structures (hereafter denoted as the database of decoy structures). The ab initio conformational space sampling protocol consists of a Monte-Carlo method with simulated annealing procedure, with a move set based on the standard fragment replacement scheme, namely, the existing conformation of three consecutive residues at a random position is replaced by the torsion values of three residues with identical sequence from an experimentally determined structure (Simons et al., 1997; Hung and Samudrala, 2003). The energy function used to generate the decoys is a combination of the all-atom distance-dependent function, a hydrophobic compactness function and a bad contacts function (Samudrala et al., 1999; Samudrala and Levitt, 2002). We use the database of the 31 500 decoy structures to determine the prior distribution $P(r_{abc})$ analogous to the way the database of the solved structures is used in Equation (5) for the posterior distribution:

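$$P(r_{abc}) = \frac{N_{\mathrm{decoy}}(r_{abc})}{\sum_{r} N_{\mathrm{decoy}}(r_{abc})} \tag{6}$$
where $N_{\mathrm{decoy}}(r_{abc})$ is the count defined as in Equation (5) but tallied over the database of decoy structures. As a second approach, we apply the mixture method described in Samudrala and Moult (1998), i.e. instead of using the database of the 31 500 decoy structures, the database of the 3150 solved structures is employed and averaging is done across the various residue types when determining the prior distribution. Specifically, $P(r_{abc})$ is calculated by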

$$P(r_{abc}) = \frac{\sum_{abc} N(r_{abc})}{\sum_{r} \sum_{abc} N(r_{abc})} \tag{7}$$
where $\sum_{abc} N(r_{abc})$ is the number of contacts among all residue triplets in a particular distance bin r in the database of the solved structures, regardless of residue types. Finally, as a third approach, Equation (7) is again employed to compile the statistics of the prior distribution, i.e. averaging is again performed across the various residue types. However, the compilation is done on the database of the 31 500 decoy structures, instead of the 3150 solved structures.
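
The three ways of compiling the prior differ only in which database supplies the counts and in whether the counts are normalised per triplet type or pooled over all types. A minimal sketch of the bookkeeping follows, assuming the radii have already been assigned to bins. The function names and the `(triplet_type, bin_index)` input format are hypothetical choices made for this illustration. No pseudocounts are applied, which assumes every bin that is later looked up has a non-zero count.

```python
from collections import defaultdict


def compile_counts(observations, n_bins):
    """Tally N(r_abc) from an iterable of (triplet_type, bin_index) pairs."""
    counts = defaultdict(lambda: [0] * n_bins)
    for triplet_type, b in observations:
        counts[triplet_type][b] += 1
    return dict(counts)


def type_specific_distribution(counts):
    """Per-type normalisation: N(r_abc) / sum over bins of N(r_abc)."""
    return {t: [n / sum(row) for n in row] for t, row in counts.items()}


def type_averaged_distribution(counts, n_bins):
    """Pooled normalisation: counts are summed over triplet types first."""
    pooled = [sum(row[b] for row in counts.values()) for b in range(n_bins)]
    total = sum(pooled)
    return [n / total for n in pooled]


if __name__ == "__main__":
    toy = [("AAA", 0), ("AAA", 0), ("AAA", 1), ("AAG", 1)]
    counts = compile_counts(toy, n_bins=2)
    print(type_specific_distribution(counts))
    print(type_averaged_distribution(counts, n_bins=2))
```

In these terms, the first approach would apply `type_specific_distribution` to counts from the decoy database, the second would apply `type_averaged_distribution` to counts from the solved structures, and the third would apply `type_averaged_distribution` to counts from the decoy database.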

Generation of test decoy sets and evaluation of the residue triplet scoring function

To evaluate the performance of the residue triplet scoring function in distinguishing native-like from non-native conformations, we apply it to 41 test decoy sets of various quality. The 41 test decoy sets correspond to 41 protein sequences, some of them taken from the second through fifth Community Wide Experiments on the Critical Assessment of Techniques for Protein Structure Prediction (Moult et al., 1997, 1999, 2001, 2003) and the rest randomly picked from the PDB. Each decoy is generated using the same conformational space sampling protocol described in the preceding sub-section. Each run consists of 100 000 iterations using the fragment replacement move set and yields 10 decoys at the end of the run. One thousand seeds are used to generate 10 000 decoys for each test decoy set.

Table I gives the PDB identifiers and the SCOP classifications of the 41 protein sequences used in generating the test decoy sets. Also included is the Cα root mean squared deviation (RMSD) of the best decoy relative to the corresponding native structure in each test set. Among them, 15 test decoy sets have their best structures below 6 Å Cα RMSD relative to their native conformations. (Twenty-four decoy sets have their best structures below 7 Å Cα RMSD relative to their native conformations.) We denote those 15 sets as the high quality test decoy sets.


Table I. List of the protein sequences used in generating the test decoy sets

 
We use two measures to evaluate the quality of the residue triplet scoring function. The first measure is the enrichment ratio. After the scoring function is applied to a test decoy set, we count the number of decoys (denoted as a) which are in the top 10% both in terms of their residue triplet scores and their Cα RMSD relative to the native structures. The expected number in a random distribution is 10% × 10% × (number of decoys in the set) (denoted as b). The enrichment ratio is a/b. A value above 1 indicates enrichment over the random distribution. The second measure is obtained via receiver-operating characteristic (ROC) analysis. A decoy structure is a priori classified as true positive if its Cα RMSD relative to the native structure is in the top 10% among all the decoys in the test set. The remaining 90% of the decoy structures are classified as true negative. After the residue triplet score has been computed for each decoy in a test set, we start with the best scoring decoy and expand the collection of the ‘native-like’ decoys by adding one decoy at a time. The true positive fraction and the false positive fraction (FPF) are determined for each successive step and plotted against each other to generate the ROC curve. The area under a truncated ROC curve (with 0 ≤ FPF ≤ 0.1 in this study) generated by the residue triplet scoring function (denoted as $A_s$), divided by the expected area under a truncated ROC curve corresponding to the random distribution (denoted as $A_r$), indicates the improvement of the scoring function over the random distribution. The percentage improvement is simply 100% × ($A_s$ − $A_r$)/$A_r$.
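
Both measures can be computed from two parallel lists holding the score and the Cα RMSD of each decoy. The sketch below is one possible reading of the definitions above, not the authors' code. The function names are hypothetical. It assumes that lower scores and lower RMSDs are better, that ties are broken by list order, that the ROC curve is integrated as a step function, and that the random reference is the diagonal on which the true positive fraction equals the FPF.

```python
def top_fraction(values, fraction=0.1):
    """Indices of the best (lowest) `fraction` of `values`."""
    n_top = int(round(fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i])
    return set(order[:n_top])


def enrichment_ratio(scores, rmsds, fraction=0.1):
    """a / b: observed over expected overlap of the two top fractions."""
    a = len(top_fraction(scores, fraction) & top_fraction(rmsds, fraction))
    b = fraction * fraction * len(scores)  # expected count for a random ranking
    return a / b


def truncated_roc_improvement(scores, rmsds, fraction=0.1, max_fpf=0.1):
    """100 * (A_s - A_r) / A_r for the ROC curve truncated at FPF <= max_fpf."""
    positives = top_fraction(rmsds, fraction)
    n_pos = len(positives)
    n_neg = len(scores) - n_pos
    max_fp = int(max_fpf * n_neg + 1e-9)  # false positives allowed before truncation
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    tp = fp = 0
    area = 0.0
    for i in order:  # add one decoy at a time, best score first
        if i in positives:
            tp += 1
        else:
            if fp == max_fp:
                break
            fp += 1
            area += (tp / n_pos) / n_neg  # strip of width 1/n_neg, height TPF
    fpf_end = max_fp / n_neg
    area_random = 0.5 * fpf_end ** 2  # area under the diagonal up to fpf_end
    return 100.0 * (area - area_random) / area_random
```

For a test set of 10 000 decoys, the expected overlap b is 10% × 10% × 10 000 = 100 decoys, so an enrichment ratio of 1.33 would correspond to roughly 133 decoys in both top fractions.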

Selection of the distance cutoff and the number of distance bins

Before one can compile the statistics for the posterior probability $P(r_{abc} \mid C)$ and the prior probability $P(r_{abc})$ using Equations (5)–(7), the distance cutoff, the number of distance bins and the bin sizes have to be fixed. It is not clear a priori what the best values for these parameters are. Thus, we try a number of possibilities in this study. Distance cutoffs from 12 to 16 Å and numbers of bins ranging from 4 to 11 are tested. Bin widths are determined in the following manner: Figure 1 depicts the distribution of the radius of curvature for triplets (regardless of residue types) observed in the database of the solved structures. If, for example, we fix a cutoff distance of 15 Å and the number of bins to be five, then we choose the bin widths in such a way that each bin will have approximately equal area underneath the distribution curve, holding roughly the same number of observed radii. There are of course other ways to choose the bin sizes. We perform the subdivision in this particular manner mainly to restrict the search space for finding reasonably good parameter values.


Fig. 1. Distribution of the radii of curvature for all triplets. Triplets are obtained from a database of solved protein structures and are considered regardless of residue types. In this example, we sub-divide the area under the distribution curve into 5 bins, with the distance cutoff at 15 Å. Each bin has approximately equal area, so each holds roughly the same number of observed radii.
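
Choosing bins of approximately equal area under the distribution is equivalent to placing the bin edges at quantiles of the observed radii below the cutoff. A minimal sketch under that reading follows. The function name `equal_population_edges` and the convention that a radius r belongs to the first bin whose upper edge exceeds r are assumptions made here for illustration.

```python
def equal_population_edges(radii, cutoff, n_bins):
    """Upper bin edges such that each bin holds roughly the same number of radii.

    Only radii below `cutoff` are considered; the last edge is the cutoff itself.
    """
    kept = sorted(r for r in radii if r < cutoff)
    if len(kept) < n_bins:
        raise ValueError("need at least one observed radius per bin")
    # Interior edges sit at the 1/n, 2/n, ... quantiles of the kept radii.
    edges = [kept[(b * len(kept)) // n_bins] for b in range(1, n_bins)]
    edges.append(cutoff)
    return edges


if __name__ == "__main__":
    # Toy data: radii 1..20 with a cutoff of 16 and five bins.
    print(equal_population_edges(list(range(1, 21)), cutoff=16, n_bins=5))
```

Because the edges follow the data, the bins are narrow where radii are frequent and wide where they are rare.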

 

    Results and discussion
 
A good parameter set for configuring the residue triplet scoring function

Figures 2a, 3a and 4a illustrate the various enrichment ratios that the residue triplet scoring functions produce and Figures 2b, 3b and 4b show the corresponding percentage improvement in the truncated ROC measure. (Figures 2–4 extract and summarize data in Supplementary Tables I–III, respectively, available at PEDS online.) Figure 2 illustrates the performance of the scoring functions [hereafter denoted as the residue specific decoy structure based triplet (RSDT) functions] that employ a residue type specific compilation of the prior distribution $P(r_{abc})$ derived from the database of the 31 500 decoy structures. In Figure 3, the scoring functions [hereafter denoted as the residue non-specific solved structure based triplet (RNST) functions] use a residue type non-specific compilation of the prior distribution derived from the database of the 3150 solved structures. In Figure 4, a residue type non-specific compilation of the prior distribution derived from the database of the 31 500 decoy structures is employed in constructing the scoring functions [hereafter denoted as the residue non-specific decoy structure based triplet (RNDT) functions].


Fig. 2. Performance of the RSDT functions. Shown are (a) the average enrichment ratios and (b) the percentage improvement in the ROC measure achieved by the RSDT functions when they are applied to the high quality test decoy sets. The RSDT functions are constructed with a residue type specific compilation of the prior distribution derived from the database of 31 500 decoy structures. Distance cutoffs ranging from 12 to 16 Å and numbers of bins ranging from 4 to 11 are examined. Configurations with a distance cutoff of 14 Å and 7 distance bins, and with a distance cutoff of 14 Å and 8 distance bins, give the best results.

 

Fig. 3. Performance of the RNST functions. Shown are (a) the average enrichment ratios and (b) the percentage improvement in the ROC measure achieved by the RNST functions when they are applied to the high quality test decoy sets. The RNST functions are constructed with a residue type non-specific compilation of the prior distribution derived from the database of 3150 solved structures. Distance cutoffs ranging from 13 to 15 Å and numbers of bins ranging from 4 to 11 are examined. Compared with the RSDT functions in Figure 2, the RNST scoring functions generally perform worse.

 

Fig. 4. Performance of the RNDT functions. Shown are (a) the average enrichment ratios and (b) the percentage improvement in the ROC measure achieved by the RNDT functions when they are applied to the high quality test decoy sets. The RNDT functions are constructed with a residue type non-specific compilation of the prior distribution derived from the database of 31 500 decoy structures. Distance cutoffs ranging from 13 to 15 Å and numbers of bins ranging from 4 to 11 are examined. Compared with the RSDT functions in Figure 2, the RNDT scoring functions generally perform worse.

 
Overall, we focus on the performance of the scoring functions on the high quality test decoy sets (i.e. the 15 test decoy sets that contain structures of less than 6 Å Cα RMSD relative to the native conformations). By comparing Figures 2(a and b), 3(a and b) and 4(a and b), we see that a good set of parameters for the residue triplet scoring function is a distance cutoff of 14 Å with 7 distance bins (alternatively, a distance cutoff of 14 Å with 8 bins gives similar performance), with the prior distribution $P(r_{abc})$ generated with a residue type specific compilation of the database of the 31 500 decoy structures. This produces an enrichment ratio of ~1.33 and an ROC improvement of ~45%. Analysis based on the standard leave-one-out cross-validation yields similar results, with an average enrichment ratio of 1.32 and an average ROC improvement of 42%. For test decoy sets of lesser quality, this particular configuration of the scoring function maintains the overall enrichment ratio above 1.21 and the ROC improvement above 30% [numerical values detailed in Supplementary Tables Ia(ii–v) and Ib(ii–v) available at PEDS online].

Choice of the prior distribution

By inspecting Figures 2–4, we observe that switching from a prior distribution $P(r_{abc})$ generated with a residue type specific compilation of the database of decoy structures to one generated with a residue non-specific compilation, of either the database of solved structures or the database of decoy structures, generally depresses the performance of the residue triplet scoring function. For example, for the high quality decoy sets, the best enrichment ratios are ~1.15 (Figure 3a) and ~1.18 (Figure 4a) and the best ROC improvements are ~17% (Figure 3b) and ~23% (Figure 4b) for the functions configured with the latter two prior distributions. These values are lower than the enrichment ratio of 1.33 and the ROC improvement of 45% for the RSDT function.

The best performing RSDT, RNST and RNDT scoring functions are selected from Figures 2–4 and their enrichment ratios and ROC percentage improvements are plotted in Figures 5 and 6 across test decoy sets of various quality. The performance differences among the RSDT, RNST and RNDT functions depicted in these two figures indicate that a residue type specific derivation of the prior distribution can boost the accuracy of the scoring function over one based on a residue type non-specific derivation. Furthermore, according to the figures, the performance of the RNDT function seems to be slightly better than that of the RNST function. This observation suggests the importance of using the same conformational space sampling protocol for creating test decoy sets as well as for generating the database of decoy structures for prior distribution derivation, at least in the context of constructing the residue triplet scoring function. Despite the above-mentioned disadvantage, the RNST scoring function is still useful in instances where a priori information about the conformational space sampling protocol used in generating the test decoy sets is either not known or not utilized, since only the database of solved structures is needed in compiling the statistics of the prior distribution. A good way to further explore and understand the comparative effectiveness of the three approaches for prior distribution estimation is to study them in the context of other knowledge-based functions (for example, in the construction of the pairwise residue distance-dependent scoring function).


Fig. 5. Performance of the various types of residue triplet scoring functions. Triplet functions are evaluated using the average enrichment ratios on test decoy sets of various quality. For example, the circle at coordinate (6 Å, 1.332) indicates that the RSDT function configured with a distance cutoff of 14 Å and 7 distance bins achieves an average enrichment ratio of 1.332 for the test decoy sets that contain structures of less than 6 Å Cα RMSD relative to the native conformations. From Figure 3a, we select the best performing RNST scoring function. The left-pointing triangles in the current figure indicate the average enrichment ratios achieved by that function. The best performing RNDT scoring function is analogously chosen from Figure 4a and is represented by the stars in the current figure. We also include the performance of one other scoring function in the figure. The downward pointing triangles correspond to the all-atom distance-dependent conditional probability discriminatory function, a two-body potential. Overall, the RSDT functions give the best performance.

 

Fig. 6. Performance of the various types of residue triplet scoring functions. Triplet functions are evaluated using the average ROC percent improvements on test decoy sets of various quality. For example, the circle at coordinate (6 Å, 44.5%) indicates that the RSDT function configured with a distance cutoff of 14 Å and 7 distance bins achieves an average percent improvement of 44.5% for the test decoy sets that contain structures of less than 6 Å Cα RMSD relative to the native conformations. From Figure 3b, we select the best performing RNST scoring function. The left-pointing triangles in the current figure indicate the average percent improvement achieved by that function. The best performing RNDT function is analogously chosen from Figure 4b and is represented by the stars in the current figure. We also include the performance of one other scoring function in the figure. The downward pointing triangles correspond to the all-atom distance-dependent conditional probability discriminatory function, a two-body potential. Overall, the RSDT functions give the best performance.

 
Comparing the performance of the residue triplet scoring function to other established functions

To provide a rough yardstick for measuring the performance of the residue triplet scoring function, we apply the all-atom distance-dependent conditional probability discriminatory function [denoted as the RAPDF function in Samudrala and Moult (1998)] to the 41 test decoy sets. The RAPDF function has been studied and compared with other functions in the literature [e.g. see Lu and Skolnick (2001), de Bakker et al. (2003) and Zhang et al. (2004)]. In the present study, this function is compiled with the database of the 3150 solved structures. The resulting enrichment ratios and percentage improvements in the ROC measure for the RAPDF function are shown in Figures 5 and 6, respectively. These figures show that the residue triplet functions with the configuration of a distance cutoff of 14 Å with 7 bins and of a distance cutoff of 14 Å with 8 bins both perform reasonably well in comparison.

In addition, we apply a local-triplet (LT) scoring function described in Lezon et al. (2004) to the test decoy sets. The LT function uses a specially designed five-letter alphabet to represent the Ramachandran angles and evaluates a given decoy with a two-step process, in which a sequence–structure and a structure–structure mapping of the LTs are performed. It has been shown to produce good results in the fold recognition of coarse-grained protein tertiary structures. In the present study, for the high quality test decoy sets, this function yields an average enrichment ratio of 1.10 and an average ROC percent improvement of 18.1%. Comparing these results with Figures 5 and 6 again confirms that the residue triplet functions with the configuration of a distance cutoff of 14 Å with 7 bins and of a distance cutoff of 14 Å with 8 bins perform well.

Examination of low counts

In order for the posterior probabilities $P(r_{abc} \mid C)$ estimated with Equation (5) and the prior probabilities $P(r_{abc})$ estimated with Equation (6) to be statistically meaningful, there need to be sufficient counts for the denominator $\sum_{r} N(r_{abc})$ for each residue triplet type (a,b,c). Our results indicate that for the RSDT function with a distance cutoff of 14 Å and 7 distance bins, in the posterior probabilities estimation based on the database of the solved structures, the triplet type tryptophan–tryptophan–tryptophan has a count of 4717, the lowest among all triplet types. With 7 distance bins, this gives an average of ~674 counts per bin. For the prior probabilities estimation based on the database of the decoy structures, the triplet type tryptophan–tryptophan–tryptophan has a count of 48 177, again the lowest among all triplet types. With 7 distance bins, this gives an average of ~6882 counts per bin. Thus, in both cases, the counts are sufficiently high for Equations (5) and (6) to provide statistically valid estimates of the respective probabilities. Similar low count results are also obtained for the RNST and RNDT functions.
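
The per-bin averages quoted above follow from dividing each total count evenly over the 7 distance bins. The even split is only an approximation for a single triplet type, since the bins are chosen to be roughly equally populated over all triplets rather than over each type separately:

```latex
\frac{4717}{7} \approx 674, \qquad \frac{48\,177}{7} \approx 6882
```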


    Conclusion
 
In this study, we construct and analyze a residue triplet knowledge-based scoring function. The scoring function is inspired by the previous work of Banavar and colleagues, who studied chain folding using a physical/geometric approach in which the inputs to their Lennard-Jones type potential were the radii of curvature of residue triplets. Their computer simulations showed a number of interesting results, e.g. naturally obtaining ground states with protein-like local structures, such as helices with a specific pitch-to-radius ratio, sheets and hairpins.

Our formulation of the residue triplet scoring function follows the standard approach used in constructing the pairwise residue distance-dependent potential, with two modifications: (i) the two-body potential is replaced by a three-body one and (ii) the pairwise distances are replaced by the radii of curvature corresponding to residue triplets. Three different approaches for estimating the prior distribution of the radius of curvature are tested. Also tested are the use of various distance cutoffs and numbers of bins in constructing the knowledge-based potential. To evaluate the performances of the various possible configurations, we generate 41 test decoy sets of different quality and apply the various configurations of the scoring function on the test decoy sets. Our numerical experiments show that a distance cutoff of 14 Å, with either 7 or 8 distance bins and with the statistics of the prior distribution of the radius of curvature derived from a database of decoy structures in a residue type specific manner, produces good results.

We discuss briefly some possible modifications and extensions to the current form of the residue triplet scoring function. First, instead of using a uniform 14 Å distance cutoff across the different residue types, the distance cutoff can be chosen in a residue type specific manner. That is, for given residue triplets of specific residue types a, b and c, one can compile the statistics and observe the log-odds score $S(r_{abc})$ of such a triplet type as a function of the radius of curvature $r_{abc}$. A good cutoff value for the triplet type will correspond to the radius of curvature at which this function decays to zero. Second, the residue-based function can be extended to an all-atom form. Using a detailed atomic description for protein conformations may yield a more accurate scoring function for discriminating native-like from non-native conformations. Third, as suggested in Banavar et al. (2003b), residue quadruplets instead of triplets can be used to construct an analogous scoring function. In such a case, the radius of curvature will be replaced by the radius of the sphere formed by four residues. Both the second and the third extensions require increased computing power, but they are still computationally tractable for small proteins with sizes <120 residues. Finally, we note that in the residue triplet formulation, a large radius of curvature can be generated either by three neighboring residues subtending an angle close to 180°, or by three residues distant from one another and forming an equilateral triangle. The fact that the two configurations are not distinguishable in the triplet formulation suggests that it is beneficial to combine the residue triplet scoring function with a two-body distance-based scoring function to further enhance the decoy discrimination ability. A detailed study of how to combine the residue triplet function with other potentials will be presented elsewhere.
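
The ambiguity between the two configurations can be made concrete with the law of sines. The worked example below is an illustration only. The numbers (a spacing of d = 3.8 Å between neighboring points and an angle of 150°) are assumed for the example and are not taken from the study.

```latex
% Circumradius from the law of sines: side c lies opposite the angle \gamma
R = \frac{c}{2\sin\gamma}

% (i) Three points with two equal sides d meeting at an angle \theta;
%     the end-to-end chord c = 2d\sin(\theta/2) lies opposite \theta:
R_{\text{chain}} = \frac{2d\sin(\theta/2)}{2\sin\theta} = \frac{d}{2\cos(\theta/2)}
\;\longrightarrow\; \infty \quad \text{as } \theta \to 180^\circ

% (ii) Equilateral triangle with side s (\gamma = 60^\circ):
R_{\text{equi}} = \frac{s}{2\sin 60^\circ} = \frac{s}{\sqrt{3}}

% Illustrative numbers (assumed): d = 3.8 Å, \theta = 150^\circ
R_{\text{chain}} = \frac{3.8}{2\cos 75^\circ} \approx 7.3\ \text{Å}
\quad\Longleftrightarrow\quad s = \sqrt{3}\,R_{\text{chain}} \approx 12.7\ \text{Å}
```

Under these assumed numbers, a nearly straight run of three close points and an equilateral arrangement of three points about 12.7 Å apart share the same radius of about 7.3 Å, so they would fall in the same bin. A two-body distance term would separate them, since the pairwise distances differ by more than a factor of three.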


    Acknowledgements
 
The authors thank all the members of the Samudrala Group and the anonymous reviewers for their insightful suggestions for improving the content of the manuscript. The authors also thank Mr Tim Lezon and Prof. Jayanth Banavar for their helpful discussions and for providing data files and programs for implementing their scoring function. This work is supported in part by a Searle Scholar Award, an NSF CAREER award, NSF grant DBI-0217241 and NIH grant GM068152-01 to R.S., as well as the University of Washington's Advanced Technology Initiative in Infectious Diseases. Funding to pay the Open Access publication charges for this article was provided by the Searle Award.


    References
 
Bajorath,J., Stenkamp,R. and Aruffo,A. (1994) Protein Sci., 2, 1798–1810.

Banavar,J.R., Maritan,A., Micheletti,C. and Trovato,A. (2002) Proteins, 47, 315–322.

Banavar,J.R., Flammini,A., Marenduzzo,D., Maritan,A. and Trovato,A. (2003a) ComPlexUs, 1, 4–13.

Banavar,J.R., Gonzalez,O., Maddocks,J.H. and Maritan,A. (2003b) J. Stat. Phys., 110, 35–50.

Blundell,T.L., Sibanda,B.L., Sternberg,M.J.E. and Thornton,J.M. (1987) Nature, 326, 347–352.

Bourne,P.E. et al. (2004) Nucleic Acids Res., 32, D223–D225.

Brooks,B., Bruccoleri,R., Olafson,B., States,D., Swaminathan,S. and Karplus,M. (1983) J. Comput. Chem., 4, 187–217.

Chandonia,J.M., Hon,G., Walker,N.S., LoConte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2004) Nucleic Acids Res., 32, D189–D192.

Cornell,W.D., Cieplak,P., Bayly,C.I., Gould,I.R., Merz,K.M.Jr, Fergusson,D.M., Spellmeyer,D.C., Fox,D.C., Caldwell,J.W. and Kollman,P.A. (1995) J. Am. Chem. Soc., 117, 5179–5197.

de Bakker,P.I.W., DePristo,M.A., Burke,D.F. and Blundell,T.L. (2003) Proteins, 51, 21–40.

DeBolt,S.E. and Skolnick,J. (1996) Protein Eng., 8, 637–655.

Friesner,R.A. and Gunn,J.R. (1996) Annu. Rev. Biophys. Biomol. Struct., 25, 315–342.

Gilis,D. and Rooman,M. (1996) J. Mol. Biol., 257, 1112–1126.

Hung,L.H. and Samudrala,R. (2003) Nucleic Acids Res., 31, 3296–3299.

Jernigan,R.L. and Bahar,I. (1996) Curr. Opin. Struct. Biol., 6, 195–209.

Johnson,M.S., Srinivasan,N., Sowdhamini,R. and Blundell,T.L. (1994) Crit. Rev. Biochem. Mol. Biol., 29, 1–68.

Jones,D.T. (1997) Curr. Opin. Struct. Biol., 7, 377–387.

Jorgensen,W. and Tirado-Rives,J. (1988) J. Am. Chem. Soc., 110, 1657–1666.

Levitt,M., Gerstein,M., Huang,E., Subbiah,S. and Tsai,J. (1999) Annu. Rev. Biochem., 66, 1368–1372.

Lezon,T., Banavar,J.R. and Maritan,A. (2004) Proteins, 55, 536–547.

Lu,H. and Skolnick,J. (2001) Proteins, 44, 223–232.

MacKerell,A.D. Jr et al. (1998) J. Phys. Chem. B, 102, 3586–3616.

Maritan,A., Micheletti,C., Trovato,A. and Banavar,J. (2000) Nature, 406, 287–290.

Moult,J., Hubbard,T., Bryant,S.H., Fidelis,K. and Pedersen,J.T. (1997) Proteins, 29, 2–6.

Moult,J., Hubbard,T., Fidelis,K. and Pedersen,J.T. (1999) Proteins, 37, 2–6.

Moult,J., Fidelis,K., Zemla,A. and Hubbard,T. (2001) Proteins, 45, 2–7.

Moult,J., Fidelis,K., Zemla,A. and Hubbard,T. (2003) Proteins, 53, 334–339.

Nemethy,G., Gibson,K.D., Palmer,K.A., Yoon,C.N., Paterlini,G., Zagari,A., Rumsey,S. and Scheraga,H.A. (1992) J. Phys. Chem., 96, 6472–6484.

Sali,A. (1995) Curr. Opin. Biotech., 6, 437–451.

Samudrala,R. and Levitt,M. (2002) BMC Struct. Biol., 2, 3–18.

Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 275, 895–916.

Samudrala,R., Xia,Y., Levitt,M. and Huang,E.S. (1999) In Altman,R., Dunker,K., Hunter,L., Klein,T. and Lauderdale,K. (eds), Proceedings of the Pacific Symposium on Biocomputing. World Scientific Press, Singapore, pp. 505–516.

Sanchez,R. and Sali,A. (1997) Curr. Opin. Struct. Biol., 7, 206–214.

Simons,K.T., Kooperberg,C., Huang,E. and Baker,D. (1997) J. Mol. Biol., 268, 209–225.

Sippl,M. (1995) Curr. Opin. Struct. Biol., 5, 229–235.

Weiner,S., Kollman,P., Nguyen,D. and Case,D. (1986) J. Comput. Chem., 7, 230–252.

Westbrook,J., Feng,Z., Chen,L., Yang,H. and Berman,H.M. (2003) Nucleic Acids Res., 31, 489–491.

Wodak,S. and Rooman,M. (1993) Curr. Opin. Struct. Biol., 3, 247–259.

Zhang,C., Liu,S. and Zhou,Y. (2004) Protein Sci., 13, 391–399.

Zhang,C., Vasmatzis,G., Cornette,J.L. and DeLisi,C. (1997) J. Mol. Biol., 267, 707–726.

Received August 23, 2005; revised December 30, 2005; accepted January 9, 2006.

Edited by Janet Thornton

