


PEDS Advance Access published online on February 20, 2008

Protein Engineering Design and Selection, doi:10.1093/protein/gzn003

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

An improved prediction of catalytic residues in enzyme structures

Yu-Rong Tang1, Zhi-Ya Sheng1,2, Yong-Zi Chen and Ziding Zhang3

Bioinformatics Center, College of Biological Sciences, China Agricultural University, Beijing 100094, China

3 To whom correspondence should be addressed. E-mail: zidingzhang{at}cau.edu.cn


    Abstract
Protein databases contain a huge number of proteins of unknown function, including many with newly determined 3D structures resulting from the Structural Genomics Projects. To accelerate experiment-based assignment of function, de novo prediction of protein functional sites, such as active sites in enzymes, is becoming increasingly important. Here, we attempted to improve the prediction of catalytic residues in enzyme structures by seeking and refining different encodings (i.e. residue properties) and by employing new machine learning algorithms. In particular, because catalytic residues often show a characteristic network centrality when an enzyme structure is represented as a residue contact network, the corresponding measurement (i.e. closeness centrality) was used as one of the most important encodings in our new predictor. A genetic algorithm integrated neural network (GANN) was also employed. With these strategies, our GANN predictor achieved a high accuracy of 91.2% in the prediction of catalytic residues on balanced datasets (i.e. a 1:1 ratio of catalytic to non-catalytic residues). When the GANN method was optimally applied to real enzyme structures, 73.9% of the tested structures had the active site correctly located. The proposed GANN method also performed better than two existing methods.

Keywords: catalytic residues/closeness centrality/genetic algorithm/neural network/prediction


    Introduction
Providing functional annotation is one of the major tasks in protein bioinformatics today, given the considerable accumulation of protein sequence and structure data (Shapiro and Harris, 2000; Gutteridge et al., 2003; Ofran et al., 2005). For a query enzyme, the identification of catalytic residues is one of the most important steps towards understanding its biological roles and exploring its applications. In particular, the identified catalytic residues can greatly help in performing enzyme-targeted drug design, understanding the catalytic mechanism of enzyme reactions and constructing metabolic pathways (Bartlett et al., 2002; Chou and Cai, 2004; Porter et al., 2004).

Sequence and structural similarity based methods are two classical bioinformatics strategies widely used to identify catalytic residues in a query enzyme. The sequence similarity based method requires the identification of homologous enzyme sequences with known catalytic residues. The catalytic residues of an identified homolog can then be transferred to the query sequence. However, in some cases this method can be misleading because enzyme functions are less conserved than sequences (Todd et al., 2001; Rost, 2002; Tian and Skolnick, 2003). The structural similarity based method can identify catalytic residues even when no clear sequence similarity is detectable, provided that the 3D structure of the query enzyme is available (Orengo et al., 1999). By mapping the catalytic residues of a structural homolog onto the query enzyme, such ‘structure-based functional annotation’ can offer in-depth insight, often by highlighting the 3D arrangement of catalytic residues. Even so, the power of structure-based annotation is often weakened by the fact that a similar fold does not necessarily imply a similar function (Nagano et al., 2002).

It is well accepted that proteins without detectable sequence or structural similarity may have the same configuration of active sites for catalyzing similar reactions (i.e. convergent evolution) (Torrance et al., 2005; Zhang and Grigorov, 2006; Zhang and Tang, 2007). Complementary to sequence or structural similarity based methods, therefore, several methods have been developed that focus only on the local pattern of active sites and recognize catalytic residues by comparing query structures with active site templates of known enzymes (Torrance et al., 2005; Goyal et al., 2007). With the accumulation of enzyme structures deposited in the PDB database (Berman et al., 2000), the sequence and structural characteristics of catalytic residues have been intensively investigated (Bartlett et al., 2002; Amitai et al., 2004; Bate and Warwicker, 2004; Ben-Shimon and Eisenstein, 2005; del Sol et al., 2006; Chea and Livesay, 2007). Meanwhile, de novo prediction methods (i.e. strategies independent of sequence alignment, structural comparison, or active-site matching) have also been developed to identify catalytic residues in enzyme structures. For example, some methods based on sequence or structural properties have been reported to achieve quite high accuracy (Chou and Cai, 2004; Ko et al., 2005), although these methods have only been tested on a specific enzyme family or a small number of proteins.

Because they can incorporate different sequence or structural properties into a single predictor, machine learning algorithms such as the artificial neural network (ANN) and the support vector machine (SVM) have also been used for the de novo prediction of catalytic residues in heterogeneous enzymes (Gutteridge et al., 2003; Petrova and Wu, 2006; Youn et al., 2007). Compared with other machine learning based prediction tasks in protein bioinformatics, this important topic has received relatively little attention, and there is still considerable room for improvement.

In the present study, we sought to improve the prediction of catalytic residues through the following two strategies. First, many available encoding schemes were evaluated to select a subset of useful encodings. In particular, we transformed each enzyme structure into a residue interaction network (Greene and Higman, 2003), in which catalytic residues show a characteristic closeness centrality (Amitai et al., 2004; del Sol et al., 2006). To the best of our knowledge, such information has not been incorporated in previously published machine learning based predictors. Secondly, in addition to the SVM algorithm, a genetic algorithm integrated neural network (GANN) was employed. The central idea of GANN is to use a genetic algorithm (GA) to optimize the connection weights of a neural network. Compared with an ANN trained with the standard back propagation algorithm, GANN generally achieves better performance in many applications (Cho, 1999; Fish et al., 2004; Tang et al., 2007). In our recent publication on the prediction of protein phosphorylation sites, GANN even outperformed SVM (Tang et al., 2007). In this paper, we report in detail how these two strategies were combined to improve the prediction of catalytic residues.


    Methods
Dataset

To facilitate a comparison of different predictors, the enzyme dataset originally compiled by Petrova and Wu (2006) was also used in the present study. Containing 79 protein domains, this dataset covered all 6 top level enzyme classifications (78 unique EC numbers) and 77 SCOP families. Since sequence redundancy had been removed, there was no significant sequence similarity between any pair of sequences within this dataset (Petrova and Wu, 2006). The corresponding PDB files for these 79 structures were retrieved from the SCOP database (Murzin et al., 1995) (release 1.71, http://scop.mrc-lmb.cam.ac.uk/scop/), and active site annotation was taken from the Catalytic Site Atlas (Porter et al., 2004) (http://www.ebi.ac.uk/thornton-srv/databases/CSA/), giving 240 catalytic residues in all.

Encoding of residue properties

To construct a machine learning based catalytic residue predictor, residue properties must be converted into input feature vectors (i.e. encodings). The residue properties evaluated here covered residue type, sequence conservation, network centrality, relative position, hydrogen bonding, solvent accessibility, flexibility, and secondary structure. Properties represented by characters or strings (residue type, relative position, and secondary structure) were converted into binary codes, while the remaining real-valued scores were used directly as predictor input. These encodings are described in more detail below.

Residue type. Different amino acids evidently have different propensities to be catalytic residues (Bartlett et al., 2002). Two encodings were used to represent this property. In the first encoding, named AA_Type20, each of the 20 amino acids was encoded with a 20-dimensional binary vector, e.g. A (1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0), C (0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0), ..., Y (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1). The second encoding, called AA_Type3, was based on a three-type classification of the 20 amino acids, in which charged (DEKHR), polar (CNQSTY), and hydrophobic residues (AFGILMPVW) were encoded as (0 0), (0 1), and (1 0), respectively.
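The two residue-type encodings can be sketched in Python as follows. This is only an illustration: the text fixes A as the first position, C as the second and Y as the last, so the alphabetical ordering of the remaining one-letter codes is an assumption, as are the function names.

```python
# One-letter codes in alphabetical order. Only the positions of A, C and Y are
# given in the text; the ordering of the others is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Three-class grouping and two-bit codes as given in the text.
AA_TYPE3_CLASSES = {
    "charged": ("DEKHR", (0, 0)),
    "polar": ("CNQSTY", (0, 1)),
    "hydrophobic": ("AFGILMPVW", (1, 0)),
}


def aa_type20(residue: str) -> list[int]:
    """Return the 20-dimensional one-hot vector for a one-letter residue code."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown codes
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


def aa_type3(residue: str) -> tuple[int, int]:
    """Return the two-bit class code (charged, polar, or hydrophobic)."""
    for members, code in AA_TYPE3_CLASSES.values():
        if residue in members:
            return code
    raise ValueError(f"unknown residue: {residue!r}")
```

AA_Type20 contributes 20 input dimensions and AA_Type3 contributes 2, which is how the dimension counts quoted for the selected encodings add up.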

Sequence conservation. One of the most important characteristics of catalytic residues is that they are highly conserved. Generally, they are more conserved not only than average residues, but also than other functional residues, such as those involved in binding substrates (Bartlett et al., 2002; Porter et al., 2004). To compute the conservation score for a residue, a BLAST search (Altschul et al., 1997) with the corresponding sequence was performed against the NCBI non-redundant protein sequence database (the version of 09-03-2007) with an E-value cut-off of 10⁻⁵ to obtain a multiple sequence alignment (MSA). The MSA was then submitted to the Scorecons server (Valdar, 2002) (http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/valdar/scorecons_server.pl) to score residue conservation with default parameters. Finally, the conservation score, called cons, was used as the sequence conservation based encoding. In some cases, the BLAST search returned more than 400 hits. To accelerate processing by the Scorecons server, Cd-hit (Li and Godzik, 2006) was used to filter these hits with an adjustable sequence identity cut-off until fewer than 400 hits remained. In other cases, there might be fewer than 10 hits. To include enough sequences for a more reliable calculation of conservation, a three-iteration PSI-BLAST search (Altschul et al., 1997) was run with an E-value cut-off of 10⁻²⁰ for including sequences in the position specific scoring matrix model.

Network centrality. Residues in the active site, or in direct contact with it, usually have more interactions with other residues, so the centrality values of catalytic residues in the network representation of enzyme structures are typically high, especially the closeness centrality (Amitai et al., 2004; del Sol et al., 2006; Chea and Livesay, 2007). To derive these network centrality based encodings, each structure was transformed into an undirected residue interaction graph. Residues were modeled as vertices in the graph, and an edge was added between a pair of vertices if the shortest distance between any pair of atoms from the two residues was no more than 5.0 Å. In other words, residues i and j were considered to share an edge if at least one atom from residue i was within 5.0 Å of an atom from residue j. With the established network, three encodings (NCC_nw, NCC_ww, and NDC) were employed to measure network centrality, as briefly described below.

First, the closeness centrality score CC_nw_i for any residue i within a network was calculated as

\[ CC\_nw_i = \frac{n - 1}{\sum_{j \neq i} d_{ij}} \qquad (1) \]

where n was the total number of vertices in the graph and d_ij was the shortest path distance between vertices i and j, calculated using the Dijkstra algorithm (del Rio et al., 2001). The score was normalized over the entire structure as

\[ NCC\_nw_i = \frac{CC\_nw_i - \overline{CC\_nw}}{\sigma(CC\_nw)} \qquad (2) \]

where NCC_nw_i was the normalized closeness centrality score for residue i, \(\overline{CC\_nw}\) was the average value of closeness centrality over all residues, and σ(CC_nw) was the standard deviation. In the above calculation, no weight was assigned to any edge within the network graph, i.e. d_ij was equal to the number of edges on the shortest path from vertex i to j. The NCC_nw encoding is therefore the normalized closeness centrality score without weight. We also weighted each edge by the shortest distance between the two corresponding residues to construct the NCC_ww encoding, i.e. the normalized closeness centrality score with weight, which was calculated using the same equations except that d_ij was equal to the sum of the weights on the edges along the shortest path from vertex i to j.
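A minimal Python sketch of the unweighted NCC_nw calculation, under stated assumptions: closeness is taken in its usual form, (n − 1) divided by the sum of shortest path lengths from residue i; the standard deviation is the population form; and the input layout (one list of atom coordinates per residue) and all function names are hypothetical. With unit edge weights a breadth-first search gives the same path lengths as the Dijkstra algorithm; the weighted NCC_ww variant would need Dijkstra proper and is not shown.

```python
import math
from collections import deque
from statistics import mean, pstdev

CONTACT_CUTOFF = 5.0  # Angstrom, as stated in the text


def build_contact_graph(residues):
    """residues: list of residues, each a list of (x, y, z) atom coordinates.

    Returns an adjacency list: graph[i] is the set of residues in contact with i.
    """
    n = len(residues)
    graph = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # An edge exists if any atom pair is within the cutoff.
            if any(math.dist(a, b) <= CONTACT_CUTOFF
                   for a in residues[i] for b in residues[j]):
                graph[i].add(j)
                graph[j].add(i)
    return graph


def closeness(graph, source):
    """Unweighted closeness: (n - 1) / sum of shortest path lengths from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(graph):
        raise ValueError("graph is disconnected; closeness is undefined here")
    return (len(graph) - 1) / sum(dist.values())


def z_scores(values):
    """Normalize scores over one structure: (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # population SD is an assumption
    return [(x - mu) / sigma for x in values]
```

Given a contact graph, `z_scores([closeness(graph, i) for i in range(len(graph))])` yields one normalized score per residue. The pairwise loop is quadratic in the number of residues, which is acceptable for a sketch but would normally be replaced by a spatial index for large structures.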

Additionally, the NDC encoding, i.e. the normalized degree centrality score, was also derived. For each residue i, NDC_i was defined as

\[ NDC_i = \frac{DC_i - \overline{DC}}{\sigma(DC)} \qquad (3) \]

where DC_i was the degree centrality, defined as the number of edges connecting to vertex i, \(\overline{DC}\) was the average value of degree centrality over all residues, and σ(DC) was the corresponding standard deviation.

Relative position. Active sites in almost all enzymes reside in clefts (Bartlett et al., 2002; Tseng and Liang, 2007). Therefore, the cleft environment was used here to represent the relative position of a given residue. First, all clefts in a given structure were assigned by SURFNET (Laskowski, 1995). As described by Gutteridge et al. (2003), the relative position of a residue was then divided into four categories according to the size of the cleft in which it was located. Finally, the Cleft encoding for a residue was assigned, i.e. lying in the largest cleft (1 0 0 0), the second or third largest (0 1 0 0), the fourth to ninth largest (0 0 1 0) or none of the above clefts (0 0 0 1).

Hydrogen bonding. Most catalytic residues act as donor or acceptor in at least one hydrogen bond. In particular, hydrogen bonds from main chain atoms to other residues in a protein are important in maintaining the conformation of these catalytic residues (Bartlett et al., 2002). Hydrogen bonds were calculated using HBPLUS (McDonald and Thornton, 1994), and the following three parameters were used to represent this property: NmHB is the number of hydrogen bonds from a main-chain atom in a given residue to any other atom in the protein, NsHB is the number of hydrogen bonds from a side-chain atom in a given residue to any other atom in the protein, and tNHB is the total number of hydrogen bonds involving any atom in a given residue.

Relative solvent accessibility. It has been well established that catalytic residues are generally more exposed to solvent than non-catalytic residues. Accordingly, we calculated the relative solvent accessibility (RSA) of residues with NACCESS (Hubbard and Thornton, 1993), and five RSA based encodings were constructed: AaRSA is the RSA of all atoms; TsRSA is the RSA of all side chain atoms, including alpha carbons; NpRSA is the RSA of non-polar side chain atoms (i.e. all atoms other than oxygen and nitrogen in the side chain); ApRSA is the RSA of all polar side chain atoms (i.e. all oxygen and nitrogen atoms in the side chain); and McRSA is the RSA of all main chain atoms.

Structural flexibility. Catalytic residues are often more rigid than average residues in an enzyme structure (Bartlett et al., 2002; Yuan et al., 2003). Here, two normalized B-factor based encodings (NBf_RES and NBf_CA) were calculated to measure residue flexibility. NBf_RES was the normalized B-factor of a residue, given by

\[ NBf\_RES = \frac{B_{RES} - \overline{B_{RES}}}{\sigma(RES)} \qquad (4) \]

where B_RES was the average B-factor over all atoms in a residue, \(\overline{B_{RES}}\) was the average B_RES over all residues, and σ(RES) was the corresponding standard deviation. NBf_CA was the normalized B-factor of the Cα atom, defined as

\[ NBf\_CA = \frac{B_{CA} - \overline{B_{CA}}}{\sigma(CA)} \qquad (5) \]

where B_CA was the B-factor of the Cα atom in a residue, \(\overline{B_{CA}}\) was the average value over the Cα atoms of all residues, and σ(CA) was the corresponding standard deviation.

Secondary structure. It is well known that catalytic residues are more inclined to be located in coil regions (Bartlett et al., 2002). Therefore, secondary structure information may be helpful in catalytic residue prediction. DSSPcont (Carter et al., 2003) was used to assign the secondary structure state. The structural categories generated by DSSPcont are 3₁₀-helix (G), α-helix (H), π-helix (I), β-strand (E), isolated β-bridge (B), turn (T), bend (S), and other. In this paper, these eight states were simplified to helix = {G, H, I}, sheet = {E, B}, and coil = {T, S and other}. For each residue, the SS3 (Three-State Secondary Structure) based encoding was assigned, i.e. (0 0) for helix, (0 1) for sheet, and (1 0) for coil.
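The eight-to-three state reduction and the SS3 code can be written as a pair of lookup tables. The sketch below is illustrative only; the dictionary and function names are hypothetical, and treating every unlisted symbol as 'other' is an assumption about how the assignment output is represented.

```python
# One-letter states mapped to the three classes named in the text.
SS3_CLASS = {"G": "helix", "H": "helix", "I": "helix",
             "E": "sheet", "B": "sheet",
             "T": "coil", "S": "coil"}

# Two-bit codes as given in the text.
SS3_CODE = {"helix": (0, 0), "sheet": (0, 1), "coil": (1, 0)}


def ss3_encoding(state: str) -> tuple[int, int]:
    """Map an eight-state assignment to the two-bit SS3 code.

    Any symbol other than G, H, I, E, B, T, S is treated as 'other' (coil).
    """
    return SS3_CODE[SS3_CLASS.get(state, "coil")]
```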

Training and testing

Testing based on balanced datasets. To validate the performance of different encodings and different machine learning methods, the ratio of positive instances (i.e. catalytic residues) to negative instances (non-catalytic residues) was initially set to 1:1. A 10-fold cross-validation was performed. Since the number of available non-catalytic residues is much larger than that of catalytic residues, five different negative sets were randomly selected to train and test a predictor for a reliable assessment.
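The construction of the five balanced datasets can be sketched as below. The text does not say whether the five negative sets were allowed to share residues, so independent draws (which may overlap) are an assumption here, as are the function name and the fixed random seed.

```python
import random


def balanced_sets(positives, negatives, n_sets=5, seed=0):
    """Pair all positives with n_sets random negative samples of equal size.

    Each negative sample is drawn independently without replacement, so
    different samples may share residues (an assumption, see above).
    """
    rng = random.Random(seed)
    return [(list(positives), rng.sample(negatives, len(positives)))
            for _ in range(n_sets)]
```

Each returned pair would then go through its own 10-fold cross-validation, with the reported accuracy averaged over the five runs.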

First, an integrated SVM program named LIBSVM (Chang and Lin, 2001) was used with default parameters to evaluate each encoding. The kernel function applied here was the radial basis function. Secondly, the feature selection tool based on LIBSVM (Chang and Lin, 2001) was employed to find the optimal subset of properties, which turned out to be eight encodings with a total dimension of 30 in this work.

Moreover, we passed the best property subset to GANN, in which a GA was used to optimize the connection weights of an ANN over the training dataset. The current GANN contains one input layer, one hidden layer, and one output layer. To obtain the optimized connection weights, a four-step genetic process was applied. First, an initial population of chromosomes is randomly created in the first generation. Each chromosome encodes a weight vector of the neural network. Secondly, a fitness value is assigned to each chromosome in the current generation. The fitness function (f) for the GA is defined as the Matthews correlation coefficient (MCC). Thirdly, three operators (SELECTION, CROSSOVER, and MUTATION) are applied to the chromosomes of the current generation to obtain the chromosomes of the next generation. In the fourth step, these training steps are iterated to produce new generations until a termination condition is fulfilled. At the end of training, the chromosome with the highest fitness in the last generation is selected to create an ANN prediction model, which performs a feed-forward computation to obtain the prediction output over a test dataset. After preliminary optimization, the parameters used in the GANN algorithm in this work were set as follows: (i) the number of input nodes: 30; (ii) the number of hidden nodes: 5; (iii) the number of output nodes: 1; (iv) maximum generation number: 1420; (v) population size: 100; (vi) crossover probability: 0.95; (vii) initial mutation probability: 0.015; (viii) threshold of the fitness value: 0.9. For more details about the GANN algorithm and the corresponding configurations, please refer to our recent publication on a GANN based protein phosphorylation site predictor (Tang et al., 2007).
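The four-step process above can be illustrated with a compact Python sketch. Only the layer layout, the MCC fitness, the three operator names and the numeric defaults come from the text. Tournament selection, single-point crossover, Gaussian mutation with a fixed (not adaptive) probability, elitism, sigmoid units with bias terms, the 0.5 decision threshold and all function names are assumptions made for illustration; the GANN implementation described by Tang et al. (2007) may differ in each of these.

```python
import math
import random


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def predict(weights, x, n_hidden):
    """Feed-forward pass of a one-hidden-layer network with sigmoid units."""
    n_in = len(x)
    sig = lambda z: 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, z))))
    pos, hidden = 0, []
    for _ in range(n_hidden):
        row = weights[pos:pos + n_in + 1]          # n_in weights plus one bias
        hidden.append(sig(sum(w * v for w, v in zip(row, x)) + row[-1]))
        pos += n_in + 1
    out = weights[pos:pos + n_hidden + 1]          # n_hidden weights plus bias
    return sig(sum(w * h for w, h in zip(out, hidden)) + out[-1])


def fitness(weights, data, n_hidden):
    """MCC of thresholded predictions over (features, label) pairs."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for x, label in data:
        called = predict(weights, x, n_hidden) >= 0.5
        key = ("t" if called == bool(label) else "f") + ("p" if called else "n")
        counts[key] += 1
    return mcc(**counts)


def train_gann(data, n_hidden=5, pop_size=100, max_gen=1420, p_cross=0.95,
               p_mut=0.015, target=0.9, seed=0):
    """Evolve a weight vector (chromosome) that maximizes MCC on data."""
    rng = random.Random(seed)
    n_weights = n_hidden * (len(data[0][0]) + 1) + n_hidden + 1
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(max_gen):
        scored = sorted(((fitness(c, data, n_hidden), c) for c in pop),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] >= target:                 # fitness threshold reached
            break
        # SELECTION: binary tournament (assumption).
        pick = lambda: max(rng.sample(scored, 2), key=lambda t: t[0])[1]
        nxt = [scored[0][1][:]]                    # elitism (assumption)
        while len(nxt) < pop_size:
            a, b = pick(), pick()
            if rng.random() < p_cross:             # CROSSOVER: single point
                cut = rng.randrange(1, n_weights)
                a = a[:cut] + b[cut:]
            # MUTATION: Gaussian perturbation of individual weights.
            nxt.append([w + rng.gauss(0, 0.5) if rng.random() < p_mut else w
                        for w in a])
        pop = nxt
    return max(pop, key=lambda c: fitness(c, data, n_hidden))
```

A call such as `train_gann(training_data)` returns the best weight vector, which `predict` then applies to unseen residues. Because the fitness is the MCC itself rather than a differentiable loss, the search can optimize the evaluation measure directly, at the cost of many more network evaluations than gradient-based training.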

Prediction in entire structures. Since there are many more non-catalytic than catalytic residues in real enzymes, a predictor trained with balanced datasets is not suitable. To construct a better predictor for entire structures, the ratio of catalytic to non-catalytic residues in the training set must be optimized. In this test, the 79 enzymes were divided by structure into 10 roughly equal groups, i.e. each group contained seven or eight intact structures. Again, a 10-fold cross-validation was performed. For every testing group, we constructed five training sets by varying the non-catalytic residues. Performance was averaged over the results of all the tests.
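A sketch of the structure-level split and of a training set with a fixed catalytic to non-catalytic ratio is given below. The data layout (a mapping from a structure identifier to a list of `(features, is_catalytic)` pairs), the function names and the random seeds are hypothetical; the default ratio of 6 corresponds to the 1:6 setting discussed in the text.

```python
import random


def split_structures(structure_ids, k=10, seed=0):
    """Shuffle whole structures and deal them into k nearly equal groups."""
    ids = list(structure_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def training_set(train_structures, residues_by_structure, ratio=6, seed=0):
    """Keep every catalytic residue and subsample non-catalytic ones.

    residues_by_structure maps a structure id to (features, is_catalytic)
    pairs; at most ratio negatives are kept per positive.
    """
    rng = random.Random(seed)
    pos, neg = [], []
    for sid in train_structures:
        for item in residues_by_structure[sid]:
            (pos if item[1] else neg).append(item)
    return pos + rng.sample(neg, min(len(neg), ratio * len(pos)))
```

Splitting by structure rather than by residue keeps all residues of a test enzyme out of training, so each prediction is made on a structure the model has never seen. Varying `seed` in `training_set` gives the different negative samples used for the repeated training sets.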

Using LIBSVM with the selected eight properties, it was found that the best ratio of catalytic to non-catalytic residues in the training sets was 1:6, which is consistent with the work of Gutteridge et al. (2003). However, at such a relatively large proportion of non-catalytic residues, some structural properties might bring in more noise than useful information, so we re-optimized the selected subset by removing such properties. Since the dimensionality of the input feature vector changed, the number of hidden nodes and the termination condition in the GANN algorithm were also re-optimized.

Performance measure

Four measurements, i.e. accuracy (AC), true positive rate (TPR), false positive rate (FPR), and MCC, were used to evaluate the prediction performance, defined as follows:

\[ AC = \frac{tp + tn}{tp + fp + tn + fn} \qquad (6) \]

\[ TPR = \frac{tp}{tp + fn} \qquad (7) \]

\[ FPR = \frac{fp}{fp + tn} \qquad (8) \]

\[ MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}} \qquad (9) \]

where tp, fp, fn, and tn denote true positives, false positives, false negatives, and true negatives. When the numbers of positive and negative data are different, MCC should be more suitable for assessing the overall prediction accuracy. The value of MCC ranges from −1 to 1, and a higher MCC means better prediction performance.
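The four measures, taken in their standard confusion-matrix forms, can be computed as in the sketch below. The counts in the usage note are hypothetical and serve only to show why accuracy alone is uninformative on imbalanced data; returning an MCC of 0 when the denominator vanishes is a common convention, not something stated in the text.

```python
import math


def metrics(tp, fp, fn, tn):
    """Accuracy, true/false positive rates and MCC from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "AC": (tp + tn) / (tp + fp + fn + tn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        # MCC is set to 0.0 when any marginal total is zero (convention).
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

For a hypothetical structure with 10 catalytic and 300 non-catalytic residues, `metrics(tp=0, fp=0, fn=10, tn=300)` (every residue called non-catalytic) gives an AC above 0.96 but an MCC of 0, whereas `metrics(tp=7, fp=10, fn=3, tn=290)` has a slightly lower AC and a clearly positive MCC.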

When a prediction is performed on an entire structure, it is also important to know whether the active site can be correctly identified. Based on the predicted catalytic residues, the procedure described by Gutteridge et al. (2003) was employed to locate the predicted active sites. First, two predicted catalytic residues were clustered together if the shortest distance between them was no more than 4.0 Å, and each cluster was represented by a sphere whose centre was the geometric centroid of the Cβ atoms of all component residues (the Cα atom in glycine) and whose radius was equal to the distance from the farthest Cβ atom to the centre. Secondly, single residues were added to an existing cluster if this would not increase its radius to over 20.0 Å. Thus, several clusters could be constructed and each cluster was considered a predicted active site. Known active sites were also defined as spheres in the same way, and a radius of 3.0 Å was assigned to a single-residue site. For each enzyme structure, a correct active site prediction means that the overlap between a predicted active site and the corresponding known site is greater than 50% of the volume of the known one; a partially correct prediction means that there is some overlap but it is less than 50%; and an incorrect prediction means no overlap at all.
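The sphere representation and the overlap criterion can be sketched as follows. The overlap uses the standard closed-form volume of the intersection of two spheres; the function names are hypothetical, the clustering step itself is not shown, and applying the 3.0 Å single-residue radius to predicted as well as known sites is an assumption.

```python
import math


def sphere(cb_coords, single_radius=3.0):
    """Centroid and radius of the sphere enclosing a cluster's C-beta atoms."""
    n = len(cb_coords)
    centre = tuple(sum(c[k] for c in cb_coords) / n for k in range(3))
    if n == 1:
        return centre, single_radius
    return centre, max(math.dist(centre, c) for c in cb_coords)


def overlap_volume(c1, r1, c2, r2):
    """Volume shared by two spheres with centres c1, c2 and radii r1, r2."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:                     # disjoint or touching
        return 0.0
    if d <= abs(r1 - r2):                # one sphere inside the other
        return 4.0 / 3.0 * math.pi * min(r1, r2) ** 3
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2 * d * (r1 + r2) - 3 * (r1 - r2) ** 2) / (12 * d))


def classify(predicted, known):
    """Label a predicted site by its overlap with the known site's volume."""
    (cp, rp), (ck, rk) = predicted, known
    v = overlap_volume(cp, rp, ck, rk)
    known_volume = 4.0 / 3.0 * math.pi * rk ** 3
    if v > 0.5 * known_volume:
        return "correct"
    return "partially correct" if v > 0 else "incorrect"
```

Both arguments of `classify` are `(centre, radius)` pairs as returned by `sphere`. An overlap of exactly 50% is not covered by the criteria in the text; this sketch assigns it to the partially correct class.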


    Results and discussion
Results based on balanced datasets

In this work, 18 residue property based encodings were individually evaluated using the LIBSVM program. As shown in Fig. 1A, the sequence conservation based encoding (i.e. cons) remained the most informative encoding, which is in line with previous studies (Gutteridge et al., 2003; Petrova and Wu, 2006). Interestingly, the closeness centrality based encodings (NCC_ww and NCC_nw) appeared to be the second most discriminative feature. In addition, performance based on closeness centrality was relatively steady over different datasets (cf. Fig. 1A). NCC_ww was slightly more powerful than NCC_nw (cf. Fig. 1A), probably because NCC_ww reflects, to some extent, the strength of the interaction between residues. Also in accordance with a previous study (Amitai et al., 2004), the NDC encoding was not useful. A plausible reason is that degree centrality only describes the local environment around a residue, while closeness centrality characterizes a residue by its relationship with all residues in the structure, which is more helpful in deciding what role the residue plays in the entire enzyme.


Fig. 1. Property evaluation and selection using LIBSVM. (A) Accuracy based on individual properties. Error bars indicate standard deviations. (B) The prediction accuracy when other properties were added to AA_Type20 step by step. The bold solid line indicates the performance of Petrova and Wu’s method (Petrova and Wu, 2006).

 
Moreover, the feature selection tool based on LIBSVM was employed to select the optimized subset of properties. As shown in Fig. 1B, eight encodings with a total dimension of 30 jointly contributed to an optimal performance in predicting catalytic residues. They fell into five categories, i.e. residue type, sequence conservation, network centrality, relative position, and hydrogen bonding. Interestingly, NCC_nw contributed more than NCC_ww in the optimized subset of residue properties (cf. Fig. 1B), which might be because NCC_ww is affected by the size of the side chains of different residues, which enlarges its overlap with AA_Type20. The three hydrogen-bonding-based encodings (NmHB, NsHB, and tNHB) were not very powerful when tested alone, but they helped when added to the first five encodings (cf. Fig. 1B), because hydrogen bonding captures another aspect of residue property, i.e. conformational freedom, which is not covered by the first five encodings. Other properties did not help much when added, because to some extent they overlap with encodings in the optimal subset, e.g. AA_Type3 with AA_Type20, RSA with Cleft, and structural flexibility with hydrogen bonding. Detailed analyses of these optimal properties are illustrated in Fig. 2. They suggest that catalytic and non-catalytic residues do differ in these characteristics. In particular, catalytic residues clearly tended to have both high conservation scores and high closeness centrality values (cf. Fig. 2C).


Fig. 2. Analyses of residue properties in the optimal subset. (A) Frequency distribution of 20 amino acids in catalytic and non-catalytic residues; (B) number of hydrogen bonds in catalytic and non-catalytic residues; (C) residue conservation and closeness centrality properties; (D) relative position of catalytic and non-catalytic residues.

 
The same datasets and inputs were used to train and test the GANN based algorithm. GANN achieved a better performance than LIBSVM (cf. Table I): the average accuracy increased by 3.0%. This might indicate that GANN is better suited to this kind of data than SVM.


Table I. Performance of different algorithms based on the balanced training datasets

 
Results of prediction in entire structures

In real enzymes, the ratio of catalytic to non-catalytic residues is far from 1:1. Therefore, our method was also tested on entire enzyme structures. For this prediction, the ratio of catalytic to non-catalytic residues in the training datasets was optimally set to 1:6, and the testing was performed on entire enzymes. The performance of the LIBSVM and GANN based algorithms on the same datasets is compared in Table II. On the whole, GANN retained its advantage over the LIBSVM based method. With only three properties (AA_Type20, cons, and NCC_nw), GANN achieved an MCC of 0.364, while LIBSVM needed seven attributes (AA_Type20, cons, NCC_nw, Cleft, NCC_ww, NmHB, and tNHB) to reach an MCC of 0.342. The most noteworthy advantage was that GANN was much more sensitive when handling data with such a large proportion of negative instances. In comparison with LIBSVM, GANN increased the TPR from 57.8 to 73.2% without a notable increase in FPR (only from 2.6 to 3.8%) (cf. Table II).


Table II. Performance of different algorithms based on the 1:6 training datasetsa

 
We also tried to locate active sites in enzyme structures according to the predicted catalytic residues. Spheres containing clusters of predicted catalytic residues were used to represent predicted active sites, as described in the Methods section. As shown in Table III, based on the GANN prediction, 73.9% of the enzymes had the active site correctly located, thanks to GANN’s increased sensitivity, and in another 20.9% the location was partially correct. Predicted sites often lay close to the known active sites, as only 5.2% of the tested enzymes had no predicted active site overlapping with the known one.


Table III. Performance of different algorithms in locating active sites based on the 1:6 training datasetsa

 
To show intuitively the difference that results from different ratios of positive to negative data in the training datasets, the catalytic residue prediction for one enzyme structure (aspartylglucosaminidase, PDB entry: 1apy) is given as an example. As shown in Fig. 3, with a balanced training set (1:1), all catalytic residues were successfully identified, but the false positive rate was also quite high. With a 1:6 training set, far fewer non-catalytic residues were incorrectly predicted as catalytic, yet the true positive rate fell significantly at the same time. Indeed, all residues would be predicted as non-catalytic if the proportion of negative instances in the training sets kept growing. Compared with the prediction based on the 1:1 training dataset, it is interesting to note that most of the false positives were located close to catalytic residues, which indicates that they are quite likely to be involved in the binding of substrates or the stabilization of products. Because of this spatial relationship between false and true positives, the location of the active site in most enzymes, including 1apy, can be correctly detected, although the identification of catalytic residues was not as precise as in the 1:1 model. However, discriminating catalytic residues from their structurally neighbouring residues remains a challenge for improving the accuracy of catalytic residue prediction in enzyme structures.


Fig. 3. Predicted catalytic residues in aspartylglucosaminidase (PDB entry: 1apy). (A) Prediction based on a 1:1 training dataset; (B) prediction based on a 1:6 training dataset. White spheres indicate true positives, black spheres indicate false negatives, and false positives are shown by their side chains as white sticks.
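The effect of the positive-to-negative ratio discussed above can be made concrete with a small sketch of how a training set with a chosen ratio might be assembled. This is only an illustration under stated assumptions: the function name `build_training_set`, the random undersampling of non-catalytic residues and the toy residue identifiers are all hypothetical and are not taken from the procedure used in this study.

```python
import random

def build_training_set(catalytic, non_catalytic, negatives_per_positive, seed=0):
    """Return shuffled (residue, label) pairs with a fixed positive:negative ratio.

    All catalytic residues are kept (label 1); non-catalytic residues are
    randomly undersampled (label 0). This sampling scheme is an assumption
    made for illustration only.
    """
    rng = random.Random(seed)
    n_neg = min(len(non_catalytic), negatives_per_positive * len(catalytic))
    sampled = rng.sample(non_catalytic, n_neg)
    data = [(r, 1) for r in catalytic] + [(r, 0) for r in sampled]
    rng.shuffle(data)
    return data

# Toy residue identifiers (hypothetical, not taken from 1apy).
catalytic = [f"C{i}" for i in range(5)]
non_catalytic = [f"N{i}" for i in range(200)]

balanced = build_training_set(catalytic, non_catalytic, 1)  # 1:1 set
skewed = build_training_set(catalytic, non_catalytic, 6)    # 1:6 set
print(len(balanced), len(skewed))
```

With five positives, the 1:1 set holds 10 instances and the 1:6 set holds 35. A classifier trained on the second set sees six negatives for every positive, which is consistent with the lower false positive rate and lower true positive rate reported for the 1:6 model.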

 
Comparison of the proposed method with two existing methods

Using the same 79 enzyme structures previously used by Petrova and Wu allowed a fair comparison between the performance of their method and ours. Thanks to the encoding based on network closeness centrality, only four properties (AA_Type20, cons, NCC_nw and Cleft) were needed to achieve the same accuracy as Petrova and Wu’s method (cf. Fig. 1B). Furthermore, the LIBSVM algorithm based on the optimal properties slightly surpassed Petrova and Wu’s method (cf. Table I). Further empowered by a new machine learning algorithm, the GANN-based prediction achieved an even higher accuracy (about +4.0%) (cf. Table I). Considering that only a small fraction of the residues in an enzyme structure are catalytic, the choice of negative dataset may have a significant impact on the reported accuracy. In both this study and Petrova and Wu’s paper, the prediction accuracy was averaged over several 10-fold cross-validation tests with different negative datasets (i.e. non-catalytic residues). Thus, the comparison between these two methods should be reliable.

When testing the prediction on entire structures, Petrova and Wu’s method did not optimize the ratio of catalytic to non-catalytic residues in the training dataset, and achieved an MCC of only 0.23. We then compared our method with that of Gutteridge et al. As shown in Table II, our GANN method increased the MCC to 0.364 (Gutteridge et al.: 0.28 before clustering, 0.32 after clustering). In terms of the correct location of active sites, our GANN method also demonstrated a nearly 5.0% higher accuracy (cf. Table III). Although the datasets used in both Gutteridge et al.'s method and ours were extracted from the Catalytic Site Atlas (Porter et al., 2004), note that our dataset is smaller but more stringent, since redundancy had been removed as reported by Petrova and Wu (2006). Thus, such a comparison is generally reasonable.
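Because the comparison above is stated in terms of MCC, a short sketch of the standard Matthews correlation coefficient may be helpful. The formula is the usual one computed from confusion-matrix counts; the counts themselves are hypothetical values for an imagined 300-residue enzyme with four catalytic residues and do not come from Table II.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, fp, tn, fn):
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts (assumptions for illustration only).
some_hits = dict(tp=3, fp=12, tn=284, fn=1)      # finds 3 of 4 catalytic residues
all_negative = dict(tp=0, fp=0, tn=296, fn=4)    # predicts every residue as non-catalytic

for name, counts in [("some_hits", some_hits), ("all_negative", all_negative)]:
    print(name, round(accuracy(**counts), 3), round(mcc(**counts), 3))
```

In this toy case the all-negative predictor has the higher accuracy yet an MCC of zero, which illustrates why MCC is the more informative measure when only a small fraction of residues are catalytic.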

Future perspective

By using network closeness centrality as one of the key input features and adopting the GANN algorithm, we not only achieved a high accuracy in catalytic residue identification in the balanced model, but also successfully located most active sites in real enzymes. One immediate application is to combine the current algorithm with search methods based on active site templates for more reliable active site prediction. The current algorithm can therefore be useful in the functional annotation of newly determined protein structures from the Structural Genomics Projects (Brenner, 2001).

In spite of the improvement described above, the MCC value of our method remained below 0.4 for the identification of catalytic residues in entire structures, suggesting that the current algorithm alone is still not good enough for practical use. To improve the identification of catalytic residues, filtering out some non-catalytic residues before prediction should be helpful. For instance, residues located on the functional surface can be computationally identified first, an approach that has been implemented in several algorithms (Tseng et al., 2007). The difference between catalytic and non-catalytic residues may then become more obvious owing to the reduction of noise. Exploring new properties (encodings) is another important direction for developing a better predictor. In this study, closeness centrality played a more important part than any other structural feature. To some extent, this is because closeness centrality characterizes the relationship between a given residue and all other residues in a protein structure, which helps to determine its role in the entire enzyme when catalyzing a reaction. To detect functional sites within a protein, increasing effort has been devoted to finding new sequence or structural properties (e.g. Bagley and Altman, 1995; Bate and Warwicker, 2004; Liang et al., 2006; Ofran and Rost, 2007), whose suitability for predicting catalytic residues may be further validated. We expect that newly identified properties will not only improve the accuracy of catalytic residue prediction, but also strengthen our basic understanding of the molecular mechanisms of enzymatic reactions.
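To illustrate why closeness centrality relates a residue to all other residues in the structure, the following minimal sketch computes it on a toy residue contact network. It assumes the common definition (number of other reachable nodes divided by the sum of shortest-path lengths to them) and an arbitrary distance cutoff for defining contacts; the exact network construction and normalization used for the encoding in this study may differ, and the coordinates are invented.

```python
import math
from collections import deque

def contact_network(coords, cutoff):
    """Link residues whose representative points lie within `cutoff` of each other."""
    adj = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def closeness(adj, source):
    """Closeness centrality of `source` via breadth-first search.

    Only nodes reachable from `source` are counted; an isolated node gets 0.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Toy coordinates: five points on a line, one unit apart (hypothetical data).
coords = [(float(i), 0.0, 0.0) for i in range(5)]
adj = contact_network(coords, cutoff=1.5)  # cutoff chosen only for this toy example

scores = [closeness(adj, i) for i in range(5)]
print([round(s, 3) for s in scores])
```

In this chain the central node has the highest closeness and the two end nodes the lowest, mirroring the idea that residues near the topological centre of the contact network can reach all other residues in few steps.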


    Funding
The National Natural Science Foundation of China (30700137).


    Footnotes
 
1 Both authors contributed equally to this work.

2 Present address: National Institute of Biological Sciences, No. 7 Science Park Road, Beijing 102206, China.

Edited by Valerie Daggett


    Acknowledgements
The authors extend their gratitude to Dr James Torrance for providing the fully annotated version of the Catalytic Site Atlas. This research was supported by the National Natural Science Foundation of China (30700137).


    References
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Nucleic Acids Res. (1997) 25:3389–3402.

Amitai G., Shemesh A., Sitbon E., Shklar M., Netanely D., Venger I., Pietrokovski S. J. Mol. Biol. (2004) 344:1135–1146.

Bagley S.C., Altman R.B. Protein Sci. (1995) 4:622–635.

Bartlett G.J., Porter C.T., Borkakoti N., Thornton J.M. J. Mol. Biol. (2002) 324:105–121.

Bate P., Warwicker J. J. Mol. Biol. (2004) 340:263–276.

Ben-Shimon A., Eisenstein M. J. Mol. Biol. (2005) 351:309–326.

Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. Nucleic Acids Res. (2000) 28:235–242.

Brenner S. Nature Rev. Genet. (2001) 2:801–809.

Carter P., Andersen C.A., Rost B. Nucleic Acids Res. (2003) 31:3293–3295.

Chang C.C., Lin C.J. Computer Program (2001) Taipei, Taiwan: Department of Computer Science, National Taiwan University.

Chea E., Livesay D.R. BMC Bioinformatics (2007) 8:153.

Cho S.B. Fuzzy Sets Syst. (1999) 103:339–347.

Chou K.C., Cai Y.D. Proteins (2004) 55:77–82.

del Rio G., Bartley T.F., del-Rio H., Rao R., Jin K.L., Greenberg D.A., Eshoo M., Bredesen D.E. FEBS Lett. (2001) 509:230–234.

del Sol A., Fujihashi H., Amoros D., Nussinov R. Protein Sci. (2006) 15:2120–2128.

Fish K.E., Johnson J.D., Dorsey R.E., Blodgett J.G. J. Business Res. (2004) 57:79–85.

Goyal K., Mohanty D., Mande S.C. Nucleic Acids Res. (2007) 35(Web Server issue):W503–W505.

Greene L.H., Higman V.A. J. Mol. Biol. (2003) 334:781–791.

Gutteridge A., Bartlett G.J., Thornton J.M. J. Mol. Biol. (2003) 330:719–734.

Hubbard S.J., Thornton J.M. Computer Program (1993) London: Department of Biochemistry and Molecular Biology, University College.

Ko J., Murga L.F., Wei Y., Ondrechen M.J. Bioinformatics (2005) 21(Suppl. 1):258–265.

Laskowski R.A. J. Mol. Graph. (1995) 13:323–330.

Li W., Godzik A. Bioinformatics (2006) 22:1658–1659.

Liang S., Zhang C., Liu S., Zhou Y. Nucleic Acids Res. (2006) 34:3698–3707.

McDonald I.K., Thornton J.M. J. Mol. Biol. (1994) 238:777–793.

Murzin A.G., Brenner S.E., Hubbard T., Chothia C. J. Mol. Biol. (1995) 247:536–540.

Nagano N., Orengo C.A., Thornton J.M. J. Mol. Biol. (2002) 321:741–765.

Ofran Y., Rost B. Bioinformatics (2007) 23:e13–e16.

Ofran Y., Punta M., Schneider R., Rost B. Drug Discov. Today (2005) 10:1475–1482.

Orengo C.A., Todd A.E., Thornton J.M. Curr. Opin. Struct. Biol. (1999) 9:374–382.

Petrova N.V., Wu C.H. BMC Bioinformatics (2006) 7:312.

Porter C.T., Bartlett G.J., Thornton J.M. Nucleic Acids Res. (2004) 32:D129–D133.

Rost B. J. Mol. Biol. (2002) 318:595–608.

Shapiro L., Harris T. Curr. Opin. Biotechnol. (2000) 11:31–35.

Tang Y.R., Chen Y.Z., Canchaya C., Zhang Z. Protein Eng. Des. Sel. (2007) 20:405–412.

Tian W., Skolnick J. J. Mol. Biol. (2003) 333:863–882.

Todd A.E., Orengo C.A., Thornton J.M. J. Mol. Biol. (2001) 307:1113–1143.

Torrance J.W., Bartlett G.J., Porter C.T., Thornton J.M. J. Mol. Biol. (2005) 347:565–581.

Tseng Y.Y., Liang J. Ann. Biomed. Eng. (2007) 35:1037–1042.

Valdar W.S. Proteins (2002) 48:227–241.

Youn E., Peters B., Radivojac P., Mooney S.D. Protein Sci. (2007) 16:216–226.

Yuan Z., Zhao J., Wang Z.X. Protein Eng. (2003) 16:109–114.

Zhang Z., Grigorov M. Proteins (2006) 62:470–478.

Zhang Z., Tang Y.R. Protein Pept. Lett. (2007) 14:291–297.

Received October 15, 2007; revised January 4, 2008; accepted January 4, 2008.

