PEDS Advance Access originally published online on September 4, 2008
Protein Engineering Design and Selection 2008 21(11):659-664; doi:10.1093/protein/gzn045
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A novel hierarchical ensemble classifier for protein fold recognition
Information Engineering College, Xiangtan University, Xiangtan 411105, Hunan, PR China
1 To whom correspondence should be addressed. E-mail: xpgao{at}xtu.edu.cn
| Abstract |
|---|
|
|
|---|
The ensemble classifier plays a critical role in protein fold recognition. In this article, a novel hierarchical ensemble classifier named GAOEC (Genetic-Algorithm Optimized Ensemble Classifier) is presented and it can be constructed in the following steps. First, a novel optimized classifier named GAET-KNN (Genetic-Algorithm Evidence-Theoretic K Nearest Neighbors) is proposed as a component classifier. Second, six component classifiers in the first layer are used to get a potential class index for every query protein. Third, according to the results of the first layer, every component classifier in the second layer generates a 27-dimension vector whose elements represent the confidence degrees of 27-folds. Finally, genetic algorithm is used for generating weights for the outputs of the second layer to get the final classification result. The standard percentage accuracy of GAOEC is 64.7% on a widely used benchmark dataset, where the proteins in the testing set have less than 35% identity with those in the training set.
Keywords: ET-KNN/GAET-KNN/genetic algorithm/hierarchical ensemble classifier/protein fold recognition
| Introduction |
|---|
|
|
|---|
With the magnitude of new sequences growth, it is urgent to develop novel structure prediction algorithms to determine the structure of new sequences. A number of methods have achieved some success through recognizing protein fold, which is a common three-dimensional pattern with the same major secondary structure elements in the same arrangement and with the same topological connections (Craven et al., 1995
In recent years, the taxonometric approach without relying on similarity features plays a critical role in protein fold recognition (Ding and Dubchak, 2001
; Huang et al., 2003
; Tan et al., 2003
; Nanni, 2005
; Nanni, 2006a; Nanni, 2006b, Shen and Chou, 2006
). The taxonometric approach directly extracts features from query proteins and transfers protein fold recognition to a multi-class problem. As we know, the larger the number of classes, the more difficult the classification for multi-class problems and the number of protein folds exceeds 1000. Therefore, it is maybe impossible to recognize the fold of a query protein in all protein folds through the taxonometric approach. Most taxonometric approaches (Ding and Dubchak, 2001
; Huang et al., 2003
; Tan et al., 2003
; Nanni, 2005
; Nanni, 2006a; Nanni, 2006b, Shen and Chou, 2006
) determine the fold of the query in 27-folds, which have no less than seven proteins and represent all major structural classes:
, β,
/β and
+β. Protein fold recognition in the 27-folds is also a challenging task in the research of protein fold recognition. Here, we focus on the ensemble classifier for protein fold recognition in the 27-folds as previous researchers. In general, an ensemble classifier is constructed through training a number of component classifiers and then integrating the component predictions. Therefore, the performance of component classifiers and the efficiency of ensemble strategies are the major factors for the classification accuracy of ensemble classifiers. Although PFP-Pred (Shen and Chou, 2006
) has achieved better performance comparing with previous researches, it has two shortcomings. First, because the optimized parameters in OET-KNN (Optimized Evidence-Theoretic K Nearest Neighbors) are far from the global optimum, the component classifier OET-KNN is not very efficient. Second, the ensemble weighted strategy, wherein the classification accuracies of component classifiers sever as corresponding weights, does not generate the optimum weight vector to maximize classification accuracy.
In this article, a novel two-layer ensemble classifier named GAOEC is presented. First, the component classifiers in the first layer are used to get a potential class index for every query protein in the 27-folds. Second, according to the potential class index, every component classifier in the second layer generates a 27-dimension vector for every query protein, wherein each element represents the confidence degree of its corresponding fold. The rude classification in the first layer helps the operation of the second layer through excluding the training samples whose class labels are not consistent with any element of the potential class index. As previous researches, six kinds of features are used to recognize protein fold. Each layer includes six component classifiers. In practice, reducing reasonably the noise in the training set increases the accuracies of the second layer by up to 10% compared with those of the first layer. The component classifiers in each layer are GAET-KNNs, which use genetic algorithm (GA) to generate the optimum parameter vector in ET-KNNs to maximize classification accuracy. It proves in practice that GAET-KNN as the component classifier achieves much better accuracy than OET-KNN. Additionally, tuning the number of the nearest neighbors in GAET-KNN has little influence on classification accuracy. Considering its powerful global optimization performance, GA is used for generating the weights for the outputs of the second layer to maximize classification accuracy. Last, GAOEC generates a 27-dimension vector and the index of the maximum element in the vector is the final classification result for the query. In this article, majority voting also severs as the ensemble strategy for assessing the performance of the classification system.
| Materials and methods |
|---|
|
|
|---|
Materials
The dataset studied here is taken from Ding and Dubchak (Ding and Dubchak, 2001
). It contains 313 proteins in the training set and 385 proteins in the testing set. Proteins in the training set and the testing set are classified into 27 most populated folds (Table A1) which have no less than seven proteins and represent all major structural classes:
, β,
/β, and
+β (Ding and Dubchak, 2001
). None of proteins in the testing set has more than 35% sequence identity with those in the training set. Six kinds of features are extracted from every protein in the dataset (Table I).
|
Various accuracy measures are used to assess the performance of classifiers for protein fold recognition. True positive rate and false positive rate are usually used in two-way classifiers. Analogously, sensitivity and specificity are usually used in the similarity-based methods (Bindewald et al., 2003
Suppose ni denotes the number of testing proteins in the ith fold and only ci proteins are correctly classified. So, the classification accuracy of the ith fold can been denoted as Qi=ci/ni. Thus, the overall classification accuracy can be formulated as follows:
|
| 1 |
|
| 2 |
|
| 3 |
The two-layer ensemble classifier (GAOEC) As reviewed in Introduction section, although PFP-Pred performs well comparing with previous ensemble classifiers, both its component classifiers and ensemble weighted strategy are not efficient enough. For multi-class problems, it is obvious that the larger the number of classes, the more difficult classification. Reducing reasonably the noise in the training set might be an efficient means for improving classification performance.
In this section, a novel hierarchical ensemble classifier (GAOEC) is presented and it can be constructed in the following steps (Fig. 1). First, a novel optimized classifier (GAET-KNN) is proposed as a component classifier. Second, two layer GAET-KNNs are used to classify query proteins in the 27-folds. As previous researches, six kinds of features are extracted from every protein in the dataset. Each layer includes six GAET-KNNs. When a query protein is presented to GAOEC, every GAET-KNN in the first layer generates a 27-dimension vector, wherein each element represents the confidence degree of its corresponding fold. The index of the maximum element in the vector acts as the class label of the query. Thus, the number of potential folds for the query is not more than six. Therefore, the rude classification in the first layer can help the operation of the second layer through excluding the training samples whose class labels are not consistent with any one of the potential class index obtained by the first layer. Third, GA is used for generating the weights for the outputs of the second layer to maximize classification accuracy. Finally, GAOEC generates a 27-dimension vector and the index of the maximum element in the vector is the final classification result for the query.
|
The component classifier (GAET-KNN) In this article, rather than using existing OET-KNN, we propose a novel optimized classifier GAET-KNN as the component classifier which use GA to generate the optimum parameters in ET-KNN to maximize classification accuracy. For readers convenience, the ET-KNN rule used in paper is recalled as follows.
There are 27-folds, which can be denoted by S= {c1, c2, ... ,c27}. The training set, which contains N n-dimensional patterns x(i), can be denoted by
={(x(1), c(2)), (x(2), c(2)), ... , (x(N), c(N))}. The class label c(i) takes value in the set S. The similarity between two patterns is measured by Euclidean distance. If the distance is small, two patterns are deemed as belonging to the same class, otherwise, their classes are completely irrelevant. Therefore, the training sample with the smallest distance is deemed as the nearest neighbor for the query. According to the ET-KNN rule, every neighbor of the query x is considered as an item of evidence supporting certain hypotheses concerning the class membership of the query (Denoeux, 1995
). So, K nearest neighbors with the smallest distance are used to determine the fold of the query. A basic belief assignment (BBA) is assigned to every neighbor. The resulting BBA is obtained through aggregating the BBA of the K nearest neighbors using the Dempsters rule. Consequently, a BBA can be defined by (Denoeux, 1995
):
|
| 4 |
|
| 5 |
where d (x, x(i)) is the Euclidean distance between x and x(i), Cu is the class of x(i),
is a fixed parameter such that 0<
<1 and
u is a positive parameter associated to the class Cu.
According to Dempsters rule, a resulting BBA m regarding the class of the query x can be formulated by (Denoeux, 1995
):
|
| 6 |
denotes the orthogonal sum and Ik={i1, ... , ik} is the indexical set of the K nearest neighbors of x.
Thus, m can be shown as the following expression (Denoeux, 1995
):
| 7 |
| 8 |
Thus, the class of the query x is Cu, if
|
| 9 |
Although the value of
has been proved not to be too critical, the tuning of the parameter vector
={
1,
2,
3, ... ,
u} has significant influence on classification accuracy (Denoeux, 1995
). A number of methods have been proposed to improve the performance of ET-KNN through generating more appropriate
u. For example,
u is set to the inverse of the mean distance between training patterns belonging to the class Cu (Denoeux, 1995
) or is determined through optimizing a performance criterion (Zouhal and Denoeux, 1998
). However, these methods can not get a global optimum parameter vector
.
As a powerful global optimization strategy, GA has already been successfully applied in various areas. Here, we use it to generate the optimum parameter vector
for ET-KNN and propose a novel optimized classifier named GAET-KNN. GAET-KNN generates a random parameter vector and then employs GA to evolve the parameter vector to achieve the highest classification accuracy. Thus, GAET-KNN can determine the parameters in ET-KNN as a whole and maximize classification accuracy when GA generates the optimum parameter vector
.
In this article, GAET-KNN is realized by utilizing NSGA-II (Deb et al., 2002
) and a floating coding scheme. It is obvious that the higher the classification accuracy, the better the parameter vector. Owing to the working scheme of NSGA-II, wherein the individual in the first rank has the smallest fitness value, the error rate of classification is used as the fitness function in NSGA-II. Of course, GAET-KNN can also be realized through using other kinds of GAs and coding schemes. The GAET-KNN is summarized in Fig. 2, where
is a training set, L is ET-KNN, A is the classification accuracy basing on
, N is the number of generations in NSGA-II.
|
The ensemble strategy Here, GA and major voting are used for generating weights for the second layer GAET-KNNs to get the final classification accuracy. The process of integrating the second layer GAET-KNNs using GA can be defined as follows:
| 10 |
So the resulting class for the query x is Cu with which score of the resulting BBA Yu is the highest (Shen and Chou, 2006
); i.e. suppose
|
| 11 |
| Results and discussion |
|---|
|
|
|---|
To be comparable with previous researchers, we test our method on the widely used datasets where the identity between two different protein sequences is below 35% and most sequences in testing set have less than 25% sequence identity with those in training set (Ding and Dubchak, 2001
The comparisons of 15 methods for protein fold recognition are shown in Table II. The methods (Bindewald et al., 2003
; Chinnasamy et al., 2003
; Gewehr et al., 2004
; Altschul et al., 1997
) are similarity-based methods in which the features are alignment features and the accuracy measure is the percent of correct first hits. The methods (Ding and Dubchak, 2001
; Huang et al., 2003
; Nanni, 2006a; Nanni 2006b; Shen and Chou, 2006
) and GAOEC are taxonometric approaches in which the features are biochemical features and the accuracy measure is the standard percentage accuracy. Thus, the comparisons are only meant to provide a broad, rough performance assessment rather than a precise ranking. Although the features in the similarity-based methods (Bindewald et al., 2003
; Chinnasamy et al., 2003
; Gewehr et al., 2004
; Altschul et al., 1997
) are different, the features in the taxonometric approaches (Ding and Dubchak, 2001
; Huang et al., 2003
; Nanni, 2006a; Nanni, 2006b; Shen and Chou, 2006
) are the same. Therefore, the comparisons of the taxonometric approaches (Ding and Dubchak, 2001
; Huang et al., 2003
; Nanni 2006a; Nanni 2006b; Shen and Chou, 2006
) and GAOEC can provide a relative reasonable ranking. The standard percentage accuracy of GAOEC is 64.7% for independent test when the number of the nearest neighbors in GAET-KNN is 23. At 65.3% specificity, the sensitivity of GAOEC is 77.4% for independent test when the number of the nearest neighbors in GAET-KNN is 12. Sensitivity is defined as the percentage of query proteins whose class labels are ranked the first by at least one GAET-KNN in the first layer. Specificity is defined as the ratio between the proteins correctly classified by GAOEC and the proteins whose class labels are ranked the first by at least one GAET-KNN in the first layer. In the paper, majority voting is also used as the ensemble strategy to assess the classification system and the best standard percentage accuracy is 63.7%.
|
In detail, when the number of the nearest neighbors in GAET-KNN is 23, the prediction accuracies of first layer classifiers C1, C2, C3, C4, C5, C6 are 54.8%, 46.8%, 43.1%, 40.5%, 41.6%, and 40.0%, respectively, which are much higher than the accuracies of SVM and ET-KNN (Table III). It proves in practice that GA greatly improves the performance of ET-KNN through generating the global optimum parameter vector
. The most effective feature is amino acid composition, consistently with former researchers (Ding and Dubchak, 2001
|
|
GAET-KNN is not only efficient but also stable. It proves in practice that tuning the number of nearest neighbors in GAET-KNN has little influence on classification accuracy. The accuracies of GAET-KNNs in the first and second layers are represented as a function of the number of nearest neighbors and are shown in Figs 3 and 4, respectively.
|
|
| Conclusion |
|---|
|
|
|---|
We have presented an optimized hierarchical ensemble classifier GAOEC for protein fold recognition in the following steps. First, we use GA to optimize parameters in ET-KNN and propose a novel classifier (GAET-KNN) as the component classifier. It has shown high and robust performance of classification. Second, we present a two-layer GAET-KNNs structure to classify the query protein in 27-folds. With the guide of the rude classification in the first layer, the second layer can achieve higher performance through filtering the irrelevant samples in the training set. Third, we use GA to generate weights for the outputs of the second layer to get the overall classification accuracy.
Our approach delivers a good performance on current widely used datasets. Although protein fold recognition is a challenging issue, GAOEC provides good insights to improve the performance.
| Funding |
|---|
|
|
|---|
This work is supported by the National Natural Science Foundation of China (Grant No. 60375021) and the Hunan Provincial foundation for Distinguished Young Scholars (Grant No. 05JJ10011).
| Appendix A |
|---|
|
|
|---|
For readers convenience, we list the 27-folds, the symbols and logograms used in this article in Tables A1
Table A1. The 27-folds used in this article
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Ntrain: number of folds used in the training set.
Ntest: number of folds used in the testing set.
Table A2. The logograms and their full names
|
Table A3. The symbols and their meanings
|
| Footnotes |
|---|
Edited by Stefano Gianni
| Acknowledgements |
|---|
|
|
|---|
The authors would like to thank Dr Bodong Li for his help on experiment, Dr Caiyan Jia and Shaoping Ling for their constructive advice. The authors also wish to thank C.H.Q. Ding at Lawrence Berkeley National Laboratory for sharing datasets.
| References |
|---|
|
|
|---|
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J.H., Zhang Z., Miller W., Lipman D.J. Nucleic Acids Res. (1997) 25:3389–3402.
Bindewald E., et al. Protein Eng. (2003) 16:785–789.
Baldi P., Brunak S., Chauvin Y., Andersen C., Nielen H. Bioinformatics (2000) 16:412–424.
Cheng J.L., Baldi P. Bioinformatics (2006) 22:1456–1463.
Chinnasamy A., Sung W.K., Mittal A. Pacific Symposium on Biocomputing—Altman R., Keith A., Hunter L., Jung T., Klein T., eds. (2003) 9:387–398.
Craven M.W., Mural R.J., Hauser L.J., Uberbacher E,C. ISMB (1995) 3:98–106.[Medline]
Deb K., Prata A., Agarwal S., Meyarivan T. IEEE Trans. Evol. Comput. (2002) 6:182–197.[CrossRef]
Denoeux T. IEEE Trans. Syst. Man Cybern. (1995) 25:804–813.[CrossRef][Web of Science]
Ding C.H., Dubchak I. Bioinformatics (2001) 17:349–358.
Han S., Lee B.C., Yu S.T., Jeong C.S., Lee S., Kim D. Bioinformatics (2005) 21:2667–2673.
Huang C.D., Lin C.T., Pal N.R. IEEE Trans. Nanobiosci. (2003) 4:221–232.
Gewehr J.E., von Öhsen N., Zimmer R. German Conference on Bioinformatics. (2004) 141–148.
Nanni L. Neurocomputing (2005) 68:317–321.
Nanni L. Neurocomputing (2006) a 69:2434–2437.[CrossRef][Web of Science]
Nanni L. Neurocomputing (2006) b 69:850–853.[CrossRef][Web of Science]
Rost B., Sander C. J. Mol. Biol. (1993) 232:584–599.[CrossRef][Web of Science][Medline]
Söding J. Bioinformatics (2005) 21:951–960.
Shen H.B., Chou K.C. Bioinformatics (2006) 22:1717–1722.
Tan A.C., Gilbert D., Deville Y. Genome Inform. (2003) 16:206–217.
Xu J., Xu Y., Lin G., Kim D., Li M. In Pac Symp Biocomput (2003).
Zhou H., Zhou Y. Proteins (2004) 55:1005–1013.[CrossRef][Web of Science][Medline]
Zouhal L.M., Denoeux T. IEEE Trans. Syst. Man Cybern. (1998) 28:263–271.[CrossRef][Web of Science]
Received November 14, 2007; revised July 29, 2008; accepted August 1, 2008.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
Q. Dong, S. Zhou, and J. Guan A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation Bioinformatics, October 15, 2009; 25(20): 2655 - 2662. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




