PEDS Advance Access published online on January 23, 2007

Protein Engineering Design and Selection, doi:10.1093/protein/gzl053

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

Article

Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins

Hong-Bin Shen1 and Kuo-Chen Chou1,2,3

1 Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 1954 Hua-Shan Road, Shanghai 200030, China 2 Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA

3 To whom correspondence should be addressed. E-mail: kchou{at}san.rr.com


    Abstract
 
A statistical analysis indicated that, of the 35 016 Gram-positive bacterial proteins in the recent Swiss-Prot database, ~57% have no subcellular location annotation. In the gene ontology database the situation is worse: ~67% of these proteins have no subcellular component annotation. With the avalanche of gene products generated in the post-genomic era, the number of such location-unknown entries will continue to increase. It is highly desirable to develop an automated method for identifying their subcellular localization in a timely and accurate manner, because the information thus obtained is very useful for both basic research and drug discovery. In view of this, an ensemble classifier called ‘Gpos-PLoc’ was developed for predicting Gram-positive protein subcellular localization. The new predictor fuses many basic classifiers, each of which was engineered according to the optimized evidence-theoretic K-nearest neighbors rule. As a demonstration, tests were performed on Gram-positive proteins among the following five subcellular location sites: (1) cell wall, (2) cytoplasm, (3) extracell, (4) periplasm and (5) plasma membrane. To eliminate redundancy and homology bias, only those proteins that have < 25% sequence identity to any other in the same subcellular location were included in the benchmark datasets. The overall success rates thus achieved by Gpos-PLoc were > 80% for both the jackknife cross-validation test and the independent dataset test, implying that Gpos-PLoc might become a very useful vehicle for expediting the analysis of Gram-positive bacterial proteins. Gpos-PLoc is freely accessible to the public as a web-server at http://202.120.37.186/bioinf/Gpos/.
To support the needs of investigators in the relevant areas, a downloadable file is provided at the same website listing the results identified by Gpos-PLoc for 31 898 Gram-positive bacterial protein entries in the Swiss-Prot database that either have no subcellular location annotation or are annotated with uncertain terms such as ‘probable’, ‘potential’, ‘perhaps’ and ‘by similarity’. These large-scale results will be updated once a year to include new entries of Gram-positive bacterial proteins and to reflect the continuous development of Gpos-PLoc.

Keywords: amphiphilic pseudo amino acid composition/fusion/gene ontology/Gram-positive/OET-KNN rule


    Introduction
 
Bacteria can be both harmful and useful to the environment and to animals, including humans. Therefore, information about the proteins in the bacterial cell, such as their function and subcellular location, is very useful for screening candidates in drug design or for selecting proteins for a special target. Bacteria can be divided into two groups: Gram-positive and Gram-negative. Gram-positive bacteria are those that are stained dark blue or violet by Gram staining, whereas Gram-negative bacteria cannot retain the stain and instead take up the counter-stain, appearing red or pink.

Extensive studies have been conducted to develop methods for predicting Gram-negative protein subcellular location (see, e.g. Nakai and Kanehisa, 1991; Nakai and Horton, 1999; Nakai, 2000; Gardy et al., 2003). However, very few reports have so far appeared for Gram-positive proteins in this regard.

According to the Swiss-Prot database (Bairoch and Apweiler, 2000), version 50.0 released on 30th May 2006, the total number of Gram-positive protein entries is 35 889. After excluding those annotated as ‘fragment’ or containing < 50 amino acid residues, the number is reduced to 35 016, of which 15 236 have subcellular location annotations (Item 1 of Table I). However, of these 15 236 proteins, only 3118 are annotated with experimental observations (Item 2 of Table I), while 12 118 are annotated with uncertain labels such as ‘probable’, ‘potential’, ‘perhaps’ and ‘by similarity’ (Item 3 of Table I). The uncertain annotations cannot be used as robust data for training a solid predictor. In fact, proteins with uncertain annotations are themselves targets for identification, either by newly developed predictors or by further experiments.



 
Table I. Breakdown of the 35 016 Gram-positive protein sequence entries from the Swiss-Prot database (version 50.0, released 30th May 2006) according to the nature of their subcellular location annotation and their expression in the GO database

 
Such a gap becomes even wider if a similar statistical analysis is conducted on the gene ontology (GO) database (Ashburner et al., 2000), which was established according to molecular function, biological process and cellular component. As shown in Item 5 of Table I, of the 35 016 Gram-positive proteins, only 11 436/35 016 = 32.7% have GO annotations indicating their subcellular components; i.e. the percentage of proteins with subcellular location annotations in the GO database is even lower than that in the Swiss-Prot database. Besides, since the GO database was derived from various other databases, including Swiss-Prot, the GO annotations might be contaminated by the uncertain information from the 12 118 entries indicated in Item 3 of Table I.

Therefore, the number of Gram-positive proteins that have reliable subcellular location annotations is 3118 (Item 2 of Table I), which is about 9% of all the Gram-positive protein entries concerned. In other words, there are (35 016 – 3118) = 31 898 Gram-positive proteins whose subcellular locations need to be identified or further confirmed.
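The percentages quoted in this section follow directly from the entry counts given above. The short check below is only a worked recomputation of those figures; the variable names are chosen here for illustration and are not taken from Table I.

```python
# Counts quoted in the text (Swiss-Prot 50.0, Gram-positive entries).
total_entries = 35_016   # after removing fragments and sequences < 50 residues
annotated = 15_236       # Item 1: any subcellular location annotation
experimental = 3_118     # Item 2: experimentally observed annotation
uncertain = 12_118       # Item 3: 'probable', 'potential', 'by similarity', ...
go_component = 11_436    # Item 5: GO cellular component annotation


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    # Items 2 and 3 should add up to Item 1.
    "annotated_split_consistent": experimental + uncertain == annotated,
    # Entries without any Swiss-Prot location annotation (~57% in the abstract).
    "unannotated_swissprot_pct": percent(total_entries - annotated, total_entries),
    # Entries without a GO cellular component annotation (~67% in the abstract).
    "unannotated_go_pct": percent(total_entries - go_component, total_entries),
    # Entries with reliable (experimental) annotation (about 9%).
    "reliable_pct": percent(experimental, total_entries),
    # Entries still to be identified or confirmed.
    "need_identification": total_entries - experimental,
}

for name, value in summary.items():
    print(f"{name}: {value}")
```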

With the rapid increase of gene products in the post-genomic era, the gap between newly found protein sequences and the knowledge of their subcellular location is expected to keep widening. To use these new proteins in a timely manner for basic research and drug discovery (Chou, 2004; Lubec et al., 2005), it is highly desirable to develop an effective method to bridge this gap. The present study was initiated in an attempt to address the challenge, with a focus on Gram-positive proteins.


    Materials
 
Protein sequences were collected from the Swiss-Prot database (Bairoch and Apweiler, 2000) version 50.0 at http://www.ebi.ac.uk/swissprot/ released on 30th May 2006 according to the annotation information in the CC (comment or notes) and OC (organism classification) fields. In order to collect as much of the desired information as possible while ensuring a high quality for the working datasets, the data were screened strictly according to the following criteria.
(1) Only those sequences annotated with ‘firmicutes’ and ‘actinobacteria’ in the OC field were collected because the current study was focused on Gram-positive proteins only.
(2) Because the same subcellular location (-!-SUBCELLULAR LOCATION) in the CC field might be annotated with different terms, several key words were used for each subcellular location. For example, in searching for cytoplasm proteins, the key words ‘cytoplasm’ and ‘cytoplasmic’ were used; for extracell proteins, the key words ‘extracell’, ‘extracellular’ and ‘secreted’ were used; for periplasm proteins, the key words ‘periplasm’ and ‘periplasmic’ were used; for plasma membrane proteins, the key words ‘plasma membrane’, ‘integral membrane’, ‘multi-pass membrane’ and ‘single-pass membrane’ were used.
(3) Sequences annotated with ambiguous or uncertain terms, such as ‘potential’, ‘probable’, ‘probably’, ‘maybe’, ‘likely’ or ‘by similarity’, were excluded.
(4) Sequences annotated with two or more locations were excluded because their location is not unique. For example, proteins with subcellular location annotated as ‘cytoplasm and plasma membrane’ were excluded.
(5) Sequences annotated with ‘fragment’ were excluded; sequences with < 50 amino acid residues were also removed because they might just be fragments.
(6) To avoid any homology bias, a redundancy cutoff was applied with a culling program (Wang and Dunbrack, 2003) to winnow those sequences that have ≥ 25% sequence identity to any other in the same subcellular location.
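Criteria (1) to (5) amount to a keyword filter over the annotation fields of each entry. The following is a minimal sketch of such a filter, not the authors' screening script: the function name `screen_entry`, its plain-string arguments and the cell wall key word are assumptions made for illustration, matching is done by simple substring search, and the redundancy culling of criterion (6) is not covered.

```python
from typing import Optional

# Key words per location follow criterion (2). The text lists none for the
# cell wall, so that single entry is an assumption.
LOCATION_KEYWORDS = {
    "cell wall": ["cell wall"],  # assumption, not given in the text
    "cytoplasm": ["cytoplasm", "cytoplasmic"],
    "extracell": ["extracell", "extracellular", "secreted"],
    "periplasm": ["periplasm", "periplasmic"],
    "plasma membrane": ["plasma membrane", "integral membrane",
                        "multi-pass membrane", "single-pass membrane"],
}
UNCERTAIN_TERMS = ["potential", "probable", "probably", "maybe", "likely",
                   "by similarity"]
GRAM_POSITIVE_TAXA = ["firmicutes", "actinobacteria"]


def screen_entry(oc: str, location: str, description: str,
                 sequence: str) -> Optional[str]:
    """Return the single accepted location name, or None if the entry is rejected.

    oc          -- organism classification text (hypothetical plain string)
    location    -- subcellular location comment text
    description -- entry description, checked for the word 'fragment'
    sequence    -- amino acid sequence
    """
    oc_l, loc_l = oc.lower(), location.lower()
    # Criterion (1): Gram-positive taxa only.
    if not any(taxon in oc_l for taxon in GRAM_POSITIVE_TAXA):
        return None
    # Criterion (3): drop uncertain annotations.
    if any(term in loc_l for term in UNCERTAIN_TERMS):
        return None
    # Criterion (5): drop fragments and sequences shorter than 50 residues.
    if "fragment" in description.lower() or len(sequence) < 50:
        return None
    # Criteria (2) and (4): exactly one location must match.
    hits = [name for name, words in LOCATION_KEYWORDS.items()
            if any(word in loc_l for word in words)]
    return hits[0] if len(hits) == 1 else None
```

Substring matching is crude; a real screen would need to parse the annotation text more carefully, for instance to avoid treating an unrelated phrase that merely contains a key word as a hit.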

After strictly following the above six criteria, we obtained 452 Gram-positive proteins, of which 14 belonged to cell wall, 196 to cytoplasm, 108 to extracell, 5 to periplasm and 129 to plasma membrane (Fig. 1). It is instructive to point out here that, for quite a long time, many thought that there was no periplasm in Gram-positive bacteria. However, with the technique of cryo-electron microscopy, the existence of a periplasmic space between the plasma membrane and the thick peptidoglycan layer of Gram-positive bacteria was observed very recently (Zuber et al., 2006), further supporting the classification scheme illustrated in Fig. 1. The 452 Gram-positive proteins thus obtained form a dataset $S^0$, which is the union of the following five subsets

$$S^0 = S_1^0 \cup S_2^0 \cup S_3^0 \cup S_4^0 \cup S_5^0 \tag{1}$$
On the basis of dataset $S^0$, two working datasets, i.e. a learning (training) dataset $S^L$ and an independent testing dataset $S^T$, were constructed. In order to fully use the data in $S^0$ while guaranteeing that $S^L$ and $S^T$ are completely independent of each other, the following condition was imposed

$$S^L \cup S^T = S^0, \qquad S^L \cap S^T = \emptyset \tag{2}$$
where ∪, ∩ and Ø represent the symbols for ‘union’, ‘intersection’ and ‘empty set’ in set theory, respectively. Protein samples in the corresponding subsets of $S^L$ and $S^T$ are randomly assigned according to the following ‘bracket percentage distribution’ criterion

Formula 053M3

3
where $n_i^0$, $n_i^L$ and $n_i^T$ are the numbers of protein samples in the ith subset of the original dataset $S^0$, learning dataset $S^L$ and testing dataset $S^T$, respectively, and the symbol INT means taking the integer part of the number in the brackets right after it. The numbers of proteins thus obtained for the five subcellular locations in the learning dataset $S^L$ and testing dataset $S^T$ are given in Table II. The accession numbers and sequences for the corresponding proteins in the learning and testing datasets are given in the Online Supplementary Materials A and B, respectively.


 
Fig. 1. Schematic illustration to show the five subcellular locations of Gram-positive proteins: (1) cell wall, (2) cytoplasm, (3) extracell, (4) periplasm and (5) plasma membrane.

 


 
Table II. Number of Gram-positive proteins in each of the five subcellular locations for the learning and testing datasets, respectively

 

    Method
 
The sequential model and the discrete model are often used for predicting protein subcellular location. In the sequential model, the sample of a protein is represented by its amino acid sequence, and sequence similarity search-based tools like BLAST are used to conduct prediction. However, this approach fails when the query protein does not have significant homology to proteins of known location. In the discrete model, the sample of a protein is represented by a set of discrete numbers. The simplest one is the AA-discrete model, in which the sample of a protein is represented by its amino acid composition (AA) (e.g. see Nakashima and Nishikawa, 1994; Cedano et al., 1997; Chou and Elrod, 1999; Zhou and Doctor, 2003). In the AA-discrete model, all sequence-order effects are lost. To avoid completely losing the sequence-order information, the PseAA-discrete model was introduced (Chou, 2001); it reflects the sequence-order information (at least partially) through a set of correlation factors called ‘pseudo amino acid components’ (PseAA), and the prediction quality has been remarkably improved (see e.g. Feng, 2002; Chou and Cai, 2003b; Shen and Chou, 2005a; Wang et al., 2005; Xiao et al., 2005; Zhang et al., 2006). The FunD-discrete model (Chou and Cai, 2002; Cai et al., 2003) is the one in which the sample of a protein is represented by its functional domain composition (FunD). The FunD-discrete model is particularly effective for predicting protein structural class (Chou and Cai, 2004a) and protease type (Chou and Cai, 2006).

The current predictor is called Gpos-PLoc. It was established on two cornerstones: one is the GO-PseAA discrete model and the other is the fusion OET-KNN (optimized evidence-theoretic K-nearest neighbors) operating engine. The former formulates the protein samples by hybridizing GO (Ashburner et al., 2000) and the amphiphilic PseAA (Chou, 2005), as detailed in Chou and Shen (2006); the latter is a powerful ensemble classifier formed by fusing many basic individual classifiers, each of which is engineered according to the OET-KNN rule (Denoeux, 1995; Keller et al., 1985).

One of the important advantages in using the GO-PseAA discrete model is that the protein samples mapped into the GO space will be clustered in a way distinctly correlated with their subcellular locations, so as to result in a high prediction quality even if the numbers of proteins in some of the training subsets are not as large as usually required. The procedures in using the GO-PseAA discrete model to represent a protein sample can be described as follows.

Mapping UniProtKB/Swiss-Prot protein entries (Apweiler et al., 2004) to the GO database, one can get a list of data called ‘gene_association.goa_uniprot’, where each UniProtKB/Swiss-Prot protein entry corresponds to one or several GO numbers. In this study, such a data file was directly downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ (released on 4th March 2006). As shown in Table III, the relationships between the UniProtKB/Swiss-Prot protein entries (accession numbers) and the GO numbers may be one-to-many, ‘reflecting the biological reality that a particular protein may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell’ (Ashburner et al., 2000). It can be seen from Table III that, for those proteins with a ‘subcellular location unknown’ annotation in the Swiss-Prot database, their corresponding GO numbers in the GO database are also annotated with ‘cellular component unknown’ (e.g. proteins with accession numbers P82679, Q9RC23 and Q7S1D3), and that even for some proteins whose subcellular locations are clearly annotated in the Swiss-Prot database, their corresponding GO numbers in the GO database are annotated with ‘cellular component unknown’ (e.g. the protein with accession number P00782).



 
Table III. Examples to show the subcellular location annotations for some Gram-positive bacterial proteins in the Swiss-Prot database and the annotations of their GO numbers in the GO database

 
Also, because the current GO database is not complete yet, some protein entries (such as ‘P0A5B7’, ‘O53077’ and ‘P0A5Q2’) have no corresponding GO numbers, i.e. no mapping records at all in the GO database, and hence are not included in the data list of gene_association.goa_uniprot.

Furthermore, the GO numbers do not increase in a successive, orderly fashion. For easier handling, a reorganization and compression procedure was used to renumber them. For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000004, GO:0000006, ... , GO:0051912 would become GO_compress:0000001, GO_compress:0000002, GO_compress:0000003, GO_compress:0000004, GO_compress:0000005, ... , GO_compress:0009918, respectively. The GO database thus obtained is called the GO_compress database, whose dimension was reduced from 51 912 in the original GO database to 9918. Each of the 9918 entities in the GO_compress database served as a base to define a protein sample.
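The renumbering described above can be pictured as assigning consecutive indices to the GO numbers that actually occur, in increasing order. The sketch below illustrates that idea only; the helper names `build_go_compress` and `format_compressed` are hypothetical, and the authors' own procedure may differ in detail.

```python
def build_go_compress(go_ids):
    """Map each distinct GO identifier to a consecutive 1-based index.

    Identifiers are ordered by their numeric part, so gaps in the original
    numbering (e.g. a missing GO:0000005) are closed up.
    """
    ordered = sorted(set(go_ids), key=lambda go: int(go.split(":")[1]))
    return {go: index for index, go in enumerate(ordered, start=1)}


def format_compressed(index: int) -> str:
    """Render an index in the 'GO_compress:0000005' style used in the text."""
    return f"GO_compress:{index:07d}"


# The first few identifiers from the example in the text.
example_ids = ["GO:0000001", "GO:0000002", "GO:0000003",
               "GO:0000004", "GO:0000006"]
go_index = build_go_compress(example_ids)
print(format_compressed(go_index["GO:0000006"]))
```

With the five identifiers above, GO:0000006 is mapped to the fifth position, matching the example given in the text.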

Unfortunately, the current GO numbers do not give complete coverage, in the sense that some proteins might not correspond to any GO number, as mentioned above. Although this problem will gradually become trivial or eventually be solved with the continuous development of the GO database, to tackle it right now a hybridization approach was introduced by fusing the GO approach and the amphiphilic pseudo amino acid composition (PseAA) approach (Chou, 2005; Chou and Cai, 2005), as described below.

Step 1: Search for a protein sample in the GO_compress database. If there is a hit corresponding to the ith GO_compress number, then the ith component of the protein in the 9918-D (dimensional) GO_compress space is assigned 1; otherwise, 0. Thus, the protein can be formulated as

$$P = \left[\, a_1 \;\; a_2 \;\; \cdots \;\; a_i \;\; \cdots \;\; a_{9918} \,\right]^{T} \tag{4}$$
where T is the transpose operator, and

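$$a_i = \begin{cases} 1, & \text{if a hit is found for the } i\text{th GO\_compress number} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$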
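Step 1 can be sketched as building a 0/1 indicator vector over the GO_compress numbers. The code below is an illustration under stated assumptions: the function names are hypothetical, and `go_index` is a toy three-entry lookup standing in for the full 9918-entry GO_compress table.

```python
def go_vector(protein_go_ids, go_index):
    """Return the 0/1 vector of a protein in the GO_compress space.

    protein_go_ids -- GO numbers mapped to the protein (may be empty)
    go_index       -- dict from GO number to its 1-based GO_compress index
    """
    vector = [0] * len(go_index)
    for go in protein_go_ids:
        index = go_index.get(go)  # None when the GO number is not in the table
        if index is not None:
            vector[index - 1] = 1
    return vector


def has_go_hit(vector) -> bool:
    """True when at least one component is 1, i.e. Step 1 found a hit."""
    return any(vector)


# Toy lookup table; the real space has 9918 dimensions.
go_index = {"GO:0000001": 1, "GO:0000002": 2, "GO:0000006": 3}
print(go_vector(["GO:0000006", "GO:0000001"], go_index))
print(has_go_hit(go_vector([], go_index)))
```

A protein with no mapped GO numbers yields the all-zero vector, which is the case handled by Step 2.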

Step 2: If no hit (i.e. no record in the GO_compress database) is found whatsoever, then the protein should be defined in the (20 + 2λ)-D amphiphilic PseAA space (Chou, 2005), as given below

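$$P = \left[\, p_1 \;\; p_2 \;\; \cdots \;\; p_{20} \;\; p_{20+1} \;\; \cdots \;\; p_{20+2\lambda} \,\right]^{T} \tag{6}$$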
where $p_1, p_2, \ldots, p_{20}$ are associated with the amino acid composition reflecting the occurrence frequencies of the 20 native amino acids in the protein (Chou and Zhang, 1993, 1994), and $p_{20+1}, p_{20+2}, \ldots, p_{20+2\lambda}$ are the 2λ correlation factors that reflect its sequence-order pattern through the amphiphilic feature. The protein representation as defined by equation (6) is called the ‘amphiphilic pseudo amino acid composition’ or PseAA, which has the same form as the conventional amino acid composition but contains more components and information. The components in equation (6) can be easily derived according to equations (2)–(6) of Chou (2005).
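To make the shape of the (20 + 2λ)-D vector concrete, here is a rough sketch of an amphiphilic PseAA calculation. It follows one common reading of the general form in Chou (2005), with 20 composition terms followed by 2λ correlation terms built from standardized hydrophobicity and hydrophilicity values, and has not been checked against that paper term by term. The function names, the caller-supplied property tables and the default weight `w` are all assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(scale):
    """Rescale a property table to zero mean and unit standard deviation."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / sd for a in AMINO_ACIDS}


def amphiphilic_pseaa(seq, hydrophobicity, hydrophilicity, lam, w=0.5):
    """Return a (20 + 2*lam)-component vector for a sequence.

    seq            -- sequence using only the 20 standard one-letter codes
    hydrophobicity -- dict: residue -> value (supplied by the caller)
    hydrophilicity -- dict: residue -> value (supplied by the caller)
    lam            -- number of sequence-order tiers (must be < len(seq))
    w              -- weight of the correlation terms (assumed default)
    """
    if not 0 <= lam < len(seq):
        raise ValueError("lam must satisfy 0 <= lam < len(seq)")
    h1, h2 = standardize(hydrophobicity), standardize(hydrophilicity)
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):
        pairs = len(seq) - k
        # Average product of property values for residues k positions apart.
        taus.append(sum(h1[seq[i]] * h1[seq[i + k]] for i in range(pairs)) / pairs)
        taus.append(sum(h2[seq[i]] * h2[seq[i + k]] for i in range(pairs)) / pairs)
    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]
```

Because every component shares the same denominator, the vector sums to 1, and setting `lam = 0` reduces it to the plain amino acid composition.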

Suppose there are N proteins ($P_1, P_2, \ldots, P_N$) which have been classified into M subsets (subcellular locations). For the current case, we have M = 5. Now, for a query protein P, how can we identify which subset it belongs to? We use the OET-KNN rule (Cover and Hart, 1967; Keller et al., 1985; Denoeux, 1995) to deal with this problem. The key idea of the OET-KNN algorithm is to assign a query protein P to the subset that has the highest evidence derived from the K-nearest neighbors of P. For the reader's convenience, a brief introduction to the OET-KNN classifier and its key equations is given in Appendix 1. There are many different ways to measure ‘nearness’ for the OET-KNN classifier, such as Euclidean distance, Hamming distance (Mardia et al., 1979) and Mahalanobis distance (Mahalanobis, 1936; Pillai, 1985; Chou, 1995). Here, we use the following equation to measure the nearness between protein P and $P_i$

$$\delta(P, P_i) = 1 - \frac{P \cdot P_i}{\|P\| \, \|P_i\|} \tag{7}$$
where $P \cdot P_i$ is the dot product of the two vectors, and $\|P\|$ and $\|P_i\|$ are their moduli, respectively. According to equation (7), when $P \equiv P_i$ we have $\delta(P, P_i) = 0$, indicating that the ‘distance’ between the two proteins is zero and hence they have perfect or 100% similarity.
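The nearness measure described here, built from a dot product and two moduli and equal to zero for identical vectors, behaves like one minus the cosine of the angle between the two vectors. The sketch below implements that reading; the function name `nearness` and the choice to raise an error for all-zero vectors are assumptions.

```python
import math


def nearness(p, q):
    """Return 1 - cos(angle) between two equal-length numeric vectors.

    The value is 0 for vectors pointing in the same direction and 1 for
    orthogonal vectors (e.g. GO vectors with no shared component).
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        # An all-zero vector has no direction; such proteins are handled
        # in the PseAA space instead.
        raise ValueError("nearness is undefined for an all-zero vector")
    return 1.0 - dot / (norm_p * norm_q)


print(nearness([1, 0, 1, 0], [1, 0, 1, 0]))  # identical vectors, close to 0
print(nearness([1, 0, 0, 0], [0, 1, 0, 0]))  # no shared GO number
```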

Under the OET-KNN rule (Cover and Hart, 1967; Keller et al., 1985; Denoeux, 1995), the predicted result depends on the parameter K, the number of nearest neighbors of the query protein P. Hereafter we therefore write OET-NN(K) in place of OET-KNN, to make explicit that the predicted result is a function of K.

During the course of prediction, the following self-consistency principle should be followed. If a query protein can be defined in the 9918-D GO_compress space (equation (4)), then the prediction should be carried out based on those proteins in the training dataset that can be defined in the same 9918-D space. If the query protein in the 9918-D GO_compress space is a null (all-zero) vector and hence must be defined instead in the (20 + 2λ)-D or Λ-D PseAA space (equation (6)), then all the proteins in the training dataset should be defined in the same Λ-D space as well. Accordingly, the current hybridization predictor actually consists of two sub-predictors: (a) the OET-NN(K)-GO predictor, which operates in the 9918-D GO_compress space, and (b) the OET-NN(K, Λ)-PseAA predictor, which operates in the Λ-D amphiphilic PseAA space. The former is a function of K, whereas the latter is a function of both K and Λ. For a given learning dataset, different choices of K and Λ lead to different outcomes. To get the optimal success rate, one would have to test different values of K and Λ one by one, which is both time-consuming and tedious. To solve this problem, the following two fusion processes are introduced for the OET-NN(K) and OET-NN(K, Λ) classifiers, respectively.

One-dimensional fusion

This process generates an ensemble classifier by fusing many individual basic OET-NN(K) classifiers, each having a different specified value of K, as formulated by

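$$\text{OET-NN}^{\text{GO}} = \text{OET-NN}(1) \;\forall\; \text{OET-NN}(2) \;\forall\; \cdots \;\forall\; \text{OET-NN}(\Omega) \tag{8}$$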
where the symbol ∀ denotes the fusing operator and $\text{OET-NN}^{\text{GO}}$ the ensemble classifier formed by fusing OET-NN(1), OET-NN(2), ... , and OET-NN(Ω). Here Ω = 10 because preliminary tests indicated that the success rate obtained by the OET-NN(K) classifier trained by the current learning dataset was lower when K > 10.

The ensemble classifier $\text{OET-NN}^{\text{GO}}$ works as follows. Suppose the predicted classification results for the query protein P by the 10 individual classifiers in equation (8) are $C_1, C_2, \ldots, C_{10}$, respectively; i.e.

$$C_i \in \{S_1, S_2, S_3, S_4, S_5\} \qquad (i = 1, 2, \ldots, 10) \tag{9}$$
where ∈ is a symbol in set theory meaning ‘element of’, and $S_1, S_2, S_3, S_4, S_5$ represent the five subsets defined by the five subcellular locations studied here (Fig. 1), and the voting score for the protein P belonging to the kth subset is defined by

$$Y_k = \sum_{i=1}^{10} w_i \, \Delta(C_i, S_k) \qquad (k = 1, 2, \ldots, 5) \tag{10}$$
where $w_i$ is the weight and was set at 1 for simplicity, and the delta function in equation (10) is given by

$$\Delta(C_i, S_k) = \begin{cases} 1, & \text{if } C_i = S_k \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
Thus, the query protein P is predicted to belong to the subset (subcellular location) for which its score in equation (10) is the highest.
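The one-dimensional fusion amounts to a weighted vote over the outputs of the individual classifiers. The sketch below shows the voting step only and assumes the per-classifier predictions are already available; the function name `fuse_votes` is hypothetical, and ties are resolved in favour of the label seen first, which the text does not specify.

```python
def fuse_votes(predictions, weights=None):
    """Return (winning label, score per label) for a list of predictions.

    predictions -- labels predicted by the individual classifiers,
                   e.g. by OET-NN(1), ..., OET-NN(10)
    weights     -- one weight per classifier; all 1 when omitted, as in the text
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    # max() keeps the first label among equal scores (insertion order).
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical outputs of ten basic classifiers for one query protein.
votes = ["cytoplasm", "cytoplasm", "extracell", "cytoplasm", "cytoplasm",
         "plasma membrane", "cytoplasm", "extracell", "cytoplasm", "cytoplasm"]
winner, scores = fuse_votes(votes)
print(winner, scores)
```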

Two-dimensional fusion

This process generates an ensemble classifier by fusing many individual basic OET-NN(K, Λ) classifiers, each having different specified values of K and Λ. For a reason similar to that given above for setting the value of Ω in equation (8), let us consider K = 1, 2, ... , 10, and Λ = 20, 22, ... , 60; i.e.

$$K = i \;\; (i = 1, 2, \ldots, 10), \qquad \Lambda = 20 + 2j \;\; (j = 0, 1, \ldots, 20) \tag{12}$$
Thus, the ensemble classifier obtained by the two-dimensional fusion process can be formulated as

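$$\text{OET-NN}^{\text{Pse}} = \text{OET-NN}(1, 20) \;\forall\; \text{OET-NN}(1, 22) \;\forall\; \cdots \;\forall\; \text{OET-NN}(10, 60) \tag{13}$$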
where the fusion operator ∀ has the same meaning as that of equation (8). The ensemble classifier $\text{OET-NN}^{\text{Pse}}$ works as follows. Suppose the predicted classification results for the query protein P by the 10 × 21 = 210 individual classifiers in equation (13) are

$$C_{i,2j} \in \{S_1, S_2, S_3, S_4, S_5\} \qquad (i = 1, 2, \ldots, 10; \; j = 0, 1, \ldots, 20) \tag{14}$$
where S1, S2, S3, S4, S5 have the same meanings as in equation (9), i.e. represent the five subsets defined by the five subcellular locations studied here (Fig. 1), and the voting score for the query protein P belonging to the kth subset is defined by

$$Y_k = \sum_{i=1}^{10} \sum_{j=0}^{20} w_{i,2j} \, \Delta(C_{i,2j}, S_k) \qquad (k = 1, 2, \ldots, 5) \tag{15}$$
where $w_{i,2j}$ is the weight and was set at 1 for simplicity, and the delta function in equation (15) is given by

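$$\Delta(C_{i,2j}, S_k) = \begin{cases} 1, & \text{if } C_{i,2j} = S_k \\ 0, & \text{otherwise} \end{cases} \tag{16}$$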
Thus, the query protein P is predicted to belong to the subset (subcellular location) for which its score in equation (15) is the highest.
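The two-dimensional fusion repeats the same vote over a grid of (K, Λ) pairs, giving 10 × 21 = 210 basic classifiers. The sketch below only illustrates how the grid is traversed and the votes are tallied; `classify` is a hypothetical callable standing in for one OET-NN(K, Λ) classifier, and all weights are taken as 1.

```python
def fuse_two_dimensional(classify, ks=range(1, 11), dims=range(20, 61, 2)):
    """Return (winning label, number of votes cast) over the (K, dimension) grid.

    classify -- callable taking (k, dim) and returning a predicted label
    ks       -- values of K, here 1 to 10
    dims     -- PseAA dimensions, here 20, 22, ..., 60
    """
    scores = {}
    for k in ks:
        for dim in dims:
            label = classify(k, dim)
            scores[label] = scores.get(label, 0) + 1  # unit weight per vote
    winner = max(scores, key=scores.get)
    return winner, sum(scores.values())


def toy_classifier(k, dim):
    """Stand-in classifier: its answer depends only on the dimension."""
    return "cytoplasm" if dim < 50 else "extracell"


winner, votes_cast = fuse_two_dimensional(toy_classifier)
print(winner, votes_cast)
```

With the stand-in classifier, 150 of the 210 votes go to one label, so the ensemble returns that label regardless of the minority outputs.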

Finally, it should be pointed out that, although using the GO database to predict protein subcellular location has been explored by previous investigators (Chou and Cai, 2003a, 2004b), the predictors formulated there have much less power than the current predictor for the following reasons. (a) The GO approach in Chou and Cai (2003a, 2004b) was operated by the nearest neighbor rule with K = 1 only, which is much less powerful than the ensemble classifiers formulated in equations (8) and (13). (b) The dimension of the GO database space in Chou and Cai (2003a, 2004b) is 1930, but the dimension of the GO database space here is 9918, indicating the need to catch up with the rapid development of GO. Besides, Tables I and III presented here elucidate the relationship between GO and Swiss-Prot more clearly than the previous papers (Chou and Cai, 2003a, 2004b).


    Results and discussion
 
For the proteins listed in the Online Supplementary Materials A and B, we obtained the following results according to Steps 1–2 of the Method section: (a) of the 220 Gram-positive proteins in the learning dataset, 211 got hits in the GO_compress database and hence were defined in the 9918-D GO_compress space (equations (4) and (5)), and the remaining 9 proteins were defined in the Λ-D PseAA space (equation (6)); (b) of the 232 proteins in the testing dataset, 223 got hits and were defined in the 9918-D GO_compress space, and the remaining 9 proteins were defined in the Λ-D PseAA space. Although the overwhelming majority of proteins in the benchmark datasets studied here could be meaningfully defined in the 9918-D GO_compress space, this does not mean that there is no need to include the $\text{OET-NN}^{\text{Pse}}$ predictor because, as shown in Table I, currently ~5.4% of Gram-positive proteins still have no corresponding GO number. Therefore, in practical application, cases do exist where the query proteins cannot be meaningfully defined in the GO system.

Although this problem will eventually be solved with the continuous development of the GO database, keeping the $\text{OET-NN}^{\text{Pse}}$ classifier in the system is harmless and makes the predictor more complete, since the prediction process operates according to the following hierarchy: if a query protein can be defined in the 9918-D GO_compress space, then the classifier $\text{OET-NN}^{\text{GO}}$ is used to predict its subcellular location; otherwise, the classifier $\text{OET-NN}^{\text{Pse}}$ is used.
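The hierarchy just described is a simple two-way dispatch. The sketch below shows that control flow only; `classify_go` and `classify_pse` are hypothetical callables standing in for the two ensemble classifiers, and the second returned value is added here purely to show which branch was taken.

```python
def predict_location(go_vec, pseaa_vec, classify_go, classify_pse):
    """Choose the classifier according to the GO-first hierarchy.

    go_vec       -- 0/1 vector of the query protein in the GO_compress space
    pseaa_vec    -- its amphiphilic PseAA vector, used only as a fallback
    classify_go  -- callable for the GO-based ensemble classifier
    classify_pse -- callable for the PseAA-based ensemble classifier
    Returns (predicted label, name of the branch used).
    """
    if any(go_vec):
        # At least one GO hit: stay in the GO_compress space.
        return classify_go(go_vec), "GO"
    # All-zero GO vector: fall back to the PseAA space.
    return classify_pse(pseaa_vec), "PseAA"


# Stand-in classifiers that ignore their input, for illustration only.
label, branch = predict_location(
    go_vec=[0, 0, 0],
    pseaa_vec=[0.05] * 20,
    classify_go=lambda v: "cytoplasm",
    classify_pse=lambda v: "extracell",
)
print(label, branch)
```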

The prediction quality was tested by jackknife cross-validation and independent dataset validation. The jackknife test is considered one of the most rigorous and objective methods for cross-validation in statistics (see Chou and Zhang, 1995 for a comprehensive review) and has been increasingly used by investigators (Zhou, 1998; Feng, 2001; Zhou and Assa-Munt, 2001; Feng, 2002; Luo et al., 2002; Liu et al., 2005; Wang et al., 2005; Guo et al., 2006; Sun and Huang, 2006; Wen et al., 2006; Xiao et al., 2006; Zhang et al., 2006) in examining the accuracy of various prediction methods. Therefore, the power of a predictor should be measured by the success rate of the jackknife test. The independent dataset test performed here was just a demonstration of practical application.
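In a jackknife test, each sample in turn is left out, the predictor is trained on the remaining samples, and the left-out sample is then predicted. The sketch below shows that loop with a toy 1-nearest-neighbor classifier on one-dimensional data; the function names and the toy data are assumptions for illustration and are unrelated to the benchmark datasets.

```python
def jackknife_success_rate(samples, labels, classify):
    """Return the fraction of samples predicted correctly when left out.

    classify -- callable taking (train_samples, train_labels, query)
                and returning a predicted label
    """
    correct = 0
    for i, query in enumerate(samples):
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if classify(train_samples, train_labels, query) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbor(train_samples, train_labels, query):
    """Toy classifier: label of the closest training value."""
    best = min(range(len(train_samples)),
               key=lambda j: abs(train_samples[j] - query))
    return train_labels[best]


samples = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
labels = ["a", "a", "a", "b", "b", "b"]
print(jackknife_success_rate(samples, labels, nearest_neighbor))
```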

The success rates obtained by jackknife and independent dataset tests for each of the 5 Gram-positive protein subcellular localization sites are given in Table IV, from which we can see that, for those subcellular locations with ~60 protein samples, the success rates are quite high, and that the overall success rates by the jackknife test and independent dataset test are 82.7% and 84.1%, respectively. Therefore, it is expected that, with more data available to improve the learning dataset, particularly the subsets with < 20 protein samples, the success rates will be further enhanced.



 
Table IV. The success rates by jackknife and independent dataset tests for each of the five Gram-positive bacterial protein subcellular localization sites (Fig. 1)

 

    Conclusion
 
With the explosion of newly found protein sequences entering protein databanks, prediction of protein subcellular locations has become increasingly important. In this article, the GO-discrete model was introduced to represent the sample of a protein. On the basis of this representation, the ensemble classifier Gpos-PLoc was developed for predicting the subcellular location of Gram-positive bacterial proteins. Gpos-PLoc was formed by fusing many basic classifiers, each engineered by the OET-KNN rule. The high success rates indicate that proteins, if represented through the GO-discrete model, can be more distinctly clustered according to their different subcellular locations, and that the ensemble classifier presented here is a powerful operator in distinguishing these clusters. Gpos-PLoc is freely available to the public as a web-server.


    Appendix 1: The optimized evidence-theoretic K-nearest neighbors (OET-KNN) classifier
 
For the reader's convenience, a brief introduction to the OET-KNN classifier is given below. For further explanation, refer to Shen and Chou (2005b). Let us consider a problem of classifying N entities into M classes (subcellular locations), which can be formulated as

$$F = \{\Phi_1, \Phi_2, \ldots, \Phi_M\} \tag{A1}$$
The available information is assumed to consist in a training dataset

$$\mathcal{N} = \{(P_1, \theta_1), (P_2, \theta_2), \ldots, (P_N, \theta_N)\} \tag{A2}$$
where the N entities $P_i$ (i = 1, 2, ... , N) are given together with their corresponding pattern (class) labels $\theta_i$ (i = 1, 2, ... , N), which take values in F of equation (A1). According to the KNN (K-nearest neighbors) rule (Cover and Hart, 1967), an unclassified entity P is assigned to the class represented by a majority of its K-nearest neighbors. Owing to its good performance and ease of use, the KNN rule, also called the ‘voting KNN rule’, is quite popular in the pattern recognition community.
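The voting KNN rule can be sketched in a few lines. The code below illustrates the plain majority-vote rule only, not the evidence-theoretic variant; the function names, the squared Euclidean distance used in the example and the toy data are assumptions.

```python
from collections import Counter


def knn_vote(train, query, k, distance):
    """Assign the query to the majority class among its k nearest neighbors.

    train    -- list of (vector, label) pairs
    query    -- vector to classify
    k        -- number of nearest neighbors consulted
    distance -- callable returning a dissimilarity between two vectors
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


train = [([0.0, 0.0], "A"), ([0.0, 1.0], "A"),
         ([5.0, 5.0], "B"), ([6.0, 5.0], "B"), ([5.0, 6.0], "B")]
print(knn_vote(train, [0.2, 0.3], k=3, distance=squared_euclidean))
```

In the example, two of the three nearest neighbors of the query carry label A, so A wins the vote even though B is the larger class overall.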

The ET-KNN (evidence-theoretic K-nearest neighbors) rule is a pattern classification method based on the Dempster–Shafer theory of belief functions (Denoeux, 1995). In the classification process, each neighbor of a pattern to be classified is considered as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset concerned. Such masses are obtained for each of the K nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination (Shafer, 1976). A decision is made by assigning the pattern to the class with the maximum credibility.

Suppose P is a query protein to be classified, and S_K^P is the set of its K nearest neighbors in the training dataset N of equation (A2). Thus, for any P_i ∈ S_K^P, the knowledge that P_i belongs to class Φ_µ ∈ F can be considered as a piece of evidence that increases our belief that P also belongs to Φ_µ. According to the basic belief assignment mapping theory (Shafer, 1976), this item of evidence can be formulated by

[Formula 053MA3]    (A3)

where α_0 is a fixed parameter, γ_µ is a parameter associated with class Φ_µ and D²(P_i, P) is the squared Euclidean distance between P and P_i. The ET-KNN rule did not address how to select these parameters optimally. In 1998, an optimization procedure was proposed that determines optimal or near-optimal parameter values from the data by minimizing an error function (Zouhal and Denoeux, 1998). The OET-KNN rule obtained through this optimization was observed to yield a substantial improvement in classification accuracy.
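Equation (A3) survives here only as an image placeholder. For orientation, the basic belief assignment used in the evidence-theoretic KNN literature (Denoeux, 1995) typically takes the exponential form sketched below. This is an assumed form rather than a transcription of the source formula: the superscript (i) labelling the ith neighbor is introduced for illustration, and the exact way the class parameter enters the exponent may differ in the original.

```latex
\begin{align*}
m^{(i)}\bigl(\{\Phi_\mu\}\bigr) &= \alpha_0 \exp\!\left[-\gamma_\mu\, D^2(P_i, P)\right], \\
m^{(i)}(F) &= 1 - \alpha_0 \exp\!\left[-\gamma_\mu\, D^2(P_i, P)\right], \\
m^{(i)}(A) &= 0 \quad \text{for every other } A \subseteq F,
\end{align*}
\qquad 0 < \alpha_0 < 1,\ \gamma_\mu > 0 .
```

Under this form, a neighbor at zero distance commits mass α_0 to its own class and the remaining 1 − α_0 to the whole frame F, and its support decays toward total ignorance as the distance grows.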

The belief function of P belonging to class Φ_µ is a combination of the evidence from its K nearest neighbors, and can be formulated as

[Formula 053MA4]    (A4)

where ⊕ is called the orthogonal sum, which is commutative and associative, and hence equation (A4) can be expressed as

[Formula 053MA5]    (A5)

where the symbol ⊕_{i=1}^{K} represents the orthogonal sum from i = 1 to K. According to Dempster's rule (Shafer, 1976), the belief function of equation (A5) can be expressed as

[Formula 053MA6]    (A6)

where S_{K,i}^P is the ith possible subset of S_K^P, and ⊆, ∩ and ∅ are the set-theoretic symbols for ‘contained in’, ‘intersection’ and the empty set, respectively.

A decision is made by assigning the query protein P to the class for which the belief or credibility function of equation (A6) has the maximum value; i.e. if

[Formula 053MA7]    (A7)

where µ = 1, 2, ..., or M and the operator Max means taking the maximum among the values in the brackets, then Φ_µ is the class predicted for the query protein P.
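The procedure of this appendix (one mass function per neighbor, pooling by the orthogonal sum, decision by maximum belief) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `dempster_combine`, `belief` and `et_knn_predict` are hypothetical, the exponential mass form follows Denoeux (1995) rather than the source's own formula image, and the default values of `alpha0` and the per-class `gammas` are placeholders, not optimized OET-KNN parameters.

```python
import math
from functools import reduce


def dempster_combine(m1, m2):
    """Orthogonal sum of two mass functions given as {frozenset: mass} dicts."""
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            common = set_a & set_b
            if common:
                combined[common] = combined.get(common, 0.0) + mass_a * mass_b
            else:
                # Product mass that would fall on the empty set.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the orthogonal sum is undefined")
    # Renormalize so that the remaining masses sum to one.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def belief(mass, subset):
    """Bel(A): total mass of the focal sets contained in `subset`."""
    return sum(v for s, v in mass.items() if s <= subset)


def et_knn_predict(neighbors, classes, alpha0=0.95, gammas=None):
    """Classify from the K nearest neighbors.

    `neighbors` is a list of (squared_distance, label) pairs and `classes`
    lists at least two class labels. The mass form
    alpha0 * exp(-gamma * squared_distance) is an assumption of this sketch.
    """
    frame = frozenset(classes)
    gammas = gammas or {}
    masses = []
    for dist_sq, label in neighbors:
        support = alpha0 * math.exp(-gammas.get(label, 1.0) * dist_sq)
        # Each neighbor supports its own class; the rest goes to the frame.
        masses.append({frozenset([label]): support, frame: 1.0 - support})
    pooled = reduce(dempster_combine, masses)
    scores = {c: belief(pooled, frozenset([c])) for c in classes}
    return max(scores, key=scores.get), scores


classes = ["cell wall", "cytoplasm", "extracell", "periplasm", "plasma membrane"]
neighbors = [(0.1, "cytoplasm"), (0.3, "cytoplasm"), (0.2, "cell wall")]
label, scores = et_knn_predict(neighbors, classes)
print(label)  # prints "cytoplasm"
```

Representing focal sets as `frozenset` keys keeps the combination rule generic at the cost of speed; because every neighbor's mass function has only a singleton and the whole frame as focal sets, a closed-form product would give the same result more cheaply.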


    Footnotes
 
Edited by Michael Deem


    References
 
Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., et al. (2004) Nucleic Acids Res. 32:D115–D119.

Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. (2000) Nat. Genet. 25:25–29.

Bairoch A. and Apweiler R. (2000) Nucleic Acids Res. 25:31–36.

Cai Y.D., Zhou G.P., Chou K.C. (2003) Biophys. J. 84:3257–3263.

Cedano J., Aloy P., Pérez-Pons J.A., Querol E. (1997) J. Mol. Biol. 266:594–600.

Chou J.J. and Zhang C.T. (1993) J. Theor. Biol. 161:251–262.

Chou K.C. (1995) Proteins: Struct., Funct. Genet. 21:319–344.

Chou K.C. (2001) Proteins: Struct., Funct. Genet. 43:246–255 (Erratum: ibid., 2001, 44, 60).

Chou K.C. (2004) Curr. Med. Chem. 11:2105–2134.

Chou K.C. (2005) Bioinformatics 21:10–19.

Chou K.C. and Cai Y.D. (2002) J. Biol. Chem. 277:45765–45769.

Chou K.C. and Cai Y.D. (2003a) Biochem. Biophys. Res. Commun. 311:743–747.

Chou K.C. and Cai Y.D. (2003b) J. Cell. Biochem. 90:1250–1260 (Addendum: ibid., 2004, 91, 1085).

Chou K.C. and Cai Y.D. (2004a) Biochem. Biophys. Res. Commun. 321:1007–1009 (Corrigendum: ibid., 2005, 329, 1362).

Chou K.C. and Cai Y.D. (2004b) Biochem. Biophys. Res. Commun. 320:1236–1239.

Chou K.C. and Cai Y.D. (2005) J. Chem. Inform. Model. 45:407–413.

Chou K.C. and Cai Y.D. (2006) Biochem. Biophys. Res. Commun. 339:1015–1020.

Chou K.C. and Elrod D.W. (1999) Protein Eng. 12:107–118.

Chou K.C. and Shen H.B. (2006) J. Proteome Res. 5:1888–1897.

Chou K.C. and Zhang C.T. (1994) J. Biol. Chem. 269:22014–22020.

Chou K.C. and Zhang C.T. (1995) Crit. Rev. Biochem. Mol. Biol. 30:275–349.

Cover T.M. and Hart P.E. (1967) IEEE Trans. Inform. Theory IT-13:21–27.

Denoeux T. (1995) IEEE Trans. Syst. Man Cybern. 25:804–813.

Feng Z.P. (2001) Biopolymers 58:491–499.

Feng Z.P. (2002) In Silico Biol. 2:291–303.

Gardy J.L., Spencer C., Wang K., Ester M., Tusnady G.E., Simon I., Hua S., deFays K., Lambert C., Nakai K., et al. (2003) Nucleic Acids Res. 31:3613–3617.

Guo Y.Z., Li M., Lu M., Wen Z., Wang K., Li G., Wu J. (2006) Amino Acids 30:397–402.

Keller J.M., Gray M.R., Givens J.A. (1985) IEEE Trans. Syst. Man Cybern. 15:580–585.

Liu H., Yang J., Ling J.G., Chou K.C. (2005) Biochem. Biophys. Res. Commun. 338:1005–1011.

Lubec G., Afjehi-Sadat L., Yang J.W., John J.P. (2005) Prog. Neurobiol. 77:90–127.

Luo R.Y., Feng Z.P., Liu J.K. (2002) Eur. J. Biochem. 269:4219–4225.

Mahalanobis P.C. (1936) Proc. Natl. Inst. Sci. India 2:49–55.

Mardia K.V., Kent J.T., Bibby J.M. (1979) Multivariate Analysis (Academic Press, London) (Chapter 11 Discriminant analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 Cluster analysis (pp. 322–381)).

Nakai K. (2000) Adv. Protein Chem. 54:277–344.

Nakai K. and Horton P. (1999) Trends Biochem. Sci. 24:34–36.

Nakai K. and Kanehisa M. (1991) Proteins: Struct. Funct. Genet. 11:95–110.

Nakashima H. and Nishikawa K. (1994) J. Mol. Biol. 238:54–61.

Pillai K.C.S. (1985) In Kotz S. and Johnson N.L. (Eds.), Encyclopedia of Statistical Sciences, Vol. 5 (John Wiley & Sons, New York), pp. 176–181 (This reference also presents a brief biography of Mahalanobis, who was a man of great originality and who made considerable contributions to statistics).

Shafer G. (1976) A Mathematical Theory of Evidence (Princeton University Press, Princeton, NJ).

Shen H.B. and Chou K.C. (2005a) Biochem. Biophys. Res. Commun. 337:752–756.

Shen H.B. and Chou K.C. (2005b) Biochem. Biophys. Res. Commun. 334:288–292.

Sun X.D. and Huang R.B. (2006) Amino Acids 30:469–475.

Wang G.L. and Dunbrack R.L. Jr. (2003) Bioinformatics 19:1589–1591.

Wang M., Yang J., Xu Z.J., Chou K.C. (2005) J. Theor. Biol. 232:7–15.

Wen Z., Li M., Li Y., Guo Y., Wang K. (2006) Amino Acids doi:10.1007/S00726-006-0341-y.

Xiao X., Shao S., Ding Y., Huang Z., Huang Y., Chou K.C. (2005) Amino Acids 28:57–61.

Xiao X., Shao S.H., Huang Z.D., Chou K.C. (2006) J. Comput. Chem. 27:478–482.

Zhang S.W., Pan Q., Zhang H.C., Shao Z.C., Shi J.Y. (2006) Amino Acids 30:461–468.

Zhou G.P. (1998) J. Protein Chem. 17:729–738.

Zhou G.P. and Assa-Munt N. (2001) Proteins: Struct. Funct. Genet. 44:57–59.

Zhou G.P. and Doctor K. (2003) Proteins: Struct. Funct. Genet. 50:44–48.

Zouhal L.M. and Denoeux T. (1998) IEEE Trans. Syst. Man Cybern. 28:263–271.

Zuber B., Haenni M., Ribeiro T., Minnig K., Lopes F., Moreillon P., Dubochet J. (2006) J. Bacteriol. 188:6652–6660.

Received October 11, 2006; revised November 20, 2006; accepted November 22, 2006.

