PEDS Advance Access published online on January 23, 2007
Protein Engineering Design and Selection, doi:10.1093/protein/gzl053
Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins
1 Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 1954 Hua-Shan Road, Shanghai 200030, China 2 Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA
3 To whom correspondence should be addressed. E-mail: kchou{at}san.rr.com
| Abstract |
|---|
|
|
|---|
A statistical analysis indicated that, of the 35 016 Gram-positive bacterial proteins from the recent Swiss-Prot database, about 57% are without subcellular location annotations. In the gene ontology database, the corresponding percentage is about 67%, meaning that the percentage of proteins without subcellular component annotations is even higher. With the avalanche of gene products generated in the post-genomic era, the number of such location-unknown entries will continuously increase. It is highly desirable to develop an automated method for timely and accurate identification of their subcellular localization, because the information thus obtained is very useful for both basic research and drug discovery practice. In view of this, an ensemble classifier called Gpos-PLoc was developed for predicting Gram-positive protein subcellular localization. The new predictor is built by fusing many basic classifiers, each of which was engineered according to the optimized evidence-theoretic K-nearest neighbors rule. As a demonstration, tests were performed on Gram-positive proteins among the following five subcellular location sites: (1) cell wall, (2) cytoplasm, (3) extracell, (4) periplasm and (5) plasma membrane. To eliminate redundancy and homology bias, only those proteins which have < 25% sequence identity to any other in the same subcellular location were allowed to be included in the benchmark datasets. The overall success rates thus achieved by Gpos-PLoc were > 80% for both the jackknife cross-validation test and the independent dataset test, implying that Gpos-PLoc might become a very useful vehicle for expediting the analysis of Gram-positive bacterial proteins. Gpos-PLoc is freely accessible to the public as a web-server at http://202.120.37.186/bioinf/Gpos/. To support the need of many investigators in the relevant areas, a downloadable file is provided at the same website to list the results identified by Gpos-PLoc for 31 898 Gram-positive bacterial protein entries in the Swiss-Prot database that either have no subcellular location annotation or are annotated with uncertain terms such as probable, potential, perhaps and by similarity. Such large-scale results will be updated once a year to include the new entries of Gram-positive bacterial proteins and reflect the continuous development of Gpos-PLoc.
Keywords: amphiphilic pseudo amino acid composition/fusion/gene ontology/Gram-positive/OET-KNN rule
| Introduction |
|---|
|
|
|---|
Bacteria can be both harmful and useful to the environment and to animals, including humans. Therefore, information about the proteins in the bacterial cell, such as their function and subcellular location, is very useful for screening candidates in drug design or selecting proteins for a special target. Bacteria can be divided into two groups: Gram-positive and Gram-negative. Gram-positive bacteria are those that are stained dark blue or violet by Gram staining, whereas Gram-negative bacteria cannot retain the stain, instead taking up the counter-stain and appearing red or pink.
Extensive studies have been conducted to develop methods for predicting Gram-negative protein subcellular location (see, e.g. Nakai and Kanehisa, 1991; Nakai and Horton, 1999; Nakai, 2000; Gardy et al., 2003). However, very few reports have so far addressed Gram-positive proteins in this regard.
According to the Swiss-Prot database (Bairoch and Apweiler, 2000), version 50.0 released on 30th May 2006, the total number of Gram-positive protein entries is 35 889. After excluding those annotated as fragment or containing < 50 amino acid residues, the number is reduced to 35 016, of which 15 236 have subcellular location annotations (Item 1 of Table I). However, of the 15 236 proteins, 3118 are annotated with experimental observations (Item 2 of Table I) and 12 118 are annotated with uncertain labels such as probable, potential, perhaps and by similarity (Item 3 of Table I). The uncertain annotations cannot be used as robust data for training a solid predictor. Actually, proteins with uncertain annotations also belong to the targets of identification either by newly developed predictors or by further experiments.
|
Such a gap would become even wider if a similar statistical analysis was conducted based on the gene ontology (GO) database (Ashburner et al., 2000
Therefore, the number of Gram-positive proteins that have reliable subcellular location annotations is 3118 (Item 2 of Table I), which is about 9% of all the Gram-positive protein entries concerned. In other words, there are (35 016 − 3118) = 31 898 Gram-positive proteins whose subcellular locations need to be identified or further confirmed.
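The counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated in the text; the variable names are illustrative and not part of the original study. With these counts the unannotated fraction comes out at roughly 56.5% and the reliably annotated fraction at roughly 8.9%.

```python
# Counts quoted in the text (Swiss-Prot release 50.0, after filtering).
total_entries = 35_016   # Gram-positive entries kept after filtering
annotated = 15_236       # entries with any subcellular location annotation
experimental = 3_118     # annotated from experimental observations
uncertain = 12_118       # annotated as probable, potential, by similarity, ...

# The two annotation classes must add up to the annotated total.
assert experimental + uncertain == annotated

unannotated = total_entries - annotated
to_identify = total_entries - experimental  # need identification or confirmation

pct_unannotated = 100 * unannotated / total_entries
pct_reliable = 100 * experimental / total_entries

print(f"without annotation:     {unannotated} ({pct_unannotated:.1f}%)")
print(f"reliably annotated:     {experimental} ({pct_reliable:.1f}%)")
print(f"to identify or confirm: {to_identify}")
```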
With the rapid increase of gene products in the post-genomic era, the gap between the newly found protein sequences and the knowledge of their subcellular location is expected to widen continuously. To use these new proteins in a timely manner for basic research and drug discovery (Chou, 2004; Lubec et al., 2005), it is highly desirable to develop an effective method to bridge this gap, and the present study was initiated in an attempt to address the challenge with a focus on Gram-positive proteins.
| Materials |
|---|
|
|
|---|
Protein sequences were collected from the Swiss-Prot database (Bairoch and Apweiler, 2000
25% sequence identity to any other in a same subcellular location.
After strictly following the above six criteria, we obtained 452 Gram-positive proteins, of which 14 belonged to cell wall, 196 to cytoplasm, 108 to extracell, 5 to periplasm and 129 to plasma membrane (Fig. 1). It is instructive to point out here that, for quite a long time, many thought that there was no periplasm in Gram-positive bacteria. However, with the technique of cryo-electron microscopy, the existence of a periplasmic space between the plasma membrane and the thick peptidoglycan layer of Gram-positive bacteria was observed very recently (Zuber et al., 2006), further supporting the classification scheme illustrated in Fig. 1. The 452 Gram-positive proteins thus obtained form a dataset S0, which is a union of the following five subsets
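$$S_{0} = S_{1} \cup S_{2} \cup S_{3} \cup S_{4} \cup S_{5} \tag{1}$$
The dataset S0 was further divided into a learning dataset SL and a testing dataset ST such that
$$S_{0} = S_{L} \cup S_{T}, \qquad S_{L} \cap S_{T} = \varnothing \tag{2}$$
where ∪, ∩ and Ø represent the symbols for union, intersection and empty set in set theory, respectively. Protein samples in the corresponding subsets of SL and ST are randomly assigned according to the following bracket percentage distribution criterion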
|
| 3 |
|
|
| Method |
|---|
|
|
|---|
The sequential model and discrete model are often used for predicting protein subcellular location. In the sequential model, the sample of a protein is represented by its amino acid sequence, and the sequence similarity search-based tools like BLAST are used to conduct prediction. However, this approach fails to work when the query protein does not have significant homology to proteins of known location. In the discrete model, the sample of a protein is represented by a set of discrete numbers. The simplest one is the AA-discrete model in which the sample of a protein is represented by its amino acid composition (AA) (e.g. see Nakashima and Nishikawa, 1994
The current predictor is called Gpos-PLoc, which was established on the basis of two cornerstones: one is the GO-PseAA discrete model and the other is the fusion OET-KNN (optimized evidence-theoretic K-nearest neighbors) operating engine. The former formulates the protein samples by hybridizing GO (Ashburner et al., 2000) and the amphiphilic PseAA (Chou, 2005), as detailed in Chou and Shen (2006); the latter is a powerful ensemble classifier formed by fusing many basic individual classifiers, each of which is engineered according to the OET-KNN rule (Keller et al., 1985; Denoeux, 1995).
One of the important advantages in using the GO-PseAA discrete model is that the protein samples mapped into the GO space will be clustered in a way distinctly correlated with their subcellular locations, so as to result in a high prediction quality even if the numbers of proteins in some of the training subsets are not as large as usually required. The procedures in using the GO-PseAA discrete model to represent a protein sample can be described as follows.
Mapping UniProtKB/Swiss-Prot protein entries (Apweiler et al., 2004) to the GO database, one can get a list of data called gene_association.goa_uniprot, where each UniProtKB/Swiss-Prot protein entry corresponds to one or several GO numbers. In this study, such a data file was directly downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ (released on 4th March 2006). As shown in Table III, the relationships between the UniProtKB/Swiss-Prot protein entries (accession numbers) and the GO numbers may be one-to-many, reflecting the biological reality that a particular protein may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell (Ashburner et al., 2000). It can be seen from Table III that, for those proteins annotated with 'subcellular location unknown' in the Swiss-Prot database, their corresponding GO numbers in the GO database are also annotated with 'cellular component unknown' (e.g. proteins with accession numbers P82679, Q9RC23 and Q7S1D3), and that even for some proteins whose subcellular locations are clearly annotated in the Swiss-Prot database, their corresponding GO numbers in the GO database are annotated with 'cellular component unknown' (e.g. protein with accession number P00782).
|
Also, because the current GO database is not complete yet, some protein entries (such as P0A5B7, O53077 and P0A5Q2) have no corresponding GO numbers, i.e. no mapping records at all in the GO database, and hence are not included in the data list of gene_association.goa_uniprot.
Furthermore, the GO numbers are not consecutive. For easier handling, a reorganization and compression procedure was applied to renumber them. For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000004, GO:0000006, ... , GO:0051912 would become GO_compress:0000001, GO_compress:0000002, GO_compress:0000003, GO_compress:0000004, GO_compress:0000005, ... , GO_compress:0009918, respectively. The GO database thus obtained is called the GO_compress database, whose dimension was reduced from 51 912 in the original GO database to 9918. Each of the 9918 entities in the GO_compress database served as a base to define a protein sample.
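The renumbering described above amounts to mapping the sorted set of GO numbers that actually occur onto consecutive indices. The following is a minimal sketch of that idea; the function names `build_go_compress` and `format_compressed` are hypothetical, and the five-entry input is a toy list rather than the real GO database.

```python
def build_go_compress(go_ids):
    """Map sparse GO identifiers onto consecutive 1-based indices.

    go_ids: iterable of strings such as "GO:0000006".
    Returns a dict {original_id: compressed_index}.
    """
    # Sort by the numeric part so the compressed order follows the original order.
    unique_sorted = sorted(set(go_ids), key=lambda g: int(g.split(":")[1]))
    return {go_id: idx for idx, go_id in enumerate(unique_sorted, start=1)}


def format_compressed(index):
    """Render a compressed index in the GO_compress:0000005 style used in the text."""
    return f"GO_compress:{index:07d}"


# Toy example: GO:0000005 is absent, so GO:0000006 becomes the 5th entry.
toy_ids = ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000004", "GO:0000006"]
mapping = build_go_compress(toy_ids)
print(format_compressed(mapping["GO:0000006"]))
```

Applied to the full list of GO numbers in use, the size of the resulting dictionary would be the dimension of the GO_compress space (9918 in this study).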
Unfortunately, the current GO numbers fail to give complete coverage, in the sense that some proteins might not belong to any of the GO numbers mentioned above. Although the problem will gradually become trivial or eventually be solved with the continuous development of the GO database, to tackle it right now a hybridization approach was introduced by fusing the GO approach and the amphiphilic pseudo amino acid composition (PseAA) approach (Chou, 2005; Chou and Cai, 2005), as described below.
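Step 1: Search for the protein sample in the GO_compress database. If there is a hit corresponding to the ith GO_compress number, then the ith component of the protein in the 9918-D (dimensional) GO_compress space is assigned 1; otherwise, 0. Thus, the protein can be formulated as
$$P = [a_{1}, a_{2}, \ldots, a_{i}, \ldots, a_{9918}] \tag{4}$$
where
$$a_{i} = \begin{cases} 1, & \text{if there is a hit for the } i\text{th GO\_compress number} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$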
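Step 1 can be illustrated with a short encoding routine. This is a minimal sketch under the assumption that the hits for a protein are already available as 1-based GO_compress indices; the helper names `encode_go_vector` and `has_go_representation` are hypothetical.

```python
GO_COMPRESS_DIM = 9918  # dimension of the GO_compress space stated in the text


def encode_go_vector(hit_indices, dim=GO_COMPRESS_DIM):
    """Return a 0/1 list whose ith component is 1 when the protein hits
    the ith GO_compress number (indices are 1-based, as in the text)."""
    vec = [0] * dim
    for i in hit_indices:
        if not 1 <= i <= dim:
            raise ValueError(f"GO_compress index {i} outside 1..{dim}")
        vec[i - 1] = 1
    return vec


def has_go_representation(vec):
    """False for a naught vector, i.e. a protein with no GO_compress hit."""
    return any(vec)


# Hypothetical protein with hits on the 3rd and 5th GO_compress numbers.
protein = encode_go_vector([3, 5])
print(sum(protein), has_go_representation(protein))

# A protein without any hit yields a naught vector and falls through to Step 2.
no_hit = encode_go_vector([])
print(has_go_representation(no_hit))
```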
Step 2: If no hit (i.e. no record in the GO_compress database) is found whatsoever, then the protein should be defined in the (20 + 2λ)-D amphiphilic PseAA space (Chou, 2005), as given below
$$P = [p_{1}, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}, p_{20+\lambda+1}, \ldots, p_{20+2\lambda}] \tag{6}$$
where p1, ..., p20 are associated with the conventional amino acid composition of the protein, and p20+1, ..., p20+2λ are the 2λ correlation factors that reflect its sequence-order pattern through the amphiphilic feature. The protein representation as defined by equation (6) is called the amphiphilic pseudo amino acid composition or PseAA, which has the same form as the conventional amino acid composition but contains more components and information. The components in equation (6) can be easily derived according to equations (2)–(6) of Chou (2005).
Suppose there are N proteins (P1, P2, ... , PN) which have been classified into M subsets (subcellular locations). For the current case, M = 5. Now, for a query protein P, how can we identify which subset it belongs to? We shall use the OET-KNN rule (Cover and Hart, 1967; Keller et al., 1985; Denoeux, 1995) to deal with this problem. The essence of the OET-KNN algorithm is to assign a query protein P to the subset that has the highest evidence derived from the K-nearest neighbors of P. For the reader's convenience, a brief introduction to the OET-KNN classifier and its key equations is given in Appendix 1. There are many different definitions to measure the nearness for the OET-KNN classifier, such as Euclidean distance, Hamming distance (Mardia et al., 1979) and Mahalanobis distance (Mahalanobis, 1936; Pillai, 1985; Chou, 1995). Here, we use the following equation to measure the nearness between protein P and Pi
|
| 7 |
When P is identical to Pi, the distance given by equation (7) is zero, indicating that the two proteins have perfect or 100% similarity.
The result predicted by the OET-KNN algorithm (Cover and Hart, 1967; Keller et al., 1985; Denoeux, 1995) depends on the selection of the parameter K, the number of the nearest neighbors to the query protein P. Hereafter we shall therefore write OET-NN(K) in place of OET-KNN, to make explicit that the predicted result is a function of K.
During the course of prediction, the following self-consistency principle should be followed. If a query protein can be defined in the 9918-D GO_compress space (equation (4)), then the prediction should be carried out based on those proteins in the training dataset that can be defined in the same 9918-D space. If the query protein in the 9918-D GO_compress space is a naught vector and hence must be defined instead in the (20 + 2λ)-D, or Λ-D with Λ = 20 + 2λ, PseAA space (equation (6)), then all the proteins in the training dataset should be defined in the same Λ-D space as well. Accordingly, the current hybridization predictor actually consists of two sub-predictors: (a) the OET-NN(K)-GO predictor that operates in the 9918-D GO_compress space, and (b) the OET-NN(K, Λ)-PseAA predictor that operates in the Λ-D amphiphilic PseAA space. The former is a function of K, whereas the latter is a function of both K and Λ. For a given learning dataset, selection of different K and Λ would result in different outcomes. To get the optimal success rate, one would have to test the results by using different values of K and Λ one by one, which is both time-consuming and tedious. To solve this problem, the following two fusion processes are introduced for the OET-NN(K) and OET-NN(K, Λ) classifiers, respectively.
The first fusion process generates an ensemble classifier by fusing many individual basic OET-NN(K) classifiers, each having a different specified value of K, as formulated by equation (8), in which OET-NN(1), OET-NN(2), ... , and OET-NN(10) are combined through a fusing operator to form the GO-based ensemble classifier. The upper limit of K was set to 10 because preliminary tests indicated that the success rate obtained by the OET-NN(K) classifier trained by the current learning dataset was lower when K > 10.
The GO-based ensemble classifier of equation (8) works as follows. Suppose the predicted classification results for the query protein P by the 10 individual classifiers in equation (8) are C1, C2, ... , C10, respectively; i.e.
$$C_{1}, C_{2}, \ldots, C_{10} \in \{S_{1}, S_{2}, S_{3}, S_{4}, S_{5}\} \tag{9}$$
where ∈ is a symbol in set theory meaning 'element of', and S1, S2, S3, S4, S5 represent the five subsets defined by the five subcellular locations studied here (Fig. 1). The voting score for the protein P belonging to the kth subset is defined by
|
| 10 |
|
| 11 |
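The voting step can be sketched in a few lines. The code below is an illustration only: it assumes that each basic classifier casts one vote, that all weights default to 1, and that ties are broken by the order of the location list; none of these details is confirmed by the text, and the names `LOCATIONS` and `fuse_by_vote` are hypothetical.

```python
LOCATIONS = ["cell wall", "cytoplasm", "extracell", "periplasm", "plasma membrane"]


def fuse_by_vote(predictions, weights=None):
    """Fuse the outputs of several basic classifiers by (weighted) voting.

    predictions: one predicted location per basic classifier,
                 e.g. the results of OET-NN(1), ..., OET-NN(10).
    weights:     optional per-classifier weights; all 1 when omitted (assumption).
    Returns (winning_location, score_per_location).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    scores = {loc: 0.0 for loc in LOCATIONS}
    for label, w in zip(predictions, weights):
        scores[label] += w
    # max() keeps the first maximal entry in LOCATIONS order, which gives
    # a deterministic tie-break (an assumption of this sketch).
    winner = max(LOCATIONS, key=lambda loc: scores[loc])
    return winner, scores


# Hypothetical outputs of ten basic classifiers for one query protein.
votes = ["cytoplasm"] * 6 + ["plasma membrane"] * 3 + ["extracell"]
winner, scores = fuse_by_vote(votes)
print(winner, scores["cytoplasm"], scores["plasma membrane"])
```

The same routine would apply unchanged to the 210 basic classifiers of the PseAA-based ensemble, since only the number of votes differs.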
The second fusion process generates an ensemble classifier by fusing many individual basic OET-NN(K, Λ) classifiers, each having different specified values of K and Λ. For a reason similar to that given above for the upper limit of K in equation (8), let us consider K = 1, 2, ... , 10, and Λ = 20, 22, ... , 60; i.e.
|
| 12 |
|
| 13 |
where the fusing operator has the same meaning as in equation (8). The PseAA-based ensemble classifier works as follows. Suppose the predicted classification results for the query protein P by the 10 × 21 = 210 individual classifiers in equation (13) are
|
| 14 |
|
| 15 |
|
| 16 |
Finally, it should be pointed out that, although using the GO database to predict protein subcellular location has been explored by previous investigators (Chou and Cai, 2003a, 2004b), the predictors formulated there have much less power than the current predictor for the following reasons. (a) The GO approach in Chou and Cai (2003a, 2004b) was operated by the nearest neighbor rule with K = 1 only, which is much less powerful than the ensemble classifiers formulated in equations (8) and (13). (b) The dimension of the GO database space in Chou and Cai (2003a, 2004b) is 1930, but the dimension of the GO database space here is 9918, indicating the need to catch up with the rapid development of GO. Besides, Tables I and III presented here elucidate the relationship between GO and Swiss-Prot more clearly than the previous papers (Chou and Cai, 2003a, 2004b).
| Results and discussion |
|---|
|
|
|---|
For the proteins listed in the Online Supplementary Materials A and B, we obtained the following results according to Steps 1–2 of the Method section: (a) of the 220 Gram-positive proteins in the learning dataset, 211 got hits in the GO_compress database and hence were defined in the 9918-D GO_compress space (equations (4) and (5)), and the remaining 9 proteins were defined in the Λ-D PseAA space (equation (6)); (b) of the 232 proteins in the testing dataset, 223 got hits and were defined in the 9918-D GO_compress space, and the remaining 9 proteins were defined in the Λ-D PseAA space. Although the overwhelming majority of proteins in the benchmark datasets studied here could be meaningfully defined in the 9918-D GO_compress space, this does not mean that there is no need to include the PseAA-based ensemble predictor because, as shown in Table I, currently about 5.4% of Gram-positive proteins still have no corresponding GO number. Therefore, in practical application, cases do exist where the query proteins cannot be meaningfully defined in the GO system.
Although such a problem will eventually be solved with the continuous development of the GO database, keeping the PseAA-based ensemble classifier in the system is harmless and makes the predictor more complete, since the prediction process operates according to the following hierarchy: if a query protein can be defined in the 9918-D GO_compress space, then the GO-based ensemble classifier is used to predict its subcellular location; otherwise, the PseAA-based ensemble classifier is used.
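The hierarchy just described is a simple two-branch dispatch. The following minimal sketch shows only that control flow; the two classifier arguments are placeholder callables (hypothetical stand-ins for the GO-based and PseAA-based ensemble classifiers), not the actual Gpos-PLoc engines.

```python
def predict_location(go_vector, pse_vector, go_classifier, pse_classifier):
    """Route a query protein to the appropriate ensemble classifier.

    go_vector:      0/1 representation in the GO_compress space.
    pse_vector:     amphiphilic PseAA representation of the same protein.
    go_classifier:  callable used when the GO vector has at least one hit.
    pse_classifier: fallback callable used when the GO vector is a naught vector.
    """
    if any(go_vector):
        return go_classifier(go_vector)
    return pse_classifier(pse_vector)


# Placeholder classifiers, used only to make the routing visible.
def dummy_go(vec):
    return "predicted by GO-based ensemble"


def dummy_pse(vec):
    return "predicted by PseAA-based ensemble"


with_hit = predict_location([0, 1, 0], [0.1, 0.2], dummy_go, dummy_pse)
without_hit = predict_location([0, 0, 0], [0.1, 0.2], dummy_go, dummy_pse)
print(with_hit)
print(without_hit)
```

In line with the self-consistency principle, each branch would be trained only on the proteins defined in the corresponding space.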
The prediction quality was tested by jackknife cross-validation and independent dataset validation. The jackknife test is considered one of the most rigorous and objective methods for cross-validation in statistics (see Chou and Zhang, 1995 for a comprehensive review) and has been increasingly used by investigators (Zhou, 1998; Feng, 2001; Zhou and Assa-Munt, 2001; Feng, 2002; Luo et al., 2002; Liu et al., 2005; Wang et al., 2005; Guo et al., 2006; Sun and Huang, 2006; Wen et al., 2006; Xiao et al., 2006; Zhang et al., 2006) in examining the accuracy of various prediction methods. Therefore, the power of a predictor should be measured by the success rate of the jackknife test. The independent dataset test performed here was just a demonstration of practical application.
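In a jackknife (leave-one-out) test, each protein is singled out in turn as the query and predicted by a classifier trained on all the remaining proteins. The sketch below illustrates the procedure on toy two-dimensional data; it deliberately substitutes a plain 1-nearest-neighbor rule with Euclidean distance for the OET-KNN engine, and all names and data points are hypothetical.

```python
import math


def nearest_neighbor_label(query, samples):
    """Plain 1-nearest-neighbour rule with Euclidean distance.
    This is a stand-in for illustration, not the OET-KNN classifier."""
    best = min(samples, key=lambda s: math.dist(query, s[0]))
    return best[1]


def jackknife_success_rate(dataset, classify=nearest_neighbor_label):
    """Leave each sample out in turn, predict it from the rest,
    and return the fraction of correct predictions."""
    correct = 0
    for i, (features, label) in enumerate(dataset):
        training = dataset[:i] + dataset[i + 1:]  # everything except sample i
        if classify(features, training) == label:
            correct += 1
    return correct / len(dataset)


# Toy dataset: two well-separated clusters labelled "A" and "B".
toy = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B"),
]
rate = jackknife_success_rate(toy)
print(f"jackknife success rate: {rate:.2f}")
```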
The success rates obtained by jackknife and independent dataset tests for each of the 5 Gram-positive protein subcellular localization sites are given in Table IV, from which we can see that, for those subcellular locations with
60 protein samples, the success rates are quite high, and that the overall success rates by the jackknife test and independent dataset test are 82.7% and 84.1%, respectively. Therefore, it is expected that, with more data available to improve the learning dataset, particularly the subsets with < 20 protein samples, the success rates will be further enhanced.
|
| Conclusion |
|---|
|
|
|---|
With the explosion of newly found protein sequences entering protein databanks, prediction of protein subcellular locations has become increasingly important. In this article, the GO-discrete model was introduced to represent the sample of a protein. On the basis of such a frame of representation, the ensemble classifier Gpos-PLoc was developed for predicting the subcellular location of Gram-positive bacterial proteins. Gpos-PLoc was formed by fusing many basic classifiers, each engineered by the OET-KNN rule. The high success rates indicate that proteins, if represented through the GO-discrete model, can be more distinctly clustered according to their different subcellular locations, and that the ensemble classifier presented here is indeed a powerful operator in distinguishing these clusters. Gpos-PLoc is freely available to the public as a web-server.
| Appendix 1: The optimized evidence-theoretic K-nearest neighbors (OET-KNN) classifier |
|---|
|
|
|---|
For reader's convenience, a brief introduction of the OET-KNN classifier is given below. For further explanation, refer to Shen and Chou (2005b)
|
| A1 |
|
| A2 |
i(i = 1, 2, ... , N) take values in F of equation (A1). According to the KNN (K-nearest neighbors) rule (Cover and Hart, 1967
The ET-KNN (evidence-theoretic K-nearest neighbors) rule is a pattern classification method based on the Dempster–Shafer theory of belief functions (Denoeux, 1995). In the classification process, each neighbor of a pattern to be classified is considered as an item of evidence supporting certain hypotheses concerning the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset concerned. Such masses are obtained for each of the K-nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination (Shafer, 1976). A decision is made by assigning a pattern to the class with the maximum credibility.
Suppose P is a query protein to be classified, and SKP is the set of its K-nearest neighbors in the training dataset of equation (A2). Thus, for any Pi ∈ SKP, the knowledge that Pi belongs to the µ-th class can be considered as a piece of evidence that increases our belief that P also belongs to that class. According to the basic belief assignment mapping theory (Shafer, 1976), this item of evidence can be formulated by
|
| A3 |
where α0 is a fixed parameter, γµ is a parameter associated with the µ-th class, and D²(Pi, P) is the squared Euclidean distance between P and Pi. In the ET-KNN rule, it was not addressed how to optimally select the parameters. In 1998, an optimization procedure was proposed to determine the optimal or near-optimal parameter values from the data by minimizing an error function (Zouhal and Denoeux, 1998).
The belief function of P belonging to the µ-th class is a combination of the evidence from its K-nearest neighbors, and can be formulated as
|
| A4 |
where ⊕ is called the orthogonal sum, which is commutative and associative, and hence equation (A4) can be expressed as
|
| A5 |
where the operator in equation (A5) represents the orthogonal sum from i = 1 to K. According to Dempster's rule (Shafer, 1976),
|
| A6 |
where ⊆, ∩ and Ø are the symbols in set theory representing 'contained in', intersection and the empty set, respectively.
A decision is made by assigning the query protein P to the class with which the belief or credibility function of equation (A6) has the maximum value; i.e. if
|
| A7 |
µ is the class predicted for the query protein P.
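The evidence-theoretic rule outlined in this appendix can be illustrated with a compact sketch. The code below follows the general ET-KNN scheme of Denoeux (1995): each neighbor contributes a basic belief assignment that supports its own class and leaves the remainder on the whole frame, and the assignments are merged with Dempster's rule. It is not the optimized (OET-KNN) variant: a single `gamma` is shared by all classes, and the values `alpha0 = 0.95` and `gamma = 1.0` as well as all function names are assumptions made for illustration.

```python
import math

FRAME = "ALL"  # key for the mass left on the whole frame (must not be a class label)


def neighbour_mass(label, sq_dist, classes, alpha0=0.95, gamma=1.0):
    """Basic belief assignment induced by one neighbour: mass
    alpha0 * exp(-gamma * sq_dist) on the neighbour's class, the rest on the frame."""
    support = alpha0 * math.exp(-gamma * sq_dist)
    mass = {c: 0.0 for c in classes}
    mass[label] = support
    mass[FRAME] = 1.0 - support
    return mass


def dempster_combine(m1, m2, classes):
    """Dempster's rule for masses whose focal sets are singletons and the frame."""
    combined = {}
    for c in classes:
        # Non-empty intersections that equal {c}: ({c},{c}), ({c},frame), (frame,{c}).
        combined[c] = m1[c] * m2[c] + m1[c] * m2[FRAME] + m1[FRAME] * m2[c]
    combined[FRAME] = m1[FRAME] * m2[FRAME]
    norm = sum(combined.values())  # equals 1 minus the conflicting mass
    return {k: v / norm for k, v in combined.items()}


def et_knn_predict(query, training, classes, k=3, alpha0=0.95, gamma=1.0):
    """Return (predicted_class, combined_masses) for a query feature vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbours = sorted(training, key=lambda s: sq_dist(query, s[0]))[:k]
    belief = {c: 0.0 for c in classes}
    belief[FRAME] = 1.0  # vacuous assignment: neutral element of the orthogonal sum
    for features, label in neighbours:
        m = neighbour_mass(label, sq_dist(query, features), classes, alpha0, gamma)
        belief = dempster_combine(belief, m, classes)
    return max(classes, key=lambda c: belief[c]), belief


# Toy training set with two classes.
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((1.0, 1.0), "B"), ((1.1, 1.0), "B")]
label, belief = et_knn_predict((0.05, 0.05), train, classes=["A", "B"], k=3)
print(label, round(sum(belief.values()), 6))
```

Because every focal set here is either a single class or the whole frame, the credibility of a class equals its combined mass, so picking the largest mass corresponds to the maximum-credibility decision described above.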
| Footnotes |
|---|
Edited by Michael Deem
| References |
|---|
|
|
|---|
Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., et al. (2004) Nucleic Acids Res. 32:D115–D119.
Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. (2000) Nat. Genet. 25:25–29.
Bairoch A. and Apweiler R. (2000) Nucleic Acids Res. 25:31–36.
Cai Y.D., Zhou G.P., Chou K.C. (2003) Biophys. J. 84:3257–3263.
Cedano J., Aloy P., Pérez-Pons J.A., Querol E. (1997) J. Mol. Biol. 266:594–600.
Chou J.J. and Zhang C.T. (1993) J. Theor. Biol. 161:251–262.
Chou K.C. (1995) Proteins: Struct. Funct. Genet. 21:319–344.
Chou K.C. (2001) Proteins: Struct. Funct. Genet. 43:246–255 (Erratum: ibid., 2001, 44, 60).
Chou K.C. (2004) Curr. Med. Chem. 11:2105–2134.
Chou K.C. (2005) Bioinformatics 21:10–19.
Chou K.C. and Cai Y.D. (2002) J. Biol. Chem. 277:45765–45769.
Chou K.C. and Cai Y.D. (2003a) Biochem. Biophys. Res. Commun. 311:743–747.
Chou K.C. and Cai Y.D. (2003b) J. Cell. Biochem. 90:1250–1260 (Addendum: ibid., 2004, 91, 1085).
Chou K.C. and Cai Y.D. (2004a) Biochem. Biophys. Res. Commun. 321:1007–1009 (Corrigendum: ibid., 2005, 329, 1362).
Chou K.C. and Cai Y.D. (2004b) Biochem. Biophys. Res. Commun. 320:1236–1239.
Chou K.C. and Cai Y.D. (2005) J. Chem. Inform. Model. 45:407–413.
Chou K.C. and Cai Y.D. (2006) Biochem. Biophys. Res. Commun. 339:1015–1020.
Chou K.C. and Elrod D.W. (1999) Protein Eng. 12:107–118.
Chou K.C. and Shen H.B. (2006) J. Proteome Res. 5:1888–1897.
Chou K.C. and Zhang C.T. (1994) J. Biol. Chem. 269:22014–22020.
Chou K.C. and Zhang C.T. (1995) Crit. Rev. Biochem. Mol. Biol. 30:275–349.
Cover T.M. and Hart P.E. (1967) IEEE Trans. Inform. Theory IT-13:21–27.
Denoeux T. (1995) IEEE Trans. Syst. Man Cybern. 25:804–813.
Feng Z.P. (2001) Biopolymers 58:491–499.
Feng Z.P. (2002) In Silico Biol. 2:291–303.
Gardy J.L., Spencer C., Wang K., Ester M., Tusnady G.E., Simon I., Hua S., deFays K., Lambert C., Nakai K., et al. (2003) Nucleic Acids Res. 31:3613–3617.
Guo Y.Z., Li M., Lu M., Wen Z., Wang K., Li G., Wu J. (2006) Amino Acids 30:397–402.
Keller J.M., Gray M.R., Givens J.A. (1985) IEEE Trans. Syst. Man Cybern. 15:580–585.
Liu H., Yang J., Ling J.G., Chou K.C. (2005) Biochem. Biophys. Res. Commun. 338:1005–1011.
Lubec G., Afjehi-Sadat L., Yang J.W., John J.P. (2005) Prog. Neurobiol. 77:90–127.
Luo R.Y., Feng Z.P., Liu J.K. (2002) Eur. J. Biochem. 269:4219–4225.
Mahalanobis P.C. (1936) Proc. Natl. Inst. Sci. India 2:49–55.
Mardia K.V., Kent J.T., Bibby J.M. (1979) Multivariate Analysis (Academic Press, London) (Chapter 11 Discriminant analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 Cluster analysis; pp. 322–381).
Nakai K. (2000) Adv. Protein Chem. 54:277–344.
Nakai K. and Horton P. (1999) Trends Biochem. Sci. 24:34–36.
Nakai K. and Kanehisa M. (1991) Proteins: Struct. Funct. Genet. 11:95–110.
Nakashima H. and Nishikawa K. (1994) J. Mol. Biol. 238:54–61.
Pillai K.C.S. (1985) In Kotz S. and Johnson N.L. (Eds.), Encyclopedia of Statistical Sciences, Vol. 5 (John Wiley & Sons, New York), pp. 176–181 (This reference also presents a brief biography of Mahalanobis, who was a man of great originality and who made considerable contributions to statistics).
Shafer G. (1976) A Mathematical Theory of Evidence (Princeton University Press, Princeton, NJ).
Shen H.B. and Chou K.C. (2005a) Biochem. Biophys. Res. Commun. 337:752–756.
Shen H.B. and Chou K.C. (2005b) Biochem. Biophys. Res. Commun. 334:288–292.
Sun X.D. and Huang R.B. (2006) Amino Acids 30:469–475.
Wang G.L. and Dunbrack R.L. Jr. (2003) Bioinformatics 19:1589–1591.
Wang M., Yang J., Xu Z.J., Chou K.C. (2005) J. Theor. Biol. 232:7–15.
Wen Z., Li M., Li Y., Guo Y., Wang K. (2006) Amino Acids doi:10.1007/S00726-006-0341-y.
Xiao X., Shao S., Ding Y., Huang Z., Huang Y., Chou K.C. (2005) Amino Acids 28:57–61.
Xiao X., Shao S.H., Huang Z.D., Chou K.C. (2006) J. Comput. Chem. 27:478–482.
Zhang S.W., Pan Q., Zhang H.C., Shao Z.C., Shi J.Y. (2006) Amino Acids 30:461–468.
Zhou G.P. (1998) J. Protein Chem. 17:729–738.
Zhou G.P. and Assa-Munt N. (2001) Proteins: Struct. Funct. Genet. 44:57–59.
Zhou G.P. and Doctor K. (2003) Proteins: Struct. Funct. Genet. 50:44–48.
Zouhal L.M. and Denoeux T. (1998) IEEE Trans. Syst. Man Cybern. 28:263–271.
Zuber B., Haenni M., Ribeiro T., Minnig K., Lopes F., Moreillon P., Dubochet J. (2006) J. Bacteriol. 188:6652–6660.
Received October 11, 2006; revised November 20, 2006; accepted November 22, 2006.