PEDS Advance Access originally published online on October 10, 2006
Protein Engineering Design and Selection 2006 19(11):511-516; doi:10.1093/protein/gzl038
Classification of G-protein coupled receptors at four levels
Institute of Automation, National University of Defense Technology, Changsha 410073, Hunan, People's Republic of China
1To whom correspondence should be addressed. E-mail: qbgao{at}nudt.edu.cn
| Abstract |
|---|
G-protein coupled receptors (GPCRs) are transmembrane proteins that, via G-proteins, initiate some of the important signaling pathways in a cell and are involved in various physiological processes. Computational prediction and classification of GPCRs can therefore supply significant information for the development of novel drugs in the pharmaceutical industry. In this paper, a nearest neighbor method is introduced to discriminate GPCRs from non-GPCRs and subsequently to classify GPCRs at four levels on the basis of the amino acid composition and dipeptide composition of proteins. Its performance is evaluated with the jackknife test on a non-redundant dataset consisting of 1406 GPCRs from six families and 1406 globular proteins. The method based on amino acid composition achieved an overall accuracy of 96.4% and a Matthews correlation coefficient (MCC) of 0.930 in picking out the GPCRs from globular proteins. The overall accuracy and MCC were further improved to 99.8% and 0.996 by the dipeptide composition-based method. The present method classified the 1406 GPCRs into six families with an overall accuracy of 89.6% and 98.8% using amino acid composition and dipeptide composition, respectively. For the subfamily prediction of the 1181 GPCRs of the rhodopsin-like family, it achieved an overall accuracy of 76.7% and 94.5% based on amino acid composition and dipeptide composition, respectively. Finally, GPCRs belonging to the amine and olfactory subfamilies of the rhodopsin-like family were further analyzed at the type level. The overall accuracy of the dipeptide composition-based method for the classification of amine-type and olfactory-type GPCRs reached 94.5% and 86.9%, respectively, while the overall accuracy of the amino acid composition-based method was very low for both subfamilies. The present method also compared favorably with existing methods in the literature.
These results demonstrate the effectiveness of our method in identifying and classifying GPCRs correctly. GPCRsIdentifier, a corresponding stand-alone executable program for GPCR identification and classification, was also developed and can be obtained freely on request from the authors for academic purposes.
| Introduction |
|---|
G-protein coupled receptors (GPCRs) play an extremely important role in transducing extracellular signals across the cell membrane via guanine nucleotide-binding proteins (G-proteins) with high specificity and sensitivity. A good example is the coupling interaction between the thromboxane A2 receptor and the alpha-13 subunit of the guanine nucleotide-binding protein, described in detail in the literature (Chou, 2005a).
Many methods have been proposed for the prediction of GPCRs in the past few years. One commonly used method is sequence similarity searching in protein databases with pairwise alignment tools such as BLAST and FASTA (Altschul et al., 1997; Pearson, 2000). Several pattern databases have also been constructed (Lapinsh et al., 2002; Sadowski and Parish, 2003). However, these methods are not always successful when the query protein sequence has no significant sequence similarity to the database sequences. Moreover, in the case of GPCRs, the function-similarity relationship is still unclear (Yabuki et al., 2005). Therefore, some statistical and machine learning methods have been proposed, including the statistical analysis method (Chou and Elrod, 2002), the covariant discriminant algorithm (Elrod and Chou, 2002; Chou, 2005b), support vector machines (SVMs) (Karchin et al., 2002; Bhasin and Raghava, 2004, 2005; Guo et al., 2005), hidden Markov models (HMMs) (Qian et al., 2003; Papasaikas et al., 2004) and the bagging classification tree (Huang et al., 2004). There are other methods, such as the binary topology pattern (Inoue et al., 2004). Although most of these methods have achieved a high overall accuracy at the superfamily, family, subfamily or type level, none of them considers the prediction of GPCRs at all four levels. In general, a four-level prediction of GPCRs is suitable for understanding their biological function in a cell comprehensively. In addition, many novel receptors have entered the GPCRDB, and they are expected to improve the recognition performance for GPCRs significantly. These considerations motivated us to develop a novel method to deal with this situation.
In this paper, we introduce a nearest neighbor method for the identification and classification of GPCRs at four levels using a stepwise procedure. First, the method is used to identify GPCRs from non-GPCRs; in our work the non-GPCRs are globular proteins. Then, the GPCRs are classified into six major families. Next, the GPCRs of the rhodopsin-like family are further divided at the subfamily level. Finally, GPCRs falling into the amine and olfactory subfamilies of the rhodopsin-like family are analyzed at the type level. In this study, the amino acid composition and dipeptide composition of proteins are used to represent protein sequences. The dipeptide composition has previously been used to predict the content of protein secondary structures (Chou, 1999; Liu and Chou, 1999). The prediction performance was evaluated with the jackknife test on a non-redundant dataset consisting of 1406 GPCRs and 1406 globular proteins. A high overall accuracy was achieved in each step using the dipeptide-based method. Moreover, comparisons with existing methods in the literature show that the present method achieves highly competitive performance.
| Materials and methods |
|---|
Dataset
To exclude homology bias, the proposed method was developed on a new non-redundant dataset. The original dataset built in our work comprises 5305 GPCRs from six families and 2466 globular proteins. All the GPCRs were extracted from the GPCRDB information system (release 9.0, March 2005) (Horn et al., 2003). The globular proteins were extracted from the PDB90D_1.37 database of SCOP (Murzin et al., 1995; Berman et al., 2000), which has been used by some previous methods (Karchin et al., 2002; Bhasin and Raghava, 2004, 2005). All putative receptors, orphan receptors and fragments of receptors were excluded from the dataset. Furthermore, to reduce the bias, a redundancy reduction procedure was performed on the original dataset. Sequences with a high degree of similarity to other sequences in the dataset were removed with the program CD-HIT (Li et al., 2001, 2002). This program clusters a protein sequence database at a high sequence identity threshold and can remove high sequence redundancy efficiently. We grouped all protein sequences with CD-HIT at a cluster identity threshold of 0.7 to ensure that no sequence had >70% sequence identity to any other sequence in the dataset. After this screening procedure, the resulting dataset contained 1406 GPCRs and 2046 globular proteins. We selected only 1406 of the globular proteins for the subsequent discrimination analysis, so that the numbers of positive and negative prototypes in the final dataset were equal.
Nearest neighbor algorithm
The nearest neighbor algorithm is a simple yet effective method for performing general, non-parametric classification (Cover and Hart, 1967). It has been used to predict protein secondary structure (Yi and Lander, 1993), protein β-turns (Kim, 2004), enzyme family class (Chou and Cai, 2004a,b), protein subcellular localization (Chou and Cai, 2003; Cai and Chou, 2004; Gao et al., 2005), membrane protein type (Chou and Cai, 2005) and protein-protein interaction (Chou and Cai, 2006). In view of its low probability of error, we apply the nearest neighbor algorithm to the task of identifying and classifying GPCRs.
The basic idea of the nearest neighbor algorithm is described briefly as follows (Duda et al., 2000). Let Dn = {x1,...,xn} denote a set of n labeled prototypes and let x' ∈ Dn be the prototype nearest to a test point x. Then the nearest neighbor rule for classifying x is to assign it the label associated with x'. The nearest neighbor rule is a suboptimal procedure and leads to an error rate greater than the minimum possible, the Bayes rate. However, with an unlimited number of prototypes the error rate is never worse than twice the Bayes rate. To accomplish this task, we define a similarity measurement based on Euclidean distance by
[Equation (1): image not reproduced]
[Equation (2): image not reproduced]
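The nearest neighbor rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact similarity measure of Equations (1) and (2) is not available here, so plain Euclidean distance is used, and the function names `euclidean` and `nearest_neighbor_label`, the tie-breaking rule and the toy prototypes are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor_label(x, prototypes):
    """Return the label of the prototype closest to the test point x.

    prototypes: iterable of (feature_vector, label) pairs.
    Assumption: ties are broken in favour of the earliest prototype.
    """
    best_label, best_dist = None, math.inf
    for vec, label in prototypes:
        d = euclidean(x, vec)
        if d < best_dist:
            best_label, best_dist = label, d
    if best_label is None:
        raise ValueError("no prototypes supplied")
    return best_label


# Toy 2D prototypes (made up for illustration; real vectors are 20D or 400D).
prototypes = [([0.0, 0.0], "non-GPCR"), ([1.0, 1.0], "GPCR")]
print(nearest_neighbor_label([0.9, 0.8], prototypes))
```

Because the rule only compares distances, any monotone transformation of the Euclidean distance into a similarity score would yield the same nearest prototype.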
Protein representation
Amino acid composition: Amino acid composition represents the occurrence frequency of each of the 20 natural amino acids in a protein sequence and corresponds to a 20-dimensional (20D) feature vector. The occurrence frequency of amino acid i is calculated using the following equation:
[Equation (3): image not reproduced]
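As a rough illustration of the 20D representation, the sketch below computes the fraction of each standard amino acid in a sequence. The normalization (count of amino acid i divided by the total number of standard residues) and the decision to skip non-standard symbols are assumptions, since Equation (3) itself is not available here; `amino_acid_composition` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def amino_acid_composition(sequence):
    """Return a 20D list of amino acid frequencies for a protein sequence.

    Assumption: symbols outside the 20 standard codes (e.g. X, B, Z)
    are ignored and do not contribute to the normalizing total.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


vector = amino_acid_composition("AACD")
print(len(vector), vector[:3])
```

The components are ordered alphabetically by one-letter code, so every protein maps to a vector of the same length regardless of sequence length.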
Dipeptide composition: Dipeptide composition represents the occurrence frequency of all consecutive amino acid pairs in a protein sequence and corresponds to a 400D feature vector. It encapsulates information about the composition of amino acids as well as their local order, and is thus a better way to characterize protein features than amino acid composition. The occurrence frequency of amino acid pair i is given by the following equation:
[Equation (4): image not reproduced]
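The 400D dipeptide representation can be sketched in the same way. The version below counts overlapping consecutive pairs and divides by the total number of counted pairs; this normalization and the skipping of pairs that contain non-standard symbols are assumptions, because Equation (4) is not available here. `dipeptide_composition` and `DIPEPTIDES` are hypothetical names.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, in a fixed order: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400D list of dipeptide frequencies for a protein sequence.

    Assumption: pairs containing a non-standard symbol are ignored.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]  # overlapping window of two residues
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


vector = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
print(len(vector), vector[DIPEPTIDES.index("AC")])
```

Unlike the 20D vector, this representation distinguishes "AC" from "CA", which is how it captures local order.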
Performance measurement
To measure the performance of our method, the jackknife test is performed on the dataset. In the jackknife test, each protein sequence in the dataset is singled out in turn as a test sample and the remaining protein sequences are used as a training dataset to predict the label of the test sample. This process is thus repeated N times for a dataset of N proteins. Compared with other cross-validation methods, such as the sub-sampling test and the independent dataset test, the jackknife test is considered the most effective and the more rigorous and reliable (Mardia et al., 1979). A comprehensive analysis is provided in the literature (Chou and Zhang, 1995). The jackknife test has been used to measure the performance of various predictors (Zhou, 1998; Yuan, 1999; Cai, 2001; Feng, 2001; Hua and Sun, 2001; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003; Xiao et al., 2005, 2006; Guo et al., 2006; Sun and Huang, 2006; Zhou and Cai, 2006). Moreover, five measures, namely sensitivity, specificity, accuracy, overall accuracy and the Matthews correlation coefficient (MCC) (Matthews, 1975), are used to assess the performance of the present method. Sensitivity and specificity are used only to estimate the performance of the method in discriminating GPCRs from globular proteins, together with the overall accuracy and MCC. For the classification of GPCRs at the family, subfamily and type levels, the accuracy, overall accuracy and MCC are used. These measures are defined as in Hua and Sun (2001):
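[Equation (5): image not reproduced]
[Equation (6): image not reproduced]
[Equation (7): image not reproduced]
[Equation (8): image not reproduced]
[Equation (9): image not reproduced]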
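For the two-class case (GPCR versus globular protein), the measures named above can be illustrated with their common textbook definitions computed from confusion-matrix counts. This is a sketch only: Equations (5) to (9) are not available here and may be written differently (for example, per class for the multi-class steps), and the function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Textbook two-class measures from confusion-matrix counts.

    tp/fn: positives predicted correctly/incorrectly;
    tn/fp: negatives predicted correctly/incorrectly.
    Assumption: a measure with a zero denominator is reported as 0.0.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    overall_accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "overall_accuracy": overall_accuracy,
        "mcc": mcc,
    }


# Made-up counts, purely to show the call.
print(binary_metrics(tp=45, tn=40, fp=10, fn=5))
```

MCC ranges from -1 to 1 and, unlike overall accuracy, stays informative when the two classes differ greatly in size.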
| Results and discussion |
|---|
Our work aims to predict GPCRs at the superfamily, family, subfamily and type levels in turn, following a four-step strategy. Proteins are encoded by their amino acid composition and dipeptide composition, so that each protein sequence is represented by a fixed-length feature vector. The prediction results obtained in each step are outlined below.
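The four-step strategy amounts to a cascade in which each level is consulted only if the previous one allows it. The sketch below shows that control flow with placeholder classifiers; the function name `classify_four_levels`, its arguments, and the stub labels are hypothetical and do not reflect the interface of the authors' program.

```python
def classify_four_levels(x, is_gpcr, family_of, subfamily_of, type_classifiers):
    """Run the stepwise prediction for one feature vector x.

    is_gpcr, family_of and subfamily_of are callables taking x.
    type_classifiers maps a subfamily name (e.g. "amine", "olfactory")
    to a callable; subfamilies without an entry get no type prediction.
    """
    result = {"superfamily": None, "family": None, "subfamily": None, "type": None}

    # Step 1: GPCR versus non-GPCR.
    result["superfamily"] = "GPCR" if is_gpcr(x) else "non-GPCR"
    if result["superfamily"] != "GPCR":
        return result

    # Step 2: family level.
    result["family"] = family_of(x)
    if result["family"] != "rhodopsin-like":
        return result

    # Step 3: subfamily level (rhodopsin-like family only).
    result["subfamily"] = subfamily_of(x)

    # Step 4: type level (only for subfamilies with a type classifier).
    type_classifier = type_classifiers.get(result["subfamily"])
    if type_classifier is not None:
        result["type"] = type_classifier(x)
    return result


# Stub classifiers standing in for trained nearest neighbor models.
stub_types = {"amine": lambda x: "type-1"}
print(classify_four_levels([0.1], lambda x: True,
                           lambda x: "rhodopsin-like",
                           lambda x: "amine", stub_types))
```

In practice each callable would wrap a nearest neighbor search over the prototypes of the corresponding level.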
Identification of GPCRs from globular proteins
The performance of our method in identifying the 1406 GPCRs from the 1406 globular proteins is shown in Table I. All the results were obtained using the jackknife test. Table I shows that the MCC and overall accuracy of the amino acid composition-based method reached 0.930 and 96.4%, respectively, while those of the dipeptide composition-based method reached 0.996 and 99.8%, respectively. This indicates that both the amino acid composition and the dipeptide composition encapsulate sufficient information to distinguish GPCRs from globular proteins with very high accuracy.
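The jackknife (leave-one-out) procedure behind these figures can be sketched as follows. The code is illustrative only: `jackknife_accuracy` and `one_nn` are hypothetical names, the distance is plain Euclidean, and the four toy samples are made up.

```python
import math


def one_nn(x, training):
    """Label of the training prototype nearest to x (Euclidean distance)."""
    return min(training, key=lambda pair: math.dist(x, pair[0]))[1]


def jackknife_accuracy(samples, classify):
    """Leave-one-out estimate of overall accuracy.

    samples: list of (feature_vector, label) pairs.
    classify: callable (x, training_samples) -> predicted label.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    correct = 0
    for i, (x, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # hold out sample i
        if classify(x, training) == label:
            correct += 1
    return correct / len(samples)


toy = [([0.0, 0.0], "a"), ([0.0, 1.0], "a"), ([5.0, 5.0], "b"), ([5.0, 6.0], "b")]
print(jackknife_accuracy(toy, one_nn))
```

For a nearest neighbor classifier no model is refitted between rounds; holding a sample out simply removes it from the prototype set.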
Classification of GPCRs at family level
The 1406 GPCRs belonging to six families were used to assess the performance of the present method for the classification of GPCRs at the family level. The number of proteins in each family is shown in Table II. The performance was evaluated with the jackknife test. Table II shows that the overall accuracy of the amino acid composition-based method was 89.6%, while that of the dipeptide composition-based method reached 98.8%. This illustrates that both amino acid composition and dipeptide composition can be used to classify GPCRs at the family level with high accuracy.
Classification of GPCRs at subfamily level
The rhodopsin-like family of GPCRs is the largest class in the GPCRDB information system. After the redundancy reduction procedure, 1181 of its proteins remained. We used the present method to classify these proteins into 15 subfamilies. Table III shows the number of proteins in each subfamily and the prediction performance obtained using the jackknife test. The overall accuracy based on amino acid composition was 76.7%, which indicates that amino acid composition alone is not good enough to characterize the GPCRs of the rhodopsin-like family. However, a high overall accuracy of 94.5% was achieved by the dipeptide composition-based method. Furthermore, lower performance was observed for the small groups. This suggests that better performance could be reached once a larger and more complete dataset is established.
Classification of GPCRs at type level
For the classification of GPCRs at the type level, we considered only proteins from the amine subfamily and the olfactory subfamily of the rhodopsin-like family. The other subfamilies, which have fewer prototypes in their subclasses, will be investigated in the future when more proteins are available to build a larger dataset. We obtained 165 proteins for the amine subfamily and 551 proteins for the olfactory subfamily, and used them to measure the performance of the present method for the classification of GPCRs at the type level. The number of proteins in each subclass and the prediction performance are shown in Tables IV and V, respectively. As the prediction performance of the amino acid composition-based method was very poor, we report only the results achieved by the dipeptide composition-based method. Tables IV and V show that the overall accuracy for the classification of the 165 GPCRs of the amine subfamily reached 94.5%, while the overall accuracy for the classification of the 551 GPCRs of the olfactory subfamily was 86.9%. The present method therefore reaches high accuracy for the classification of GPCRs at the type level.
The MCC reflects both the sensitivity and the specificity of the prediction method. From Tables I-V we can see that most values of MCC are close to 1, which shows that the proposed method has high sensitivity and specificity for GPCR classification. All the results mentioned above indicate that the performance of the dipeptide composition-based method is much superior to that of the amino acid composition-based method. This illustrates that dipeptide composition is better than amino acid composition at capturing information about protein features: it contains information about the composition of amino acids as well as their local order. Hence, we may conclude that the local sequence order effect of GPCRs makes a significant contribution to their biological function in a cell.
Comparison with other methods
In order to verify the performance of our method, we made comparisons with other methods. The dataset constructed by Karchin et al. (2002) has been tested with SVM methods for GPCR recognition (Karchin et al., 2002; Bhasin and Raghava, 2004). This dataset contains 778 GPCRs from five families, of which 692 are from the rhodopsin-like family, 56 from the secretin-like family, 16 from the metabotropic glutamate family, 11 from the fungal pheromone family and 3 from the cAMP-like family. We compared our method with the SVM method proposed by Bhasin and Raghava (2004), since they reported better performance than Karchin et al. (2002). The comparison in classifying GPCRs at the family level is shown in Table VI. Table VI shows that the SVM method achieved an overall accuracy of 97.3%, while the present method achieved an overall accuracy of 99.2%. This indicates that our method performs better than the SVM-based method on this dataset.
On the other hand, the dataset from the study of Elrod and Chou (2002)
Software availability
We have developed a PC/Windows program, GPCRsIdentifier, based on the present method for the identification and classification of GPCRs at four levels. The program is freely available on request from the authors for scientific purposes, together with the accompanying manual and sample datasets.
| Conclusion |
|---|
In this paper, we have introduced a nearest neighbor method for identifying and classifying GPCRs with a four-step strategy. Amino acid composition and dipeptide composition are each used to represent protein sequences. The method is evaluated on a new dataset consisting of 1406 GPCRs and 1406 globular proteins, and high prediction accuracy has been achieved in the jackknife test. Compared with the existing methods, our method provides superior performance. For convenience in practical applications, we have developed a stand-alone executable program, GPCRsIdentifier, based on the proposed method. This program is freely available on request from the authors. The most attractive advantages of this method are its simplicity and efficiency. Researchers are often perplexed by machine learning methods such as SVMs because of the problems of kernel selection and parameter determination, and there is at present no mature theory to tackle these problems. By contrast, no parameters need to be determined for the nearest neighbor method. The present method should therefore be a useful tool for the identification and classification of GPCRs.
| Acknowledgements |
|---|
This work was partly supported by the National Natural Science Foundation of China (No. 60471003). The authors are very grateful to the anonymous referees, whose helpful comments improved the quality of this paper.
| References |
|---|
Altschul S.F., Madden T.L., Schaffer A.A., Zhang Z., Miller W., Lipman D.J. (1997) Nucleic Acids Res. 25:3389-3402.
Baldwin J.M. (1994) Curr. Opin. Cell Biol. 6:180-190.
Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. (2000) Nucleic Acids Res. 28:235-242.
Bhasin M. and Raghava G.P.S. (2004) Nucleic Acids Res. 32:W383-W389.
Bhasin M. and Raghava G.P.S. (2005) Nucleic Acids Res. 33:W143-W147.
Cai Y.D. (2001) Proteins 43:336-338.
Cai Y.D. and Chou K.C. (2004) Bioinformatics 20:1151-1156.
Chou K.C. (1999) J. Protein Chem. 18:473-480.
Chou K.C. (2005a) J. Proteome Res. 4:1681-1686.
Chou K.C. (2005b) J. Proteome Res. 4:1413-1418.
Chou K.C. and Cai Y.D. (2003) Biochem. Biophys. Res. Comm. 311:743-747.
Chou K.C. and Cai Y.D. (2004a) Biochem. Biophys. Res. Comm. 325:506-509.
Chou K.C. and Cai Y.D. (2004b) Protein Sci. 13:2857-2863.
Chou K.C. and Cai Y.D. (2005) Biochem. Biophys. Res. Comm. 327:845-847.
Chou K.C. and Cai Y.D. (2006) J. Proteome Res. 5:316-322.
Chou K.C. and Elrod D.W. (2002) J. Proteome Res. 1:429-433.
Chou K.C. and Zhang C.T. (1995) Crit. Rev. Biochem. Mol. Biol. 30:275-349.
Cover T.M. and Hart P.E. (1967) IEEE Trans. Inform. Theory IT-13:21-27.
Duda R.O., Hart P.E., Stork D.G. (2000) Pattern Classification, 2nd edn (Wiley, New York).
Elrod D.W. and Chou K.C. (2002) Protein Eng. 15:713-715.
Feng Z.P. (2001) Biopolymers 58:491-499.
Gao Q.B., Wang Z.Z., Yan C., Du Y.H. (2005) FEBS Lett. 579:3444-3448.
Guo Y.Z., Li M.L., Wang K.L., Wen Z.N., Lu M.C., Liu L.X., Jiang L. (2005) Acta Biochim. Biophys. Sin. 37:759-766.
Guo Y.Z., Li M., Lu M., Wen Z., Wang K., Li G., Wu J. (2006) Amino Acids 30:397-402.
Horn F., Bettler E., Oliveira L., Campagne F., Cohen F.E., Vriend G. (2003) Nucleic Acids Res. 31:294-297.
Hua S. and Sun Z. (2001) Bioinformatics 17:721-728.
Huang Y., Cai J., Ji L., Li Y. (2004) Comput. Biol. Chem. 28:275-280.
Inoue Y., Ikeda M., Shimizu T. (2004) Comput. Biol. Chem. 28:39-49.
Karchin R., Karplus K., Haussler D. (2002) Bioinformatics 18:147-159.
Kim S. (2004) Bioinformatics 20:40-44.
Lapinsh M., Gutcaits A., Prusis P., Post C., Lundstedt T., Wikberg J.E.S. (2002) Protein Sci. 11:795-805.
Li W., Jaroszewski L., Godzik A. (2001) Bioinformatics 17:282-283.
Li W., Jaroszewski L., Godzik A. (2002) Bioinformatics 18:77-82.
Liu W. and Chou K.C. (1999) Protein Eng. 12:1041-1050.
Mardia K.V., Kent J.T., Bibby J.M. (1979) Multivariate Analysis (Academic Press, London).
Matthews B.W. (1975) Biochim. Biophys. Acta 405:442-451.
Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995) J. Mol. Biol. 247:536-540.
Papasaikas P.K., Bagos P.G., Litou Z.I., Promponas V.J., Hamodrakas S.J. (2004) Nucleic Acids Res. 32:W380-W382.
Pearson W.R. (2000) Methods Mol. Biol. 132:185-219.
Qian B., Soyer O.S., Neubig R.R., Goldstein R.A. (2003) FEBS Lett. 554:95-99.
Sadowski M.I. and Parish J.H. (2003) Bioinformatics 19:727-734.
Spiegel A.M., Shenker A., Weinstein L.S. (1992) Endocr. Rev. 13:536-565.
Strader C.D., Fong T.M., Tota M.R., Underwood D. (1994) Annu. Rev. Biochem. 63:101-132.
Sun X.D. and Huang R.B. (2006) Amino Acids 30:469-475.
Teller D.C., Okada T., Behnke C.A., Palczewski K., Stenkamp R.E. (2001) Biochemistry 40:7761-7772.
Vaidehi N., Floriano W.B., Trabanino R., Hall S.E., Freddolino P., Choi E.J., Zamanakos G., Goddard W.A. III. (2002) Proc. Natl Acad. Sci. USA 99:12622-12627.
Xiao X., Shao S., Ding Y., Huang Z., Huang Y., Chou K.C. (2005) Amino Acids 28:57-61.
Xiao X., Shao S., Ding Y., Huang Z., Huang Y., Chou K.C. (2006) Amino Acids 30:49-54.
Yabuki Y., Muramatsu T., Hirokawa T., Mukai H., Suwa M. (2005) Nucleic Acids Res. 33:W148-W151.
Yi T.M. and Lander E.S. (1993) J. Mol. Biol. 232:1117-1129.
Yuan Z. (1999) FEBS Lett. 451:23-26.
Zhang S.W., Pan Q., Zhang H.C., Shao Z.C., Shi J.Y. (2006) Amino Acids 30:461-468.
Zhou G.P. (1998) J. Protein Chem. 17:729-738.
Zhou G.P. and Assa-Munt N. (2001) Proteins 44:57-59.
Zhou G.P. and Cai Y.D. (2006) Proteins Struct. Funct. Bioinf. 63:681-684.
Zhou G.P. and Doctor K. (2003) Proteins 50:44-48.
Received June 17, 2006; revised July 12, 2006; accepted August 19, 2006.