PEDS Advance Access originally published online on August 16, 2004
Protein Engineering Design and Selection 2004 17(6):509-516; doi:10.1093/protein/gzh061
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition
1Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, 3Bioinformatics Research Centre, Donghau University, Shanghai 200050, 4Department of Biomedical Engineering, Shanghai Jiaotong University, Shanghai 200030, 5Tianjin Institute of Bioinformatics and Drug Discovery, Tianjin, China and 6Gordon Life Science Institute, San Diego, CA 92130, USA
2 To whom correspondence should be addressed. E-mail: mengking{at}sjtu.edu.cn; kchou{at}san.rr.com
| Abstract |
|---|
|
|
|---|
Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins. Prediction of membrane protein types has become one of the growing hot topics in bioinformatics. Currently, we are facing two critical challenges in this area: first, how to take into account the extremely complicated sequence-order effects, and second, how to deal with the highly uneven sizes of the subsets in a training dataset. In this paper, stimulated by the concept of using the pseudo-amino acid composition to incorporate the sequence-order effects, the spectral analysis technique is introduced to represent the statistical sample of a protein. Based on such a framework, the weighted support vector machine (SVM) algorithm is applied. The new approach has remarkable power in dealing with the bias caused by the situation when one subset in the training dataset contains many more samples than the other. The new method is particularly useful when our focus is aimed at proteins belonging to small subsets. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the current approach may serve as a powerful complementary tool to other existing methods for predicting the types of membrane proteins.
Keywords: Chou's invariance theorem/covariant discriminant algorithm/pseudo-amino acid composition/spectral analysis/weighted
-SVM
| Introduction |
|---|
|
|
|---|
Owing to the development of high-throughput sequencing technology, the data in various biology databases have been increasing at an unprecedented rate, which challenges the speed and ability of biologists and computational scientists to analyze these data. Because most of the specific functions of a cell are carried out by the membrane proteins (see e.g. Alberts et al., 1994; Lodish et al., 1995), prediction of membrane protein types has become a vitally important subject in molecular and cellular biology. Although the type of a membrane protein can be determined by various biochemical experiments, it is both time consuming and costly if the determination is based on an experimental approach alone. In view of this, it is highly desirable to develop an automated method to expedite the speed of determination.
Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins (Figure 1). The function of a membrane protein is closely related to the type to which it belongs. Therefore, a fast and efficient method for predicting the type of membrane protein will significantly speed up the process of function determination for newly found membrane proteins. In a pioneering study, based on the amino acid composition, the covariant discriminant algorithm was introduced by Chou and Elrod (1999)
to predict the types of membrane proteins. By definition, the conventional amino acid composition is a vector of 20 components, each representing the frequency of occurrence of one of the 20 native amino acids (Nakashima et al., 1986
; Chou, 1995
; Zhou, 1998
). Accordingly, using the amino acid composition to represent a sample of protein will miss all the sequence-order and sequence-length effects. In order to cope with this problem, a new concept, the so-called pseudo-amino acid composition, was proposed by Chou (2001)
. The pseudo-amino acid composition can bear the main features of amino acid composition, but meanwhile it can also incorporate some sequence order effects. Stimulated by its success in improving prediction quality, here we introduce a different approach to formulate the pseudo-amino acid composition.
|
Meanwhile, SVM (support vector machine) has recently been widely used in bioinformatics. However, its performance is greatly limited by the uneven sizes of the subsets in the training dataset. The classification results based on SVM are undesirably biased toward the class with more samples in the corresponding subset. In other words, the larger the size of a subset, the smaller is the classification error; whereas the smaller the size of a subset, the larger is the classification error. In the dataset constructed by Chou and Elrod (1999)
-SVM) to cope with this problem.
This paper is devoted to combining the concept of pseudo-amino acid composition and
-SVM to develop a new predictor for predicting the membrane protein types.
| Pseudo amino acid composition and discrete Fourier transform |
|---|
|
|
|---|
A protein sequence can be represented as a series of amino acids by their single-character codes A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, formulated as
![]() | (1) |
The conventional amino acid composition is defined as 20 discrete numbers each representing the frequency of occurrence of one of the 20 native amino acids (Nakashima et al., 1986
; Chou and Zhang, 1994
; Chou, 1995
; Zhou, 1998
). Compared with the conventional amino acid composition, the pseudo-amino acid composition is a vector with 20+
discrete components (Chou, 2001
) and hence may be viewed as a point in a (20+
)-D space, as given by
![]() | (2) |
are related to
different ranks (Figure 2) of sequence-order correlation factors as formulated by the following equation (Chou, 2001
![]() | (3) |
|
In the above equation, L denotes the length of the protein and
i is called the ith rank of coupling factor that harbours the ith sequence-order correlation factor. An illustration to show how these factors are associated with the sequence order effect is given in Figure 2. The coupling factor Ji,j in Equation 3 is defined as a function of the amino acids Ri and Rj, such as the physicochemical distance (Schneider and Wrede, 1994
i can be rewritten as:
![]() | (4) |
As can be seen from Figure 2, the sequence-order effect of a protein can be, to some extent, reflected through a set of discrete numbers
1,
2,
3, ...,
m, as defined by Equation 4. Such information is very useful in the analysis of proteins with a set of discrete numbers. Accordingly, the first 20 components of Equation 1 reflect the effect of the amino acid composition, whereas the components from 20 + 1 to 20 +
reflect some sequence-order effects. A set of 20 +
components as formulated by Equations (1) and (2) is called the pseudo-amino acid composition for protein P. Such a name is used because it still has the main features of amino acid composition, but on the other hand it contains the information beyond the conventional amino acid composition. The pseudo-amino acid composition thus defined has the following advantages: compared with the 210-D pair-coupled amino acid composition (Chou, 1999
) and the 400-D first-order coupled amino acid composition (Liu and Chou, 1999
) that contain the sequence order effect only for a very short range (i.e. within two adjacent amino acid residues along a chain), the pseudo-amino acid composition incorporates much more sequence effects, i.e. those not only for the short range but also for the medium and long range, as indicated by a series of sequence-coupling factors with different tiers of correlation (see Figure 2 and Equations 2
4). Therefore, the prediction quality can be significantly improved by using the pseudo-amino acid composition to represent the sample of a protein. For detailed formulation and application of the pseudo-amino acid composition, readers are referred to two recent papers (Chou, 2001
; Chou and Cai, 2003
).
Below we shall use the technique of spectral analysis to formulate the pseudo-amino acid composition. As shown in Equation 1, the sequence of a protein is composed of a series of characters, which is hard for a computer to process because each element in the sequence is a linguistic symbol rather than a numerical value. To cope with this situation, each individual amino acid in the protein sequence has to be coded in a numerical way, i.e. expressed in terms of f(Ri) of Equation 4. As is well known, the hydrophilic value of an amino acid is a very important physicochemical property that has crucial effects on the folding of a protein as well as its function, particularly for membrane proteins. In view of this, we choose the hydrophilic value of Ri for f(Ri). Since a coded protein sequence can be treated as a stationary random process, many technologies in statistical signal processing can be used to characterize the sequence-order effects of a protein sequence.
In statistical signal processing, the correlation, covariance sequence and spectral density function are the three basic statistical quantities of discrete random signals. The true cross-correlation sequence is a statistical quantity defined as
![]() | (5) |
![]() | (6) |
The autocorrelation and autocovariance are their special cases as defined as follows
![]() | (7) |
![]() | (8) |
![]() | (9) |
Comparing Equation 9 and Equations 3 and 4, we can see that the sequence-order correlation factors
m defined by Chou (2001)
are virtually an autocorrelation sequence of the coded protein sequence. Hence all the powerful tools in statistical signal analysis can be used to incorporate sequence-order effects. The goal of spectral analysis is to describe the distribution (over frequency) of the power contained in a signal, based on a finite set of data. The power spectrum of a stationary random process xn is mathematically related to the correlation sequence by the discrete-time Fourier transform. In terms of physical frequency f (e.g. in hertz) is given by
![]() | (10) |
(m) (m = 1,...,L) is the autocorrelation sequence of xn as defined in Equation 9, and j is the sign indicating the imaginary part. Hence the power spectral density (PSD) of the stationary signal xn is defined as
![]() | (11) |
For real signals, the average power of a signal over a particular frequency and
can be found by integrating the PSD over that band:
![]() | (12) |
![]() | (13) |
Since the low-frequency components of a given PSD better reflect the global information (Chou, 1988
, 1989
) of a given signal than the high-frequency components, we took the first 20 components of the energy spectrum to incorporate most of the sequence order information. Thus the pseudo-amino acid composition of a protein can be defined in a 40-D space as given by
![]() | (14) |
![]() | (15) |
In the above equation, fi (i = 1, 2,...,20) denote the frequencies of occurrence of the 20 native amino acids in a protein that actually reflect the amino acid composition (Nakashima et al., 1986
; Chou and Zhang, 1993
; Chou, 1995
) and
j (j = 1, 2,...,20) the first 20 low-frequency coefficients of the energy spectrum of a given protein sequence that reflect some sort of sequence-order effect. The parameter w is used to control how much sequence order information should be considered. In this paper, w was set as 0.15 (Pan et al., 2003
).
Compared with the original pseudo-amino acid composition introduced by Chou (2001)
, the current representation has the following features. (1) It can be seen from Equation 10 that the PSD is mathematically related to the correlation sequence by the discrete-time Fourier transform. Therefore, the sequence order information harboured by autocorrelation sequence is also preserved by the PSD since the transform is linear and will not lose any information. (2) The PSD is a more compact representation. For the coded protein sequences, most of the signal's power concentrates on the low-frequency components of the PSD. Hence the 20 components of the PSD are sufficient to represent the protein sequence as the high-frequency parts contain little information. (3) The low-frequency components represent the global information (Chou, 1988
, 1989
) of the coded sequence. The type of protein can be reflected by the curve of the hydrophobic values of the residues. The curve's global shape, represented by the low-frequency components of the PSD, is more important in determining the type of membrane protein. For example, the appearance of several peaks and valleys in the curve may indicate that the corresponding protein is likely to be a multi-pass trans-membrane protein (Figure 1).
| Results and discussion |
|---|
|
|
|---|
Using the dataset constructed by Chou and Elrod (1999)
|
Instead of the covariant discriminant algorithm, which is a combination of the Mahalanobis distance (Mahalanobis, 1936
-SVM is proposed to improve the prediction accuracy of small subsets. For a detailed description of this algorithm, see Appendix A, where a full introduction to
-SVM is given, followed by analysis of the reasons that lead to the undesirable bias toward the subset with more samples in the training set. Finally, by assigning the samples in different subsets with different weights, the unfavorable impact caused by the uneven class size is compensated.
During the operation, the width of the Gaussian RBFs was selected as 1 to minimize the estimation of the VC dimension. The parameter
is assigned as 0.06. The weight si (i = 1, 2, 3, 4, 5) for each class is determined by the following procedure:
- Set the largest set's smax as 1.
- Set other si (i = 1, 2, 3, 4, 5) by
, where
max denotes the number of training samples of the largestdata set and
i is the number of training samples of otherdata sets.
- Train the weighted
-SVM with the given sample's weight si (i = 1, 2, 3, 4, 5) for each class and classify the new entered data.
After being trained, the hyper-plane was built in the feature space and thus the output could be obtained. The prediction quality was examined by three methods (Chou and Zhang, 1995
), the re-substitution test, the jackknife test and independent dataset test, as explained below.
Re-substitution test
The so-called re-substitution test is designed to examine the self-consistency of an identification method (Zhou, 1998
; Cai, 2001
; Zhou and Assa-Munt, 2001
; Zhou and Doctor, 2003
). When the re-substitution test is performed for the current classifier, the type of each membrane protein in a data set is in turn predicted using the rule parameters derived from the same training data set. In Table I, the success rate for the 2059 membrane proteins is listed; the overall success rate is 99.85%, which shows that after being trained, the weighted
-SVM has captured the complicated relationship between the pseudo-amino acid composition and the types of membrane proteins. Because the rule parameters derived from the training data set harbor the information of the query protein later plugged back into the test, the re-substitution test tends to underestimate the error and enhance the success rate. Therefore, the success rate thus obtained may give some sort of optimistic estimation (Chou and Zhang, 1994
; Zhou, 1998
; Cai, 2001
; Zhou and Assa-Munt, 2001
; Zhou and Doctor, 2003
). However, the re-substitution test is definitely necessary because any algorithm whose self-consistency performance is poor cannot be deemed a good one. Namely, the re-substitution test is necessary but not sufficient for evaluating a classifier. Hence a cross-validation test for an independent testing data set is recommended because it can reflect the generalization of a classifier in practical applications. This is very useful when checking the validity of a training database for whether it contains sufficient information to reflect all the important features concerned so as to yield a high success rate in application. The results of the re-substitution test obtained for the 2059 membrane proteins are given in Tables I and II.
|
Jackknife test
The independent dataset test, sub-sampling test and jackknife test are the three most common methods for cross-validation in statistical prediction. Among these three, the jackknife test is regarded as the most objective and effective one; see, e.g., Chou and Zhang (1995)
for a comprehensive discussion of this and Mardia et al. (1979)
for the mathematical principles. During jackknifing, each membrane protein in the dataset is in turn taken out and all the rule parameters are calculated based on the remaining proteins. In other words, the type of each membrane is predicted using the rule parameter derived from all the other membrane proteins except that which is being identified. During the process of jackknifing, both the training data set and testing data set are actually open and a protein will move from one to the other in turn. The results of the jackknife test thus obtained for the 2059 membrane proteins are also given in Tables I and II.
Independent dataset test
Furthermore, predictions were also conducted for the 2625 independent membrane proteins based on the rule parameter derived from the 2059 proteins in the training dataset. The 2625 independent proteins were also taken from Chou and Elrod (1999)
. Of the 2625 proteins, 478 are type I transmembrane proteins, 180 type II transmembrane proteins, 1867 multi-pass transmembrane proteins, 14 lipid-chain anchored membrane proteins and 86 GPI anchored membrane proteins. The predicted results are also listed in Tables I and II.
From Table I, we may draw the following conclusions. (1) The success predictions obtained by the pseudo-amino acid composition approach are significantly higher than those obtained by the other approaches. (2) A comparison between the current approach and all the other approaches indicates that the success rates by the former are about 6% higher than those by the latter in the self-consistency test and 3% higher in independent dataset test. The only setback is that the jackknife result is about 3.7% lower than that of the functional-domain approach (Cai et al., 2003b
), but higher than all other methods.
It should be pointed out that it is not sufficient only to compare the overall success rate. To make an in-depth comparison, one should look into the success rate for each type. In order to illustrate the weighted
-SVM's ability to compensate for the bias caused by the imbalance of the dataset, a comparison with its original form was done as listed in Table 2, from which the following facts can be deduced. (1) For class 4, which is with the smallest size, the success rate by the weighted
-SVM is about 31% higher than that by the
-SVM by the self-consistency test and about 51% higher by the independent dataset test. For the jackknife test, the weighted
-SVM also outperformed the
-SVM by about 5%. This is fully consistent with what is expected because the effects caused by the dataset size have been taken into consideration during the process of algorithm formulation. (2) The success rates by the weighted
-SVM are higher than or equal to those obtained by the original
-SVM in the self-consistency test, jackknife test and independent dataset test, for classes 1, 2, 4 and 5. (3) For class 3, which has the largest size, the success rates by the weighted
-SVM are lower than those by the
-SVM in the jackknife test and independent dataset test. However, the overall success rates by the weighted
-SVM are higher than those by
-SVM. (4) The results are fully consistent with what we expected: the success rates for the classes with small size are improved at the cost of slightly reducing the success rates for large size classes. This is rational because the samples in small classes are treated as more important, i.e. assigning a larger coefficient for the data in the small class to improve its prediction accuracy. Meanwhile, the importance is that the overall success rates were enhanced.
| Conclusion |
|---|
|
|
|---|
The above results indicate that the types of membrane proteins are predictable with considerable accuracy. The development of the statistical prediction of protein attributes generally consists of two aspects: constructing a training dataset and formulating a prediction algorithm. The latter also consists of two aspects, i.e. how to define a protein and how to operate the prediction. The process in expressing a protein from the 20-D amino acid composition space (Nakashima et al., 1986
)-D pseudo-amino acid composition space (Chou, 2001
-SVM algorithm reflects the development in operating algorithms. The weighted
-SVM algorithm is particularly useful in solving the problem caused by uneven sizes of the subsets in the training dataset or dealing with the case where the classification accuracy is focused on a small subset. | Appendix A: Weighted support vector machines |
|---|
|
|
|---|
Owing to its characteristics of global optimization, sparseness of the solution and the use of kernel-induced feature spaces, SVMs are widely used in bioinformatics. SVMs have been applied to protein fold recognition (Ding and Dubchak, 2001
-SVM (Scholkopf et al., 2000
Based on the concept of weighted-SVM (Fan et al., 2003
), we adopt its specific form, the weighted
-SVM, and apply it to the problem of prediction of membrane protein types.
-SVM
The basic idea of applying SVMs to pattern classification can be outlined as follows. First, map the input vectors into a feature space (possible with a higher dimension), either linearly or non-linearly, which is relevant to the selection of the kernel function. Then, within the feature space, seek an optimized linear division, i.e. construct a hyper-plane which can separate the entire samples to two classes (this can be extended to multi-classes) with the least errors and maximal margin. The SVMs training process always seeks a global optimized solution and avoids over-fitting, so it has the ability to deal with a large number of features. A complete description to the theory of SVMs for pattern recognition was given by Vapnik (1998)
.
Given a set of
samples, i.e. a series of input vectors
![]() | (A1) |
d is a Euclidean space with d dimensions. Since the multi-class identification problem can always be converted into a two-class identification problem, without loss of generality, the formulation below is given for the two-class case only. Suppose that the output derived from the learning machine is expressed by yi
{+1, 1} (i = 1, ..., N), where the indices 1 and +1 are used to stand for the two classes concerned, respectively, the goal here is to construct one binary classifier or derive one decision function from the available samples that has a small probability of misclassifying a future sample.
-SVM (Scholkopf et al., 2000
) uses the parameter
to control the number of support vectors and errors. Its primal problem is
![]() | (A2) |
![]() | (A3) |
![]() | (A4) |
To calculate b and
in the above equation, we need to select the same number of samples (S > 0 is the number of samples) from the two datasets. Suppose S+ is the number of samples from the positive training dataset and S that from the negative training dataset. According to the KarushKuhnTucker (KKT) conditions (Karush, 1939
; Cristianini and Shawe-Taylor, 2000
), the condition in Equation A2
![]() | (A5) |
![]() | (A6) |
![]() | (A7) |
:
![]() | (A8) |
![]() | (A9) |
In C-SVM, the only adjustable parameter is the constant C, which influences its performance greatly. However, because there is no natural interpretation of this parameter, it is hard to adjust it. In
-SVM, the parameter C is replaced by
. In this parameterization (Equation A3),
places a lower bound on the sum of the
i, which causes the linear term to be dropped from the objective function. Another connection between these two algorithms is that an increase in parameter C leads to a decrease in the number of support vectors in C-SVM, while a decrease in
leads to a smaller number of support vectors.
Weighted
-SVM
The performance of
-SVM is also impaired when training sets with uneven class sizes are used. We propose the weighted
-SVM to solve this problem. The primal problem of weighted
-SVM is given by
![]() | (A10) |
![]() | (A11) |
i
0, ßi
0,
0 are all Lagrange multipliers. Differentiating and imposing a stationary condition, we obtain
![]() | (A12) |
![]() | (A13) |
![]() | (A14) |
![]() | (A15) |
![]() | (A16) |
In the weighted
-SVM, by assigning training samples of different classes different weights, we can compensate for the unfavorable impact caused by the uneven class size. For simplicity, we analyze the case of two classes. We assign si = s+ for positive class and si = s for negative class. Hence the constraint in Equation A11 is reduced to
![]() | (A17) |
![]() | (A18) |
Because most of the gaps are not zero in the trained SVMs, according the constraint in Equation A1,
is greater than zero. By KuhnTucher theory, we obtain 
= 0. If
> 0, then
= 0. Substituting this into Equation A6, we obtain
![]() | (A19) |
![]() | (A20) |
Before further analysis, we need to define several notations. The support vector (SV) is the training sample whose dual variables
i
0. The normal support vector (NSV) is defined as the training sample whose dual variables
. The boundary support vector (BSV) is the training sample that both satisfy
and
i > 0. The BSVs are misclassified training points.
Because the BSV has the property
, the sum of NBSV+
i (BSV's dual variable) is less than
, which is the number of BSVs that belong to the positive class:
![]() | (A21) |
![]() | (A22) |
i is
and Nsv+ denotes the number of training samples belonging to the positive class. Thus, we have
![]() | (A23) |
![]() | (A24) |
![]() | (A25) |
![]() | (A26) |
![]() | (A27) |
![]() | (A28) |
+ is the number of positive training samples,
the number of negative training samples and
=
+ +
. The
in Equation A18 and
in Equation A19 may be interpreted as the rates of accuracy of positive and negative classes, respectively. From Equations A20 and A21, it can be shown that the rate of accuracy for positive class is upper bounded by
and the rate of accuracy for negative class is upper bounded by
. Therefore, in order to balance two classes' rates of accuracy, we only need to force
. Then we have the following equation:
![]() | (A29) |
In the weighted
-SVM, by weighting the samples in the small class, the classification accuracy of the small class can be improved. Meanwhile, eliminating the bias toward the large class, the prediction accuracy of the large class is reduced slightly. Such a method can be applied directly to the prediction of membrane protein types where the training sets of five classes are highly uneven.
In this paper, we use the one-against-one approach (Knerr et al., 1990
), in which k(k 1)/2 classifiers are constructed and each one trains data from two different classes; k is the number of classes and in this paper k = 5. During the training stage, we first calculate the ratio of different sample sizes according to Equation A29 and then assign different weights to the training samples of different classes as described in the next section. In classification we use a voting strategy: each binary classification is considered to be a voter where votes can be cast for all data points; the end point is designated to be in a class with maximum number of votes.
| Acknowledgments |
|---|
The authors wish to thank the three anonymous referees whose comments were very helpful in strengthening the presentation of this work.
| References |
|---|
|
|
|---|
Bock,J.R. and Gough,D.A. (2001) Bioinformatics, 17, 455460.
Cai,Y.D. (2001) Proteins, 43, 336338.[CrossRef][ISI][Medline]
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2000) Mol. Cell Biol. Res. Commun., 4, 230233.[CrossRef][Medline]
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002a) Comput. Chem., 26, 293296.[CrossRef][ISI][Medline]
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002b) Internet Electron. J. Mol. Des., 1, 219226.
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002c) Peptides, 23, 205208.[CrossRef][ISI][Medline]
Cai,Y.D., Liu,X.J., Xu,X.B. and Chou,K.C. (2002d) J. Cell. Biochem., 84, 343348.[CrossRef][ISI][Medline]
Cai,Y.D., Lin,S. and Chou,K.C. (2003a) Peptides, 24, 159161.[CrossRef][ISI][Medline]
Cai,Y.D., Zhou,G.P. and Chou,K.C. (2003b) Biophys. J., 84, 32573263.
Cedano,J., Aloy,P., Perez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594600.[CrossRef][ISI][Medline]
Chew,H.G., Crisp,D.J. and Bogner,R.E. (2000) Target detecting in radar imagery using support vector machines with training size biasing. Singapore.
Chew,H.G., Bogner,R.E. and Lim,C.C. (2001) Dual nu-support vector machine with error rate and training size biasing. Salt Lake City, pp. 12691272.
Chou,J.J. and Zhang,C.T. (1993) J. Theor. Biol., 161, 251262.[CrossRef][ISI][Medline]
Chou,K.C. (1988) Biophys. Chem., 30, 348.[CrossRef][ISI][Medline]
Chou,K.C. (1989) Trends Biochem. Sci., 14, 212.[CrossRef][ISI][Medline]
Chou,K.C. (1995) Proteins, 21, 319344.[CrossRef][ISI][Medline]
Chou,K.C. (1999) J. Protein Chem., 18, 473480.[CrossRef][ISI][Medline]
Chou,K.C. (2000) Biochem. Biophys. Res. Commun., 278, 477483.[CrossRef][ISI][Medline]
Chou,K.C. (2001) Proteins, 43, 246255; Erratum, 2001, 44, 60.[CrossRef][ISI][Medline]
Chou,K.C. (2002) In Weinrer,P.W. and Lu,Q. (eds), Gene Cloning and Expression Technologies. Westborough, MA, Eaton Publishing, pp. 5770.
Chou,K.C. and Cai,Y.D. (2002) J. Biol. Chem., 277, 4576545769.
Chou,K.C. and Cai,Y.D. (2003) J. Cell. Biochem., 90, 12501260; Addendum, 2004, 91, 1085.[CrossRef][ISI][Medline]
Chou,K.C. and Elrod,D.W. (1999) Proteins, 34, 137153.[CrossRef][ISI][Medline]
Chou,K.C. and Zhang,C.T. (1994) J. Biol. Chem., 269, 2201422020.
Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275349.[ISI][Medline]
Chou,P.Y. (1980) In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Cristianini,N. and Shawe-Taylor,J. (2000) Support Vector Machines. Cambridge: Cambridge University Press.
Ding,C.H. and Dubchak,I. (2001) Bioinformatics, 17, 349358.
Fan,X.W., Du,S.X. and Wu,T.J. (2003) J. Image Graphics, 8, 10371042.
Guo,Z.M. (2002) Master's Thesis, Shanghai Jiaotong University.
Hua,S.J. and Sun,Z.R. (2001) J. Mol. Biol., 308, 397407.[CrossRef][ISI][Medline]
Karush,W. (1939) MSc Thesis, University of Chicago.
Knerr,S., Personnaz,L. and Dreyfus,G. (eds) (1990) Single-layer Learning Revisited, a Stepwise Procedure for Building and Training Neural Networks. Berlin: Springer.
Lee,Y.J. and Mangasarian,O.L. (2001) RSVM, Reduced Support Vector Machines.
Lin,C.F. and Wang,S.D. (2002) IEEE Trans. Neural Networks, 13, 464471.[CrossRef]
Liu,W. and Chou,K.C. (1999) Protein Eng., 12, 10411050.
Mahalanobis,P.C. (1936) Proc. Natl Inst. Sci. India, 2, 4955.
Mardia,K.V., Kent,J.T. and Bibby,J.M. (1979) Multivariate Analysis. London: Academic Press, pp. 322381.
Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 5461.[CrossRef][ISI][Medline]
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152162.
Pan,Y.X., Zhang,Z.Z., Guo,Z.M., Feng,G.Y., Huang,Z.D. and He,L. (2003) J. Protein Chem., 22, 395402.[CrossRef][ISI][Medline]
Pillai,K.C.S. (1985) In Kotz,S. and Johnson,N.L. (eds), Encyclopedia of Statistical Sciences. New York: Wiley, pp. 176181.
Schneider,G. and Wrede,P. (1994) Biophys. J., 66, 335344.[ISI][Medline]
Scholkopf,B., Smola,A. and Williamson,R.C. (2000) Neural Comput., 12, 12071245.
Suykens,J. and Vandewalle,J. (1999) Neural Process. Lett., 9, 293300.[CrossRef]
Vapnik,V.N. (1995) The Nature of Statistical Learning Theory. Berlin: Springer.
Vapnik,V.N. (1998) Statistical Learning Theory. New York: Wiley-Interscience.
Zhou,G.P. (1998) J. Protein Chem., 17, 729738.[CrossRef][ISI][Medline]
Zhou,G.P. and Assa-Munt,N. (2001) Proteins, 44, 5759.[CrossRef][ISI][Medline]
Zhou,G.P. and Doctor,K. (2003) Proteins, 50, 4448.[CrossRef][ISI][Medline]
Received May 10, 2004; revised July 23, 2004; accepted July 26, 2004.
Edited by Alan Fersht
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. Bhasin and G. P. S. Raghava GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors Nucleic Acids Res., July 1, 2005; 33(suppl_2): W143 - W147. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||














































