PEDS Advance Access originally published online on December 19, 2007
Protein Engineering Design and Selection 2008 21(1):37-44; doi:10.1093/protein/gzm084

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

Physicochemical feature-based classification of amino acid mutations

Bairong Shen1,3, Jinwei Bai1 and Mauno Vihinen1,2

1 Institute of Medical Technology, FI-33014 University of Tampere, Finland
2 Research Unit, Tampere University Hospital, FI-33520 Tampere, Finland

3 To whom correspondence should be addressed. E-mail: bairong.shen{at}uta.fi


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Conclusions
 Acknowledgement
 References
 
A huge quantity of gene and protein sequences have become available during the post-genomic era, and information about genetic variations, including amino acid substitutions and SNPs, is accumulating rapidly. To understand the effects of these changes, it is often essential to apply bioinformatics tools. Where there is a lack of homologous sequences or a three-dimensional structure, it becomes essential to predict the effects of mutations based solely on protein sequence information. Several computational methods utilizing machine learning techniques have been developed. These predictions generally use the 20-alphabet amino acid code to train the model. With limited available data, the 20-alphabet amino acid features may introduce so many parameters that the model becomes over-fitted. To decrease the number of parameters, we propose a physicochemical feature-based method to forecast the effects of amino acid substitutions on protein stability. Protein structure alterations caused by mutations can be classified as stabilizing or destabilizing. Based on experimental folding–unfolding free energy (ΔΔG) values, we trained a support vector machine with a cleaned data set. The physicochemical properties of the mutated residues, the number of neighboring residues in the primary sequence and the temperature and pH were used as input attributes. Different kernel functions, attributes and window sizes were optimized. An average accuracy of 80% was obtained in cross-validation experiments.

Keywords: amino acid/mutation/physicochemical properties/protein stability/support vector machine


    Introduction
 
Mutated proteins are important and helpful for investigating protein functions, testing hypotheses and understanding genotype–phenotype relationships, as well as for rational protein design and engineering. It is well known that accumulation of somatic mutations can lead to cancers. Hereditary diseases are caused by one or more germline mutations. All sorts of genetic variations, including single nucleotide polymorphisms (SNPs), are now being identified in large numbers (Consortium, 2005; Chanock et al., 2007). In protein engineering, improvements to the thermostability and the catalytic properties of proteins have been the main goals (Chica et al., 2005; Bommarius et al., 2006). To investigate the functions of proteins, to understand the molecular mechanisms of diseases and to design stable proteins, site-directed mutagenesis techniques are widely used (Kearns-Jonker et al., 2007; Rajendhran and Gunasekaran, 2007), although the design and construction of mutations is often time-consuming and costly. Information about disease-causing genetic alterations has been collected in databases (Stenson et al., 2003; Väliaho et al., 2006; Horaitis et al., 2007). These mutations need further interpretation before the molecular mechanisms of the diseases can be understood. Computational methods have been developed and used for in silico screening and design and for understanding the effects of mutations (Wang et al., 1998; Wright and Lim, 2001; Shen and Vihinen, 2003, 2004; Capriotti et al., 2005a, 2005b; Ferrer-Costa et al., 2005; Sobolev et al., 2005; Cheng et al., 2006; Huang et al., 2007).

The consequences of mutations can be estimated using numerous tools. We have applied more than 30 applications to investigate the effects of disease-causing mutations (Thusberg and Vihinen, 2006; Väliaho et al., 2006). These methods include, for example, general tests for mutation tolerance [programs like SIFT (Ng and Henikoff, 2001; Ng and Henikoff, 2003) and PolyPhen (Sunyaev et al., 2000; Sunyaev et al., 2001; Ramensky et al., 2002)], sequence conservation and covariance [e.g. ProCon (Shen and Vihinen, 2004), aaMI (Gloor et al., 2005) and MatrixPlot (Gorodkin et al., 1999)], and side chain packing [e.g. Probe (Word et al., 2000)]. One of the most important features is protein stability, which can be estimated based on sequence and protein three-dimensional structural information. Here, we concentrate on methods and tools for the analysis of the effects on stability.

When the protein structure is known, the effects of amino acid substitutions are often investigated using methods based on force-field theory, such as the free energy perturbation (FEP) technique (Rao et al., 1987; Hirono and Kollman, 1991; Kato et al., 2006). The FEP technique calculates the free energy difference between normal and mutated structures by molecular dynamics simulation. This method is very computationally intensive, and the results may still be sensitive to the computational procedures and are sometimes unreliable (Shi et al., 1993). The empirical force field of FoldX (Guerois et al., 2002) allows significantly faster run times, which also facilitates its use in design for protein engineering. The force field has been optimized for point mutations. FoldX can also be run from the SNPeffect server (Reumers et al., 2005, 2006).

Several methods that require less computation have therefore been developed and applied to amino acid mutation analysis. We can group these methods into four categories. The methods of the first category make predictions based on structural information, especially that of amino acid side chain rotamers, packing quality and residue–residue contacts (Tuffery et al., 1997; Sobolev et al., 1999; Word et al., 2000; Wang and Moult, 2001; Wright and Lim, 2001; Shen and Vihinen, 2003; Cuff and Martin, 2004). These techniques either compare the side chain chi (χ) angle values to the backbone independent rotamer library and examine conformational space for the mutant side chain, or estimate the residue–residue contact properties. The second method category uses the information from both multiple sequence alignments and from protein three-dimensional structures. The amino acid variations in families of related proteins are converted into propensity and substitution tables. The existence of an amino acid in a structural environment and the probability of the substitutions are estimated quantitatively (Topham et al., 1997). With the information from both the sequence and structural levels, multiple regression equations can be fitted to predict the folding–unfolding free energy difference for the mutations (Gromiha et al., 1999a, 1999b, 2000; Huang et al., 2007). The methods in the third category calculate a position-specific amino acid distribution based on multiple sequence alignments (Ng and Henikoff, 2001; Sunyaev et al., 2001; Ferrer-Costa et al., 2004, 2005; Shen and Vihinen, 2004). According to the distribution, the amino acid substitutions are classified as tolerated or deleterious. The fourth category of methods uses solely protein sequence information and predicts the protein stability by machine learning algorithms (Capriotti et al., 2004, 2005a, 2005b; Cheng et al., 2006). Although useful information about the effects of mutations can be obtained with the above methods, there is still a need for improved and more accurate prediction methods.

The method we have developed belongs to the last category mentioned above and aims to predict the stability effects of amino acid substitutions based on single sequence information. These kinds of methods are useful and, in fact, are the only tools applicable in cases without close homologues and a known structure. Compared to the previous works (Capriotti et al., 2004, 2005a, 2005b; Cheng et al., 2006), our method uses a cleaned data set and takes the physicochemical similarity of amino acids into account during the training of the expert system. The new method uses a reasonable number of parameters, is more robust against over-fitting and generalizes better.


    Materials and methods
 
Our work aims to classify mutations as stabilizing or destabilizing. For machine learning-based classification, we need to build a model, obtain a set of data and select the training attributes. A support vector machine (SVM) was used as the model and trained with data extracted from the ProTherm database (Gromiha et al., 2002). The input attributes used for training were the physicochemical properties of amino acids in the protein primary sequence.

Support vector machine

The basic idea of the SVM is shown in Fig. 1. Data that are not linearly separable in the input space can be transformed by a suitable kernel function into a high-dimensional feature space, where they can then be separated linearly.


Fig. 1. Principles of the SVM model. With a suitable kernel function, data that are not linearly separable in the input space can be mapped to a high-dimensional feature space in which they become linearly separable.

 
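The SVM algorithm was implemented as follows. Given a dataset X to classify into two classes, the decision function for a data point $\vec{x}$ is

$$ f(\vec{x}) = \mathrm{sgn}\left( \sum_{i} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b \right) \tag{1} $$

where $y_i \in \{-1, +1\}$ is the class label of the training point $\vec{x}_i$, $K$ is the kernel function, $b$ is a constant offset, and $\alpha_i$ is a non-negative number which can be obtained by maximizing the margin shown in Fig. 1. The maximization is actually performed on the following quadratic function,

$$ W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j) \tag{2} $$

subject to

$$ 0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0 \tag{3} $$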

Here, C is an adjustable parameter which controls the trade-off between training error and the margin. The training points $\vec{x}_j$ with non-zero $\alpha_j$ are known as support vectors, which are the points lying closest to the separating hyperplane. Equation (2) is a quadratic programming (QP) problem with a single global optimum. Since the SVM algorithm can find this global optimum, it has an advantage over neural networks (NN). Another advantage is that the SVM algorithm does not require much computer power even when the feature space dimensions increase, since the data points only appear in the inner products of the vectors.
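As an illustration of how a trained SVM classifies a new point, the following minimal Python sketch evaluates a kernel-expansion decision function of the usual form, sign(Σ α_i y_i K(x_i, x) + b). It is not the SVMlight implementation used in this work; the function names, the toy support vectors and the default dot-product kernel are assumptions made only for this example.

```python
def dot(a, b):
    """Plain inner product, used here as the default (linear) kernel."""
    return sum(p * q for p, q in zip(a, b))


def svm_decision(x, support_vectors, alphas, labels, bias=0.0, kernel=dot):
    """Return +1 or -1 for point x from a kernel expansion over support vectors.

    Only support vectors (points with non-zero alpha) contribute, and the
    data enter solely through kernel evaluations.
    """
    score = sum(alpha * y * kernel(sv, x)
                for alpha, y, sv in zip(alphas, labels, support_vectors))
    return 1 if score + bias >= 0 else -1


# Hypothetical toy model: one support vector per class.
toy_svs = [[-1.0, -1.0], [1.0, 1.0]]
toy_alphas = [0.5, 0.5]
toy_labels = [-1, 1]
print(svm_decision([2.0, 1.0], toy_svs, toy_alphas, toy_labels))
print(svm_decision([-1.0, -2.0], toy_svs, toy_alphas, toy_labels))
```

Because the kernel is passed in as a function, the same routine works unchanged with a non-linear kernel.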

Input attributes

For training the model, physicochemical properties including hydropathy (Eisenberg et al., 1984), flexibility (Vihinen et al., 1994), electronic charge concentration and the isotropic surface area (ISA) (Collantes and Dunn, 1995) of amino acids were used as input attributes (Table I). The residue flexibility was calculated by considering their neighboring residues (Karplus and Schulz, 1985; Vihinen et al., 1994). The environment of the mutated residue was taken into account by including the neighboring residues with a sliding window technique. Different window sizes from 3 to 19 were tested to optimize the prediction performance.


Table I. Physicochemical properties of amino acids

 
Since the Gibbs free energy change depends on experimental conditions such as temperature and pH, we included these two values as input attributes for SVM training. The total number of input attributes for the SVM model was Np × (Winsize + 1) + 2, where Np is the number of physicochemical properties and Winsize is the size of the sliding window. The final two attributes are the temperature and pH value recorded for the mutants.
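The attribute count can be made concrete with a short Python sketch. The layout below is an assumption for illustration only: Np property values for each residue in the window, a further Np values for the substituted residue, then temperature and pH. The property values in `toy_props` are invented placeholders, not the values of Table I, and zero-padding at sequence ends is likewise an assumption.

```python
def n_attributes(n_props, win_size):
    """Number of input attributes: Np * (Winsize + 1) + 2."""
    return n_props * (win_size + 1) + 2


def encode_mutation(sequence, pos, mutant, props, win_size, temperature, ph):
    """Build one feature vector for a substitution at 0-based index pos.

    props maps a one-letter residue code to a list of Np property values.
    Window positions that fall outside the sequence are zero-padded.
    """
    half = win_size // 2
    n_props = len(next(iter(props.values())))
    features = []
    for i in range(pos - half, pos + half + 1):
        if 0 <= i < len(sequence):
            features.extend(props[sequence[i]])
        else:
            features.extend([0.0] * n_props)
    features.extend(props[mutant])        # properties of the new residue
    features.extend([temperature, ph])    # experimental conditions
    return features


# Placeholder values (hypothetical), four properties per residue.
toy_props = {"A": [0.1, 0.2, 0.3, 0.4],
             "G": [0.0, 0.5, 0.1, 0.2],
             "L": [0.9, 0.3, 0.2, 0.8]}
vector = encode_mutation("AGLAG", 2, "A", toy_props, 3, 25.0, 7.0)
print(len(vector), n_attributes(4, 3))
print(n_attributes(4, 13), n_attributes(20, 13))
```

With four properties and a window of 13 the count is 58, whereas a 20-letter binary encoding with the same window would need 282 attributes.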

Data set and SVM implementation

The dataset to train the SVM was extracted from the ProTherm database http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html (Gromiha et al., 2002). Since the experimental error for the measurement of ΔΔG could be ±0.4–0.5 kcal/mol (Khatun et al., 2004), the data with |ΔΔG| < 0.5 kcal/mol would be difficult to classify. Thus, we only extracted the cases with |ΔΔG| > 0.5 kcal/mol for the model training and testing. During further data cleaning, we removed double mutations and averaged the values when several ΔΔG values from different sources were available for a single mutation. Finally, we had data for 1448 mutations in 68 proteins. Of these, 1100 were destabilizing (ΔΔG ≤ –0.5 kcal/mol) and 348 were stabilizing (ΔΔG ≥ 0.5 kcal/mol) alterations. The proteins and numbers of mutations are listed in Table II. For the SVM training, the mutations with ΔΔG ≥ 0.5 kcal/mol were labeled as positive cases, with yi = +1, and the mutations with ΔΔG ≤ –0.5 kcal/mol were labeled as negative cases, with yi = –1.
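The cleaning and labeling steps can be sketched in a few lines of Python. This is a simplified illustration, not the scripts used in this work: the record format and mutation identifiers are hypothetical, and the sketch averages repeated measurements before applying the 0.5 kcal/mol threshold, which is an assumption about the order of operations.

```python
from collections import defaultdict


def label_mutations(records, threshold=0.5):
    """Average repeated ddG values per mutation and assign class labels.

    records: iterable of (mutation_id, ddg) pairs in kcal/mol.
    Returns {mutation_id: +1 (stabilizing) or -1 (destabilizing)};
    mutations whose mean ddG lies inside (-threshold, threshold) are dropped.
    """
    grouped = defaultdict(list)
    for mutation_id, ddg in records:
        grouped[mutation_id].append(ddg)

    labels = {}
    for mutation_id, values in grouped.items():
        mean_ddg = sum(values) / len(values)
        if mean_ddg >= threshold:
            labels[mutation_id] = 1
        elif mean_ddg <= -threshold:
            labels[mutation_id] = -1
    return labels


# Hypothetical records: one mutation measured twice, one near zero.
toy_records = [("p1:A10G", 1.2), ("p1:A10G", 0.8),
               ("p1:L5V", -0.9), ("p2:K3R", 0.2)]
print(label_mutations(toy_records))
```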


Table II. Point mutations (total 1448) in the 68 proteins used in the classification

 
SVMlight (Joachims, 1999) (http://svmlight.joachims.org/) was used for model building. Other calculations, such as the physicochemical property profiles of sequences and Matthews correlation coefficient (MCC) calculations, were performed with in-house Perl scripts.

Accuracy of predictions

The accuracy of the predictions was measured by 10-fold cross-validation, which partitioned the data randomly into 10 sets and trained on 9/10ths of the data, then tested on the remaining 1/10th, repeated on each of the 10 sets, and averaged the results. The quality of the prediction is described by four parameters: accuracy, recall, precision and MCC.
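The partitioning scheme can be illustrated with a small Python generator. It is a sketch of generic k-fold splitting under assumed details (a fixed random seed and round-robin assignment of shuffled indices to folds), not the procedure actually scripted for this study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    as the test set exactly once while the other k-1 folds form the
    training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1448, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Scores computed on each test fold are then averaged to give the reported cross-validation result.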

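$$ \mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{4} $$

$$ \mathrm{Recall} = \frac{tp}{tp + fn} \tag{5} $$

$$ \mathrm{Precision} = \frac{tp}{tp + fp} \tag{6} $$
and

$$ \mathrm{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}} \tag{7} $$
where tp is the number of positive cases that were correctly predicted, tn is the number of negative cases correctly predicted, fp is the number of negative cases incorrectly predicted as positive and fn is the number of positive cases incorrectly predicted as negative.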
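For reference, the four quality measures can be computed from the confusion counts with the standard definitions, as in the Python sketch below. The function name and the convention of returning 0 when a denominator is zero are choices made for this example; the counts in the usage line are invented.

```python
import math


def classification_scores(tp, tn, fp, fn):
    """Accuracy, recall, precision and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "mcc": mcc}


# Hypothetical counts for illustration only.
scores = classification_scores(tp=40, tn=50, fp=5, fn=5)
print(scores)
```

A classifier that never predicts the positive class can still reach a high accuracy on an unbalanced data set while its recall and MCC are zero, which is why all four measures are reported.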


    Results and discussion
 
Foundation of the prediction

According to Anfinsen’s hypothesis (Anfinsen, 1973), the information of a protein’s tertiary structure is encoded in its primary sequence. Therefore, we can assume that stability information can also be explained with the sequence information. Different machine learning approaches have commonly been used to study sequence-structure relationships to make predictions, e.g. for protein secondary structures (Hoang et al., 2002; Boden et al., 2006), the solvent accessibility of residues (Adamczak et al., 2004; Sim et al., 2005), residue–residue contacts (Grana et al., 2005; Yuan, 2005; Cheng and Baldi, 2007) and for protein three-dimensional structures (Klepeis et al., 2005; Zhou and Skolnick, 2007).

Several machine learning methods are available for extracting patterns from complex data. NN and SVMs are among the most widely used methods for biological problems. SVMs have been applied successfully in many biological data analyses, such as in microarray data classification (Chu and Wang, 2005; Huang and Chang, 2006; Doran et al., 2007), protein solvent accessibility prediction (Yuan et al., 2002; Kim and Park, 2004; Wang et al., 2007), protein secondary structure prediction (Hu et al., 2004; Wang et al., 2004) and in protein stability analysis (Capriotti et al., 2005a, 2005b; Yue et al., 2005; Cheng et al., 2006).

The performance of SVMs can be adjusted and optimized by changing the kernel. We tested linear, polynomial and radial basis function kernels for speed and learning performance; they are described in equations (8)–(10).

Linear kernel:


$$ K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j \tag{8} $$

Polynomial kernel:


Formula 084M9 9

Radial basis function kernel:


$$ K(\vec{x}_i, \vec{x}_j) = \exp\left( -\gamma \, \lVert \vec{x}_i - \vec{x}_j \rVert^2 \right) \tag{10} $$

Here, d, σ and γ are adjustable parameters, and $\vec{x}_i$ and $\vec{x}_j$ are the data for classification. The radial basis function kernel was found to be the best for both the speed and the quality of the prediction. The other kernels performed poorly, with accuracies below 75%. A similar result was obtained when an SVM was applied to the prediction of sub-cellular localization (Hua and Sun, 2001a, 2001b).
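The three kernel families can be written compactly in Python, as sketched below. The linear and radial basis function forms are the standard ones. The polynomial form shown, (scale · x_i·x_j + offset)^d, is one common parameterization and is an assumption here, since the exact form and the role of each adjustable parameter in this work are not reproduced in the sketch; all function and argument names are hypothetical.

```python
import math


def linear_kernel(xi, xj):
    """Inner product of the two vectors."""
    return sum(a * b for a, b in zip(xi, xj))


def polynomial_kernel(xi, xj, d=3, scale=1.0, offset=1.0):
    """(scale * <xi, xj> + offset) ** d; parameterization assumed."""
    return (scale * linear_kernel(xi, xj) + offset) ** d


def rbf_kernel(xi, xj, gamma=0.1):
    """exp(-gamma * ||xi - xj||^2); equals 1 for identical points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


print(linear_kernel([1, 2], [3, 4]))
print(polynomial_kernel([1, 2], [3, 4], d=2))
print(rbf_kernel([0, 0], [1, 1], gamma=0.5))
```

A larger γ makes the radial basis function kernel decay faster with distance, so each support vector influences a smaller neighborhood.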

Performance of the prediction

Four different physicochemical descriptors, namely hydropathy, flexibility, ISA and electronic charge concentration, were used to characterize the amino acids. The applicability of the parameter combinations was tested. The prediction accuracy is plotted for different parameters and window lengths (ranging from 3 to 19) in Fig. 2. The best prediction results for each parameter combination are listed in Table III. The adjustable parameters C in equation (3) and γ in equation (10) were optimized over the ranges C = 1–30 and γ = 0.1–0.4. With C = 2 and γ = 0.1, the best prediction accuracy is 80.45% and the corresponding MCC is 0.39 (see Table III). Table III lists the highest scores among the optimized conditions. The scores were calculated by averaging over the ten cross-validation runs. The cross-validation accuracy (80.45%) for all four amino acid parameters with a window length of 13 is about 3 percentage points higher than the previously reported result (77%) (Capriotti et al., 2005a, 2005b).


Fig. 2. The relationship between window sizes, different attributes and prediction accuracies. For the SVM model, the prediction performance depends on the features and the window sizes. The five lines represent the five input feature combinations in Table III. Empty square, HIFE; filled square, HIE; empty circle, IFE; empty triangle, HIF; filled circle, HFE. The symbols H, I, E, F mean the same as in Table III.

 

Table III. Accuracy values for predictions with different input parameters

 
Input attributes and window size

The choice of attributes is crucial for pattern recognition or classification. According to the results in Table III, the predictions based on all four physicochemical properties are better than the others. The performance is nevertheless surprisingly even across the parameter combinations, and the differences between them are marginal. We also trained and predicted using just one or two physicochemical properties and found that the prediction performances were very poor (accuracy below 75% and recall close to 0). This indicates that the stabilizing/destabilizing classification is complex and cannot be achieved with too few features. Omitting ISA causes the largest drop in accuracy. It is well known that the hydrophobic effect is one of the principal forces stabilizing protein structures (Kauzmann, 1959), but it is not the only factor, and the prediction based solely on hydropathy performed poorly (accuracy below 75% and recall close to 0). ISA is a measurement of both the size and the proportion of the residue which is accessible for non-specific interactions with the solvent (Dunn et al., 1987; Collantes and Dunn, 1995). It characterizes both residue size and hydropathy. A change in the size of an amino acid side chain is critical for structural stability at buried sites (Wang et al., 1998; Liu et al., 2000; Ferrer-Costa et al., 2002).

The stability of a protein is determined by complex residue–residue contacts and interactions. Without structural information, the environment of a residue in many predictions is considered partially by taking into account the properties of neighbors with a sliding window average technique (Edelman and White, 1989; Hofmann and Stoffel, 1992; Vihinen et al., 1994; Fares, 2004). For the stability of a protein, it is difficult to account for the complex relationships between residues that are separated by a large distance in the sequence and the protein stability, since the structural neighboring residues may be located both within the local sequence neighborhood and far away in the sequence.

Support vectors

Support vectors are training examples that lie close to the decision boundary between the two classes. The SVM focuses upon the small subset of examples (SVs) that are critical and informative to the classification and disregards the remaining examples. Removal of SVs changes the location of the separating hyperplane and alters the efficiency of the classification. The number of support vectors was more than one-third of the sample size (Table III), indicating the difficulty of the classification. This observation is easy to understand by comparing the size of sequence space and structure folding space. Even very different protein sequences can fold to a similar three-dimensional structure. On the other hand, relatively similar sequences can have different folds. Protein structures can be robust to point mutations (Taverna and Goldstein, 2002). Still, numerous diseases arise due to single mutations. Residues within a protein take part in complex and non-linear interaction networks, which makes single sequence-based predictions difficult, especially when accounting for residues located far away in the sequence but close together in the folded structure.

Comparison with other sequence-based methods

Two sequence-based approaches, using single or multiple sequences, can be utilized for analyzing amino acid mutations. The multiple sequence methods are generally based on protein score matrices, such as the widely used PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix) (Topham et al., 1997; Boland and Murphy, 2001; Ferrer-Costa et al., 2002). Amino acid mutations are constrained evolutionarily by two factors: structural and functional constraints. The score matrices reflect both of these constraints. Many of the functional residues are located on the surfaces of proteins, where substitutions may have little effect on the protein structure and stability. For the classification of stabilizing and destabilizing mutations, it may not be necessary to account for the functional constraint. Therefore, the matrices which include both structural and functional information may in fact decrease the discriminative power of the classifier.

In Fig. 3, the relationships between protein stability (experimental ΔΔG) and changes in physicochemical properties (ΔH, ΔI, ΔE, ΔF) are shown. It is clear that no simple tendencies or relations can be found. Therefore, a non-linear model is necessary to describe and predict the relationships.


Fig. 3. The relationship between stability change (experimental ΔΔG) and changes in physicochemical properties. The symbols H, I, E, F mean the same as in Table III.

 
Our method uses single sequence information. Previously, this kind of classification has been based upon 20-alphabet attributes and SVM machine learning (Capriotti et al., 2005a, 2005b; Cheng et al., 2006). The results for the comparison of our model and the previous models are summarized in Table IV. The highest accuracy is for the method of Cheng et al. with one dataset, but the results vary greatly for this method with different test sets. Of note is the high variation, especially in the MCC values. The developers of the previous methods used 20-fold cross-validation, in which the test set is smaller relative to the training set. Our method gains 1% in accuracy and 0.2 in MCC compared to the results in Table IV if we use 20-fold instead of 10-fold cross-validation.


Table IV. Comparison of single sequence-based prediction methods

 
The present model improves the prediction in two aspects. First, we use a cleaned data set for the training and testing of our model. We excluded data points with a ΔΔG between –0.5 and 0.5 kcal/mol, because in this range the true difference cannot be resolved due to experimental uncertainty. These data may mislead the classification, since the experimental error can be up to ±0.5 kcal/mol and thus the mutations cannot be reliably grouped as stabilizing or destabilizing. We analyzed the classification result of the previous work by Capriotti et al. The data set was taken from the authors’ webpage (http://gpcr.biocomp.unibo.it/~emidio/I-Mutant2.0/dbMutSeq.html). This site lists the data set used for training and testing their SVM models. The experimental ΔΔG values and their predicted ΔΔG values are also given. In total, there are 2048 cases in the list, 594 of which have experimental ΔΔG values between 0.5 and –0.5 kcal/mol; 492 (24.0%) of the 2048 cases are misclassified. Figure 4 shows the misclassified cases. For these, the signs of the predicted and experimental ΔΔG values differ, so the product of the experimental ΔΔG and the predicted ΔΔG is less than zero. The majority of the misclassifications (250 cases, i.e. 50.8% of 492 cases) have experimental ΔΔG values between 0.5 and –0.5 kcal/mol (the data between the two lines in Fig. 4).
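The sign-product test used in this analysis is simple to express in code. The Python sketch below counts misclassified cases (experimental and predicted values of opposite sign) and how many of them fall inside the ±0.5 kcal/mol uncertainty band. The function name and the toy value pairs are hypothetical; only the final percentage line uses the counts reported above.

```python
def misclassification_summary(pairs, band=0.5):
    """Count sign disagreements and those within the uncertainty band.

    pairs: iterable of (experimental_ddg, predicted_ddg) in kcal/mol.
    A case is misclassified when the product of the two values is
    negative. Returns (n_misclassified, n_misclassified_within_band).
    """
    misclassified = [(exp, pred) for exp, pred in pairs if exp * pred < 0]
    within_band = [case for case in misclassified if abs(case[0]) < band]
    return len(misclassified), len(within_band)


# Hypothetical value pairs for illustration only.
toy_pairs = [(1.2, 0.8), (-0.3, 0.4), (0.2, -0.1), (-1.5, 0.6), (-2.0, -1.0)]
print(misclassification_summary(toy_pairs))

# Share of the reported misclassifications that lie inside the band.
print(round(100 * 250 / 492, 1))
```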


Fig. 4. The analysis of the previous prediction results by Capriotti et al. (Capriotti et al., 2005a, 2005b). For misclassified data, the signs of the predicted ΔΔG and the experimental ΔΔG are different, so their product is less than zero. Only the misclassified cases are shown, most of which (50.8%) are distributed in the region with experimental ΔΔG values between 0.5 and –0.5 kcal/mol.

 
As the second improvement, we use the numeric physicochemical properties of amino acids instead of the 20-alphabet. The properties of amino acids as attributes better explain the characteristics of residues. Protein folding and stability are mainly determined by the properties of amino acids in the primary sequence (Anfinsen, 1973). For 20-alphabet binary attributes, the number of input attributes is 20 × (Winsize + 1) + 2, which is about five times the number of attributes used in our model when utilizing the four physicochemical properties of amino acids as attributes. Furthermore, with binary attributes the similarity of amino acids has to be learnt from the training set. According to the conventional rule of thumb for pattern recognition, the number of training samples should be 5–10 times the number of model parameters (Kanal and Chandrasekaran, 1971). Optimization revealed the window size of 13 as the best classifier. We therefore used 58 (4 × 13 + 4 + 2) model parameters for the training. The binary representation for 20 amino acids would need 282 (20 × 13 + 20 + 2) parameters, which raises the required size of the training data set to between 1410 and 2820 cases (5–10 times 282). However, the presently available data for training is very limited (we have ~1500 cases, and only about 1350 cases could be used for training when using the 10-fold cross-validation method). With fewer parameters, our model is more robust against over-fitting and generalizes better.


    Conclusions
 
The problem of protein structure availability in the whole genome era has been redefined as the determination of representative structures of conserved protein families (Redfern et al., 2005; Chandonia and Brenner, 2006). But even with the success of this program, the demand for computational methods for discerning structural information from primary sequences will be higher than ever, since the protein sequences are widely diverse.

Site-directed mutagenesis is a commonly used technique. However, the relationship between a mutation and protein stability is often still an unresolved and difficult problem. Many studies are based on limited data and empirical rules. As the amount of data is continuously increasing, our task is to improve the prediction accuracy and refine the model. Our study indicates a promising approach to predicting mutation effects based solely on single sequence information.


    Footnotes
 
Edited by Jane Clarke


    Acknowledgement
 
We gratefully acknowledge the financial support of the Medical Research Fund of Tampere University Hospital.


    References
 
Adamczak R., Porollo A., Meller J. Proteins (2004) 56:753–767.

Anfinsen C.B. Science (1973) 181:223–230.

Boden M., Yuan Z., Bailey T.L. BMC Bioinformatics (2006) 7:68.

Boland M.V., Murphy R.F. Bioinformatics (2001) 17:1213–1223.

Bommarius A.S., Broering J.M., Chaparro-Riggers J.F., Polizzi K.M. Curr. Opin. Biotechnol. (2006) 17:606–610.

Capriotti E., Fariselli P., Casadio R. Bioinformatics (2004) 20(Suppl. 1):I63–I68.

Capriotti E., Fariselli P., Calabrese R., Casadio R. Bioinformatics (2005a) 21(Suppl. 2):ii54–ii58.

Capriotti E., Fariselli P., Casadio R. Nucleic Acids Res. (2005b) 33:W306–W310.

Chandonia J.M., Brenner S.E. Science (2006) 311:347–351.

Chanock S.J., et al. Nature (2007) 447:655–660.

Cheng J., Randall A., Baldi P. Proteins (2006) 62:1125–1132.

Cheng J., Baldi P. BMC Bioinformatics (2007) 8:113.

Chica R.A., Doucet N., Pelletier J.N. Curr. Opin. Biotechnol. (2005) 16:378–384.

Chu F., Wang L. Int. J. Neural. Syst. (2005) 15:475–484.

Collantes E.R., Dunn W.J. 3rd. J. Med. Chem. (1995) 38:2705–2713.

Consortium T.I.H. Nature (2005) 437:1299–1320.

Cuff A.L., Martin A.C. J. Mol. Biol. (2004) 344:1199–1209.

Doran M., Raicu D.S., Furst J.D., Settimi R., Schipma M., Chandler D.P. Bioinformatics (2007) 23:487–492.

Dunn W.J. 3rd, Koehler M.G., Grigoras S. J. Med. Chem. (1987) 30:1121–1126.

Edelman J., White S.H. J. Mol. Biol. (1989) 210:195–209.

Eisenberg D., Schwarz E., Komaromy M., Wall R. J. Mol. Biol. (1984) 179:125–142.

Fares M.A. Bioinformatics (2004) 20:2867–2868.

Ferrer-Costa C., Orozco M., de la Cruz X. J. Mol. Biol. (2002) 315:771–786.

Ferrer-Costa C., Orozco M., de la Cruz X. Proteins (2004) 57:811–819.

Ferrer-Costa C., Orozco M., de la Cruz X. Proteins (2005) 61:878–887.

Gloor G.B., Martin L.C., Wahl L.M., Dunn S.D. Biochemistry (2005) 44:7156–7165.

Gorodkin J., Staerfeldt H.H., Lund O., Brunak S. Bioinformatics (1999) 15:769–770.

Grana O., Baker D., MacCallum R.M., Meiler J., Punta M., Rost B., Tress M.L., Valencia A. Proteins (2005) 61(Suppl. 7):214–224.

Gromiha M.M., Oobatake M., Kono H., Uedaira H., Sarai A. Protein Eng. (1999a) 12:549–555.

Gromiha M.M., Oobatake M., Kono H., Uedaira H., Sarai A. J. Protein Chem. (1999b) 18:565–578.

Gromiha M.M., Oobatake M., Kono H., Uedaira H., Sarai A. J. Biomol. Struct. Dyn. (2000) 18:281–295.

Gromiha M.M., Uedaira H., An J., Selvaraj S., Prabakaran P., Sarai A. Nucleic Acids Res. (2002) 30:301–302.

Guerois R., Nielsen J.E., Serrano L. J. Mol. Biol. (2002) 320:369–387.

Hirono S., Kollman P.A. Protein Eng. (1991) 4:233–243.

Hoang T.X., Cieplak M., Banavar J.R., Maritan A. Proteins (2002) 48:558–565.

Hofmann K., Stoffel W. Comput. Appl. Biosci. (1992) 8:331–337.

Horaitis O., Talbot C.C. Jr, Phommarinh M., Phillips K.M., Cotton R.G. Nat. Genet. (2007) 39:425.[CrossRef][Web of Science][Medline]

Hu H.J., Pan Y., Harrison R., Tai P.C. IEEE Trans. Nanobiosci. (2004) 3:265–271.[CrossRef]

Hua S., Sun Z. Bioinformatics (2001) a 17:721–728.[Abstract/Free Full Text]

Hua S., Sun Z. J. Mol. Biol. (2001) b 308:397–407.[CrossRef][Web of Science][Medline]

Huang H.L., Chang F.L. Biosystems (2006) 90:516–528.[CrossRef][Web of Science][Medline]

Huang L.T., Saraboji K., Ho S.Y., Hwang S.F., Ponnuswamy M.N., Gromiha M.M. Biophys. Chem. (2007) 125:462–470.[CrossRef][Web of Science][Medline]

Joachims T. Making large-Scale SVM Learning Practical (1999) Cambridge, MA: MIT Press.

Kanal L., Chandrasekaran B. Pattern Recognit. (1971) 3:225–234.[CrossRef][Web of Science]

Karplus P.A., Schulz G.E. Naturwissenschaften (1985) 72:212–213.[CrossRef][Web of Science]

Kato M., Pisliakov A.V., Warshel A. Proteins (2006) 64:829–844.[CrossRef][Web of Science][Medline]

Kauzmann W. Adv. Protein Chem. (1959) 14:1–63.[Web of Science][Medline]

Kearns-Jonker M., Barteneva N., Mencel R., Hussain N., Shulkin I., Xu A., Yew M., Cramer D.V. BMC Immunol. (2007) 8:3.[CrossRef][Medline]

Khatun J., Khare S.D., Dokholyan N.V. J. Mol. Biol. (2004) 336:1223–1238.[CrossRef][Web of Science][Medline]

Kim H., Park H. Proteins (2004) 54:557–562.[CrossRef][Web of Science][Medline]

Klepeis J.L., Wei Y., Hecht M.H., Floudas C.A. Proteins (2005) 58:560–570.[CrossRef][Web of Science][Medline]

Liu R., Baase W.A., Matthews B.W. J. Mol. Biol. (2000) 295:127–145.[CrossRef][Web of Science][Medline]

Ng P.C., Henikoff S. Genome Res. (2001) 11:863–874.[Abstract/Free Full Text]

Ng P.C., Henikoff S. Nucleic Acids Res. (2003) 31:3812–3814.[Abstract/Free Full Text]

Rajendhran J., Gunasekaran P. J. Biosci. Bioeng. (2007) 103:457–463.[CrossRef][Web of Science][Medline]

Ramensky V., Bork P., Sunyaev S. Nucleic Acids Res. (2002) 30:3894–3900.[Abstract/Free Full Text]

Rao S.N., Singh U.C., Bash P.A., Kollman P.A. Nature (1987) 328:551–554.[CrossRef][Medline]

Redfern O., Grant A., Maibaum M., Orengo C. J. Chromatogr. B Analyt. Technol. Biomed. Life Sci. (2005) 815:97–107.[CrossRef][Web of Science][Medline]

Reumers J., Schymkowitz J., Ferkinghoff-Borg J., Stricher F., Serrano L., Rousseau F. Nucleic Acids Res. (2005) 33:D527–D532.[Abstract/Free Full Text]

Reumers J., Maurer-Stroh S., Schymkowitz J., Rousseau F. Bioinformatics (2006) 22:2183–2185.[Abstract/Free Full Text]

Shen B., Vihinen M. Bioinformatics (2003) 19:2161–2162.[Abstract/Free Full Text]

Shen B., Vihinen M. Protein Eng. Des. Sel. (2004) 17:267–276.[Abstract/Free Full Text]

Shi Y.Y., Mark A.E., Wang C.X., Huang F., Berendsen H.J., van Gunsteren W.F. Protein Eng. (1993) 6:289–295.[Abstract/Free Full Text]

Sim J., Kim S.Y., Lee J. Bioinformatics (2005) 21:2844–2849.[Abstract/Free Full Text]

Sobolev V., Sorokine A., Prilusky J., Abola E.E., Edelman M. Bioinformatics (1999) 15:327–332.[Abstract/Free Full Text]

Sobolev V., Eyal E., Gerzon S., Potapov V., Babor M., Prilusky J., Edelman M. Nucleic Acids Res. (2005) 33:W39–W43.[Abstract/Free Full Text]

Stenson P.D., Ball E.V., Mort M., Phillips A.D., Shiel J.A., Thomas N.S., Abeysinghe S., Krawczak M., Cooper D.N. Hum. Mutat. (2003) 21:577–581.[CrossRef][Web of Science][Medline]

Sunyaev S., Ramensky V., Bork P. Trends Genet. (2000) 16:198–200.[CrossRef][Web of Science][Medline]

Sunyaev S., Ramensky V., Koch I., Lathe W. 3rd, Kondrashov A.S., Bork P. Hum. Mol. Genet. (2001) 10:591–597.[Abstract/Free Full Text]

Taverna D.M., Goldstein R.A. J. Mol. Biol. (2002) 315:479–484.[CrossRef][Web of Science][Medline]

Thusberg J., Vihinen M. Hum. Mutat. (2006) 27:1230–1243.[CrossRef][Web of Science][Medline]

Topham C.M., Srinivasan N., Blundell T.L. Protein Eng. (1997) 10:7–21.[Abstract/Free Full Text]

Tuffery P., Etchebest C., Hazout S. Protein Eng. (1997) 10:361–372.[Abstract/Free Full Text]

Valiaho J., Smith C.I.E., Vihinen M. Hum. Mutat. (2006) 27:1209–1217.[CrossRef][Web of Science][Medline]

Vihinen M., Torkkila E., Riikonen P. Proteins (1994) 19:141–149.[CrossRef][Web of Science][Medline]

Wang J.Y., Lee H.M., Ahmad S. Proteins (2007) 68:82–91.[CrossRef][Web of Science][Medline]

Wang L., Veenstra D.L., Radmer R.J., Kollman P.A. Proteins (1998) 32:438–458.[CrossRef][Web of Science][Medline]

Wang L.H., Liu J., Li Y.F., Zhou H.B. Genome Inform (2004) 15:181–190.[Medline]

Wang Z., Moult J. Hum. Mutat. (2001) 17:263–270.[CrossRef][Web of Science][Medline]

Word J.M., Bateman R.C. Jr, Presley B.K., Lovell S.C., Richardson D.C. Protein Sci. (2000) 9:2251–2259.[Web of Science][Medline]

Wright J.D., Lim C. Protein Eng. (2001) 14:479–486.[Abstract/Free Full Text]

Yuan Z., Burrage K., Mattick J.S. Proteins (2002) 48:566–570.[CrossRef][Web of Science][Medline]

Yuan Z. BMC Bioinformatics (2005) 6:248.[CrossRef][Medline]

Yue P., Li Z., Moult J. J. Mol. Biol. (2005) 353:459–473.[CrossRef][Web of Science][Medline]

Zhou H., Skolnick J. Biophys. J. (2007) 93:1510–1518.[CrossRef][Web of Science][Medline]

Received August 28, 2007; revised October 22, 2007; accepted November 22, 2007.





