Protein Engineering, Vol. 12, No. 5, 381-385,
May 1999
© 1999 Oxford University Press
A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm
Faculty of Biology, Department of Cell Biology and Biophysics, University of Athens, Panepistimiopolis, Athens 15701, Greece
| Abstract |
|---|
We present a novel method that predicts transmembrane domains in proteins using solely information contained in the sequence itself. The PRED-TMR algorithm refines a standard hydrophobicity analysis by detecting potential termini (`edges', i.e. starts and ends) of transmembrane regions. This makes it possible both to discard highly hydrophobic regions not delimited by clear start and end configurations and to confirm putative transmembrane segments not distinguishable by their hydrophobic composition. The accuracy obtained on a test set of 101 non-homologous transmembrane proteins with reliable topologies compares well with that of other popular existing methods. Only a slight decrease in prediction accuracy was observed when the algorithm was applied to all transmembrane proteins of the SwissProt database (release 35). A WWW server running the PRED-TMR algorithm is available at http://o2.db.uoa.gr/PRED-TMR/
Keywords: hydrophobicity analysis/membrane proteins/prediction/protein structure/transmembrane regions
| Introduction |
|---|
The prediction of protein structure is still an open problem in molecular biology. Considerable effort has been devoted to transmembrane proteins in particular, because they are involved in a broad range of processes and functions and because their three-dimensional structures are very difficult to solve by X-ray crystallography (Persson and Argos, 1994).
A number of methods or algorithms designed to locate the transmembrane regions of membrane proteins have been developed (von Heijne, 1992; Persson and Argos, 1994; Cserzo et al., 1997). Apparently, in several cases, better results are obtained when extra information coming from multiple alignments of homologous proteins is used (Rost et al., 1993; Persson and Argos, 1994). However, when homologies cannot be found in the databases, improvement of prediction methods using information contained in a protein sequence alone is important.
Prediction methods based on a hydrophobicity analysis can highlight most of the transmembrane regions of a protein (von Heijne, 1992). However, they fail to discriminate perfectly between segments corresponding to real transmembrane parts and simple, highly hydrophobic stretches of residues.
The algorithm presented in this paper refines the information given by a hydrophobicity analysis by detecting favourable patterns that highlight potential termini (starts and ends) of transmembrane regions. Thus, highly hydrophobic stretches of residues that are not delimited by clear start and end configurations can be discarded. Conversely, favourable patterns can reveal some transmembrane regions that are not clearly distinguishable by their hydrophobic composition.
| Methods |
|---|
The aim of a prediction method is to obtain good accuracy when applied to unknown proteins. As emphasized by Rost and Sander (1999), on the basis of two CASP experiments, this objective has not yet been reached. Over-optimistic results of many algorithms are usually due to the use of too small or non-representative data sets.
The PRED-TMR method, presented in this work, is based on a statistical study of transmembrane proteins. Despite the lack of precision and fidelity of SwissProt (Cserzo et al., 1997), we chose to collect the information needed from the whole database instead of using a limited set that may not be statistically representative.
Our method was optimized on a subset of 64 reliable proteins previously used in several prediction programs (Jones et al., 1994; Rost et al., 1995; Aloy et al., 1997) that were available in the public databases (the sequences used and the results obtained are presented on our web site at http://o2.db.uoa.gr/PRED-TMR/Results/). We relied on transmembrane segment topologies indicated in SwissProt release 35 or, when unavailable, in the paper by Rost et al. (1996).
The reliability of predictions was tested on several sets of sequences used for the rating of recent published algorithms. The PRED-TMR algorithm was also applied to the whole SwissProt database.
Information gathering
Some 9392 transmembrane proteins were automatically extracted from the SwissProt database, release 35, based on the presence in the feature table of the `TRANSMEM' keyword. The information relative to the transmembrane regions and their peripheral residues was stored in a database called DB-TMR. This database contains for each transmembrane segment:
- the access code of the sequence containing the segment (ID line);
- the organism classification (OC lines);
- the length of the transmembrane region;
- the direction of the transmembrane segment when it can be deduced from the keywords `CYTOPLASMIC' and `EXTRACELLULAR' of the feature table;
- five amino acid residues (one-letter code) outside the transmembrane region for the N- and C-terminal sides;
- the amino acid residues (one-letter code) of the transmembrane segment.
This information can easily be filtered by organism or transmembrane type in order to refine the statistical analysis. The database and the description of the format used can be downloaded from our web site at http://o2.db.uoa.gr/DB-TMR/.
To minimize the impact of erroneous information, transmembrane segments that extend beyond the end(s) of the sequenced region or with unknown end-points are discarded before the statistical calculations.
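The record layout and the end-point filter described above can be sketched in Python roughly as follows. This is only an illustration: the class name, field names and helper functions (`TMSegmentRecord`, `ends_known`, `reliable`, `by_organism`) are assumptions made for this sketch, and the actual DB-TMR file format is the one documented on the web site.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TMSegmentRecord:
    """One DB-TMR entry; field names are hypothetical, not the real file format."""
    entry_id: str              # identifier of the parent sequence (ID line)
    organism_class: List[str]  # organism classification (OC lines)
    direction: Optional[str]   # orientation, or None when it cannot be deduced
    n_flank: str               # five residues on the N-terminal side
    segment: str               # residues of the transmembrane segment
    c_flank: str               # five residues on the C-terminal side
    ends_known: bool = True    # False for uncertain or out-of-range end-points

    @property
    def length(self) -> int:
        """Length of the transmembrane region in residues."""
        return len(self.segment)


def reliable(records: List[TMSegmentRecord]) -> List[TMSegmentRecord]:
    """Drop segments whose end-points are unknown before computing statistics."""
    return [r for r in records if r.ends_known]


def by_organism(records: List[TMSegmentRecord], taxon: str) -> List[TMSegmentRecord]:
    """Keep only segments whose organism classification contains `taxon`."""
    return [r for r in records if taxon in r.organism_class]
```

Storing the flanks and the segment as separate strings keeps the later border statistics simple, at the cost of duplicating sequence data.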
Distribution of transmembrane segment length
The 40 548 transmembrane segments with reliable end-points contained in DB-TMR have an average length of 21.30 residues and a standard deviation of 2.56 residues. The distribution is sharper than a Gaussian distribution, with 60% of the transmembrane segments having a length of 21 residues and 94% having a length between 17 and 25 residues. A simple approximation of the curve is given by the function
[Equation not preserved in the source.]
Calculation of amino acid residue transmembrane propensities (potentials)
A propensity for each residue to be in a transmembrane region was calculated using the equation
[Equation not preserved in the source.]
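Since the equation itself is not available here, the following Python sketch only illustrates one common way of defining such a propensity: the frequency of a residue type inside transmembrane segments divided by its frequency in the whole database. That ratio, and the function names used, are assumptions of this sketch and may differ from the definition actually used by PRED-TMR.

```python
from collections import Counter
from typing import Dict, Iterable


def residue_frequencies(sequences: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each one-letter residue code over all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues supplied")
    return {aa: n / total for aa, n in counts.items()}


def tm_propensities(tm_segments: Iterable[str],
                    all_sequences: Iterable[str]) -> Dict[str, float]:
    """Assumed definition: P_i = (frequency of i in TM segments) / (frequency of i overall)."""
    f_tm = residue_frequencies(tm_segments)
    f_all = residue_frequencies(all_sequences)
    # Residue types never observed in a transmembrane segment get propensity 0.
    return {aa: f_tm.get(aa, 0.0) / f for aa, f in f_all.items()}
```

Under this assumed definition, a value above 1 means the residue type is over-represented in transmembrane segments relative to the database as a whole.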
Evaluation of the `hydrophobicity' of a sequence of residues
Following a similar, but not identical, definition put forward by Sipos and von Heijne (1993), the table of transmembrane propensities was translated into a new, statistically based, `hydrophobicity' scale defined by
[Equation not preserved in the source.]
The `hydrophobicity' of a sequence of residues from position m to position p is evaluated by
[Equation not preserved in the source.]
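As a rough illustration of how a propensity table can be turned into a scale and applied to a stretch of residues, the sketch below assumes that the scale is the natural logarithm of the propensity and that the value for positions m to p is the plain sum of the per-residue values. Both choices are assumptions made for this example, because the original equations are not available here.

```python
import math
from typing import Dict


def hydrophobicity_scale(propensity: Dict[str, float]) -> Dict[str, float]:
    """Assumed mapping H_i = ln(P_i); residue types with P_i == 0 are left out."""
    return {aa: math.log(p) for aa, p in propensity.items() if p > 0}


def stretch_hydrophobicity(seq: str, m: int, p: int, scale: Dict[str, float]) -> float:
    """Assumed H(m, p): sum of per-residue values over 0-based positions m..p inclusive.

    Residues missing from the scale contribute 0.
    """
    return sum(scale.get(aa, 0.0) for aa in seq[m:p + 1])
```

With a logarithmic scale, residues that favour the membrane contribute positive values and residues that avoid it contribute negative values, so a simple sum already separates hydrophobic from polar stretches.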
Calculation of favourable terminal (end) configurations of transmembrane regions
Favourable configurations are computed for decapeptides centred at the border of transmembrane regions (five residues outside and five residues inside the membrane). Positions in the decapeptide are counted from 0 to 9. For the N-terminal end (side), hereafter also referred to as the `left end', position 0 corresponds to the residue five residues before the first amino acid residue of the transmembrane segment and position 9 corresponds to the residue four residues after it. For the C-terminal end (side), hereafter also referred to as the `right end', position 0 corresponds to the residue five residues after the last amino acid residue of the transmembrane segment and position 9 corresponds to the residue four residues before it (Figure 1).
[Figure 1 not preserved in the source.]
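The position convention described above can be made concrete with a short Python sketch. The function names and the use of 0-based sequence indices are choices made for this example; returning `None` when the border lies too close to a sequence end is likewise an assumption, not a documented behaviour of PRED-TMR.

```python
from typing import Optional


def left_decapeptide(seq: str, first: int) -> Optional[str]:
    """Decapeptide around the N-terminal (`left') border.

    `first` is the 0-based index of the first transmembrane residue.
    Position 0 is five residues before it, position 9 four residues after it.
    """
    lo, hi = first - 5, first + 5
    if lo < 0 or hi > len(seq):
        return None  # border too close to a sequence end
    return seq[lo:hi]


def right_decapeptide(seq: str, last: int) -> Optional[str]:
    """Decapeptide around the C-terminal (`right') border, read from outside to inside.

    `last` is the 0-based index of the last transmembrane residue.
    Position 0 is five residues after it, position 9 four residues before it.
    """
    lo, hi = last - 4, last + 6
    if lo < 0 or hi > len(seq):
        return None  # border too close to a sequence end
    return seq[lo:hi][::-1]
```

Reading the right-hand decapeptide in reverse means that, at both borders, positions 0 to 4 lie outside the membrane and positions 5 to 9 inside it.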
The propensity for an amino acid of type i to appear at position p in the decapeptide is defined by the equation
[Equation not preserved in the source.]
For the N-terminal (`left') side of a transmembrane segment, the propensity $P^{left}_p$ of an amino acid residue, at position p in the sequence, to be the first one in the lipid-associated structure (the first residue of the transmembrane domain) is defined by the equation
[Equation not preserved in the source.]
Similarly, for the C-terminal side (`right') of a transmembrane segment, the propensity for an amino acid at position p to be the first residue outside the transmembrane region is defined by
[Equation not preserved in the source.]
However, using only Pleft propensities to find good `left' configurations (or only Pright to find `right' configurations) is not sufficient, because some decapeptides generate high scores for both. A decapeptide such as `ILFVSTFFTM', for example, has to be discarded: it gives a good Pleft value of 1.75 but also a high Pright value of 2.61.
By looking at the Pleft and Pright values for known transmembrane segments, we found that the scores themselves are less important than the difference between `left' and `right' values.
We combined both propensities to obtain start and end indicators of transmembrane segments using the equations
[Equations not preserved in the source.]
Scoring of transmembrane regions
A well defined transmembrane region should give good scores for all three parameters (LeftInd, RightInd and H). However, when applied to known transmembrane segments, a large proportion scored small values for one or two of these indicators. In most cases, weak indicators are compensated by excellent values obtained for the remaining one(s).
High values can also be obtained for very short or very long segments. These segments of improbable length should be discarded unless the configuration is very clear (when high values are obtained for all three indicators).
We introduce in the scoring formula a negative indicator, which performs a filtering of the probable transmembrane segments depending on their length. This is calculated with
[Equation not preserved in the source.]
Each of the four indicators should contribute with the same weight in the evaluation of the score for a segment. After normalization of the hydrophobicity parameter, the score of a sequence from m to p is calculated by
[Equation not preserved in the source.]
Prediction algorithm
For each position m in the sequence, the maximum score that can be obtained if this position corresponds to the beginning of a transmembrane region is calculated as
[Equation not preserved in the source.]
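The maximization over end positions described in this step might be sketched as follows. The `score` callable stands in for the segment score of the previous section, and the search bounds `min_len` and `max_len` are hypothetical parameters introduced for this example; the source does not state which segment lengths are actually scanned.

```python
from typing import Callable, Optional, Tuple


def max_score_from(m: int,
                   seq_len: int,
                   score: Callable[[int, int], float],
                   min_len: int = 10,
                   max_len: int = 40) -> Optional[Tuple[float, int]]:
    """Return (best score, best end position p) for a segment starting at m.

    Positions are 0-based and p is inclusive. `min_len` and `max_len`
    are hypothetical search bounds, not values taken from the paper.
    Returns None when no candidate end fits inside the sequence.
    """
    best: Optional[Tuple[float, int]] = None
    for p in range(m + min_len - 1, min(m + max_len, seq_len)):
        s = score(m, p)
        if best is None or s > best[0]:
            best = (s, p)
    return best
```

Calling this function for every start position m yields the table of maximum scores and end positions from which segments are then selected.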
For each position, the $MScore_m$ obtained and the corresponding end position are memorized. In the table generated, the highest $MScore_m$ is selected and the corresponding region is marked as transmembrane. Then, the second highest $MScore_m$ that does not overlap with a previously marked region is selected and this process is continued with the next $MScore_m$, until all possible regions are found.
As an example, consider the table of $MScore_m$ values obtained for the segment from residue 276 to residue 325 of 5HT3_MOUSE (Table I). In this table, the program selects the highest $MScore_m$ (89 at position 307) and marks the segment from 307 to 324 as transmembrane. It then looks for the next highest $MScore_m$: 80 at position 310 cannot be selected because this position is part of the first selected transmembrane domain, and 69 at position 303 cannot be selected because it represents a segment that ends at position 321, inside that domain. The next possible $MScore_m$ is 34, at position 282, which represents a transmembrane segment from residue 282 to residue 303. As it is not possible to select a third segment, the program ends. For this region of the protein, with observed (putative) transmembrane segments at 278–296 and 306–324, the algorithm detects two transmembrane domains at 282–303 and 307–324.
[Table I not preserved in the source.]
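The selection procedure walked through above amounts to a greedy choice of non-overlapping segments in decreasing score order. A minimal Python sketch, assuming candidates are given as (start, end, score) tuples with inclusive ends, is shown below. The end position 327 used for the candidate starting at 310 is a made-up placeholder, since the text does not give it, and any score cut-off applied by the real program is left out.

```python
from typing import List, Tuple

Candidate = Tuple[int, int, float]  # (start, end, score), end inclusive


def select_segments(candidates: List[Candidate]) -> List[Candidate]:
    """Greedily keep the best-scoring candidates that do not overlap earlier picks."""
    chosen: List[Candidate] = []
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        # Accept only if the candidate lies entirely before or after every chosen segment.
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append((start, end, score))
    return sorted(chosen)


# Candidates from the 5HT3_MOUSE example; the end position 327 is hypothetical.
example = [(307, 324, 89), (310, 327, 80), (303, 321, 69), (282, 303, 34)]
print(select_segments(example))
```

For these candidates the sketch prints `[(282, 303, 34), (307, 324, 89)]`, matching the two domains reported in the text.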
| Results |
|---|
The predicted transmembrane domains were compared with the experimentally determined topologies calculating for each sequence:
- the percentage of residues predicted correctly (agreement factor), Q, defined by Chou and Fasman (1978);
- the correlation coefficient, C (Fisher, 1958; Matthews, 1975);
- the ratio of segment matches, SM, defined by Cserzo et al. (1997).
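The first two measures in the list above can be computed per residue as in the Python sketch below, which takes one boolean label per residue (transmembrane or not) for the observed and the predicted topology. The correlation coefficient is written in the standard Matthews form; returning 0 when that coefficient is undefined is a convention chosen for this example. The ratio of segment matches is not included, because its exact definition is not restated in the text.

```python
import math
from typing import Sequence, Tuple


def per_residue_scores(observed: Sequence[bool],
                       predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (Q in percent, Matthews correlation C) for per-residue TM labels."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # TM predicted as TM
    tn = sum(1 for o, p in pairs if not o and not p)  # non-TM predicted as non-TM
    fp = sum(1 for o, p in pairs if not o and p)      # non-TM predicted as TM
    fn = sum(1 for o, p in pairs if o and not p)      # TM predicted as non-TM
    q = 100.0 * (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    c = (tp * tn - fp * fn) / denom if denom else 0.0
    return q, c
```

Q alone can look good when most residues are non-membrane, which is why the correlation coefficient is reported alongside it.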
We optimized the hydrophobicity indicator cut-off on a subset of 64 proteins from the set used by Rost et al. (1995); the sequences 2MLT, GLRA_RAT, GPLB_HUMAN, IGGB_STRSP and PT2M_ECOLI, which were not found in the public databases, were not used. The best results were obtained when segments with $NH_{m,p} < 2$ were discarded. On the set of 64 proteins, an agreement factor of 88.24% was obtained, with a correlation coefficient of 0.79 and a ratio of segment matches of 0.945.
In order to test the PRED-TMR algorithm, we collected all available sequences used in three recent papers (Rost et al., 1995, 1996; Cserzo et al., 1997) and discarded those with more than 25% homology. The resulting set contains 101 non-homologous transmembrane proteins in total. Details of the results obtained are not shown here, but they can be downloaded together with the list of the transmembrane segment assignments from http://o2.db.uoa.gr/PRED-TMR/Results/.
The results of the test on this set of 101 proteins gave an average Q of 88.83%, a C of 0.80 and a ratio of segment matches, SM, of 0.954. One protein (1%) has a correlation coefficient <0.4 and 10 have C < 0.6 (10%). These scores are similar to those obtained by excluding the proteins used for the optimization of the hydrophobicity indicator cut-off (Q = 87.81%, C = 0.78 and SM = 0.943).
Table II shows the results produced by applying PRED-TMR and five other prediction methods to the set of 101 proteins. Looking at the correlation coefficient, PRED-TMR was found to perform slightly better than the two best methods, PHDhtm and tmPRED, on this set. Concerning the agreement factor, PRED-TMR performs in a similar way to tmPRED and TOPPRED, whereas for the ratio of segment matches it is slightly worse than PHDhtm, which is best.
[Table II not preserved in the source.]
Despite the errors contained in SwissProt, we think that a comparison between predicted and annotated transmembrane regions over the entire database is worthwhile. It can serve as a common test set for algorithms that predict transmembrane domains.
SwissProt, release 35, contains 9392 transmembrane sequences with a total of 40 672 transmembrane regions. We did not discard the test transmembrane segments with uncertain end-points as we did to establish the statistics. The PRED-TMR algorithm applied to all proteins contained in the SwissProt database produces slightly lower values for the Q and C scores and a larger decrease of the ratio of segment matches (Q = 86.14, C = 0.73, SM = 0.889) relative to the test set of 101 proteins mentioned above. Of the 9392 proteins, 1710 (18%) have C < 0.6.
| Discussion |
|---|
PRED-TMR is a very simple and fast algorithm. It is freely available through the Internet and requires no information other than the protein sequence itself. Its accuracy is comparable to that of the most popular prediction methods.
Since PRED-TMR is very fast and requires only the information contained in the protein sequence, we expect its most promising use to be the analysis of ORFs (open reading frames) predicted by the various genome projects, especially those ORFs that correspond to proteins of unknown function. Aided by a pre-processing stage that identifies whether the sequence under study belongs to a membrane protein, it will be useful in the recognition of transmembrane domains. Such a pre-processing stage is well under way in our laboratory (C.Pasquier and J.S.Hamodrakas, in preparation). It is a neural network-based system which classifies proteins into four classes: fibrous (structural), globular, mixed (fibrous and globular) and membrane. The PRED-TMR algorithm has already been applied to the ORFs predicted from two genome projects and these results are currently being studied in detail.
PRED-TMR can certainly be improved by carefully selecting a representative and reliable set of transmembrane proteins to build the different tables. Ambiguities and errors in the existing databases limit its accuracy. When the statistical parameters used in the scoring formula were derived from the set of 64 proteins used to optimize the hydrophobicity cut-off, instead of from the entire SwissProt database, the accuracy scores decreased as soon as the algorithm was applied to sets larger than those 64 proteins. This is certainly due to the small size of the reference set and reflects some special features of its sequences. However, we believe that the most promising way to improve the accuracy of prediction is to alter the scoring formula. Indeed, we found that the length penalty used is not the most appropriate, because it penalizes too harshly segments with a length outside the [17–25] range. Several other parameters could be added to the scoring formula, such as the positive inside rule defined by von Heijne (1992). Nevertheless, we are convinced that this kind of algorithm will always be limited by the use of a strict cut-off on the hydrophobicity indicator. Fuzzy logic seems to be a good technique for overcoming this limitation, because it introduces some haziness into the decision making.
A WWW server running the PRED-TMR algorithm is available at http://o2.db.uoa.gr/PRED-TMR/.
| Acknowledgments |
|---|
The authors gratefully acknowledge the support of the EEC-TMR `GENEQUIZ', grant ERBFMRXCT960019.
| Notes |
|---|
1 To whom correspondence should be addressed. E-mail: shamodr{at}atlas.uoa.gr
| References |
|---|
Aloy,P., Cedano,J., Olivia,B., Aviles,X. and Querol,E. (1997) CABIOS, 13(3), 231–234.
Chou,P.Y. and Fasman,G.D. (1978) Adv. Enzymol., 47, 45–148.
Cserzo,K., Wallin,E., Simon,I., von Heijne,G. and Elofsson,A. (1997) Protein Engng, 10, 673–676.
Fisher,R.A. (1958) Statistical Methods for Research Workers. 13th edn. Hafner, New York, p. 183.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1994) Biochemistry, 33, 3038–3049.
Matthews,B.W. (1975) Biochim. Biophys. Acta, 405, 442–451.
Persson,B. and Argos,P. (1994) J. Mol. Biol., 237, 182–192.
Rost,B. and Sander,C. (1999) In Webster D.M. (ed.), Predicting Protein Structure. Humana Press, Clifton, NJ, in press. http://www.embl-heidelberg.de/rost/Papers/98revSecStr.html.
Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.
Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1995) Protein Sci., 4, 521–533.
Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 1704–1718.
Sipos,L. and von Heijne,G. (1993) Eur. J. Biochem., 213, 1333–1340.
von Heijne,G. (1992) J. Mol. Biol., 225, 487–494.
Received September 29, 1998; revised January 22, 1999; accepted January 26, 1999.