Protein Engineering vol. 16 no. 7 pp. 479-488, 2003
© 2003 Oxford University Press
Identification of transmembrane protein functions by binary topology patterns
1Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, Hirosaki 036-8561, Japan 2Present address: Graduate School of Humanity and Science, Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan
3 To whom correspondence should be addressed. e-mail: slsimi{at}si.hirosaki-u.ac.jp
Abstract
We propose a novel method for identifying and classifying the functions of transmembrane (TM) proteins based on their TM topology [the number of TM segments (tms), the loop length and the N-terminus location]. In this method, the TM topology is expressed as a string of 0 and 1, and this is designated the binary topology pattern (BTP). We focused on TM proteins with up to 12 tms, with the exception of 1 and 9 tms, and classified them into 37 functional groups by the number of tms and the functional annotation. These grouped TM protein sequences were used to determine BTPs which are specific to the individual functional groups. Since the evaluated accuracies (sensitivity, specificity and self-consistency) of these patterns in functional identification were quite high overall, i.e. 0.940, 0.934 and 0.935, respectively, as averaged over the 37 functional groups, we confirmed that TM protein function can be identified by the number of tms and the characteristics of loop lengths, i.e. BTPs.
Keywords: binary topology pattern/functional identification/loop length/transmembrane protein/transmembrane topology
Introduction
Recent studies have revealed that the fraction of transmembrane (TM) proteins in the proteome is almost constant, 20–30%, irrespective of the diverse range of organisms and genome sizes (Jones, 1998). Of these TM proteins, 70% in individual TM proteomes still remain functionally unknown or are not yet well annotated (M.Arai and T.Shimizu, manuscript in preparation). This is a much higher rate than for soluble proteins, e.g. <30% in the case of Escherichia coli (Serres et al., 2001).
At the same time, this rather simple structural feature is making the prediction of the secondary structure (TM topology, i.e. the number of tms + loop lengths + N-tail location) from the amino acid sequence an easier task for TM proteins than for soluble proteins. In this context, many TM topology prediction methods have been proposed so far (e.g. Claros and von Heijne, 1994; Jones et al., 1994; Rost et al., 1996; Hirokawa et al., 1998; Sonnhammer et al., 1998; Tusnady and Simon, 1998), although their prediction accuracy is not yet high enough (Moeller et al., 2001; Chen et al., 2002; Ikeda et al., 2002). In order to obtain predictions of even higher accuracy in practice, several consensus approaches have recently been tried by combining several of the proposed prediction methods (Promponas et al., 1999; Nilsson et al., 2000, 2002; Bertaccini and Trudell, 2002; Ikeda et al., 2002, 2003; Kall and Sonnhammer, 2002).
One of the reasons why so much effort has been made in developing TM topology prediction methods is that there is a good possibility of classifying and identifying the functions of TM protein sequences from knowing their accurate TM topologies. For example, Tusnady et al. (Tusnady et al., 1997) suggested that 12-tms ABC transporter proteins are characterized by a specific and common TM topology pattern and that TM topology pattern analysis may significantly help the search for characteristic domains, in addition to sequence comparisons. From their analysis of four-tms receptors and channel proteins, Clements and Martin (Clements and Martin, 2002) recently proposed a new idea for the functional identification of TM proteins by searching for characteristic patterns in the hydropathy profiles. It has also been reported that the intracellular second and fourth loops of G-protein coupled receptors (GPCRs) are short and their lengths are strongly conserved, while the intracellular sixth loop, which is quite long, varies widely in length (Otaki and Firestein, 2001). The authors indicated the possibility of classifying the GPCR functions according to the loop lengths. From these findings, it seems that the TM topology has been conserved to preserve the function of the TM protein in the evolutionary process more rigorously than the amino acid sequence.
In this study, we propose a novel method for classifying/identifying TM protein functions based on the TM topology, i.e. the length characteristics of the loops. In this method, the length of each loop is expressed as 1 or 0, depending on whether it is longer or shorter, respectively, than the threshold length defined for each loop, and then the TM topology is treated as a string of 0 and 1, which is named the binary topology pattern (BTP).
Materials and methods
Functional groups of TM proteins
The data used in this study are TM protein sequences taken from SwissProt 38.0 (Bairoch and Apweiler, 2000). Excluding the TM protein entries with a partly defined sequence (i.e. a fragment) and those with an unknown N-terminus location, we finally obtained 4348 sequences with numbers of tms from one to 30. We focused on 2097 entries with 2–12 tms, with the exception of nine tms, which were classified into 37 functional groups, including 10 others groups, according to the functional descriptions in the DE, CC or KW lines of the SwissProt database, as summarized in Table I. TM proteins annotated with a probable, putative or hypothetical function were included only in the others group.
As the lengths of the tms described in SwissProt were not constant and varied from entry to entry (i.e. from 10 to 30 residues), we normalized the tms length to 21 residues for all the tms, by expanding 10 residues in both the N- and C-terminal directions from the center position of original tms. The signal peptides were removed from the sequences in advance of the following analysis.
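The normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `normalize_tms` is hypothetical, positions are assumed to be 1-based and inclusive, and the center of an even-length segment is assumed to round down, since the text does not say how that case was handled.

```python
def normalize_tms(start, end, half_width=10):
    """Return a 21-residue segment centred on the annotated one.

    Positions are 1-based and inclusive. Assumption: for even-length
    segments the centre is rounded down; the source does not specify this.
    """
    center = (start + end) // 2
    return center - half_width, center + half_width


# A hypothetical 17-residue annotated segment spanning residues 25..41.
new_start, new_end = normalize_tms(25, 41)
print(new_start, new_end, new_end - new_start + 1)
```

The sketch does not clamp the result to the sequence boundaries or resolve overlaps between neighbouring segments; a real pipeline would need to decide how to treat those cases.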
It should be noted here that 26 entries included in the three-tms glutamate receptor group (35 entries in total are contained, see Table I) are registered in SwissProt 38.0 originally as four-tms glutamate receptor. Following the reports that the second tms in previously proposed topology models does not span the membrane but is a membrane pore-lining loop (Hollmann et al., 1994; Anand, 2000), we decided to treat the 26 entries as a three-tms glutamate receptor in this study without changing the annotated N-tail location and segment positions for the remaining three segments.
The list of classified TM proteins used in this study is available at ftp://bioinfo.si.hirosaki-u.ac.jp/BTP/.
Binary topology pattern (BTP)
Consider an amino acid sequence belonging to a certain functional group of a TM protein with n-tms. Let li denote the length of the ith loop (1 ≤ i ≤ n + 1). Here, l1 means the length of the N-tail loop. Next, we define the threshold length of the ith loop, lti, to be compared with li in order to assign a binary loop length, bi, to the ith loop by using the following criteria:

$$
b_i = \begin{cases} 1 & (l_i \ge lt_i) \\ 0 & (l_i < lt_i) \end{cases}
$$
Here, 1 means that the loop is a long one, and 0 a short one. For example, for the case of a four-tms gap junction [gap junction protein CX32.2, SwissProt ID CX32_MICUN (Yoshizaki et al., 1994)] with the loop lengths l = {18, 36, 55, 20, 71} residues, the binary loop lengths are determined as b = (0, 1, 1, 0, 1) with lt = {47, 30, 28, 80, 42} residues, as illustrated in Figure 1.
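The gap junction example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `binary_loop_lengths` is hypothetical, and it assumes that a loop exactly as long as its threshold counts as long.

```python
def binary_loop_lengths(loop_lengths, thresholds):
    """Assign 1 (long) or 0 (short) to each loop.

    Assumption: a loop whose length equals the threshold is treated as long.
    """
    if len(loop_lengths) != len(thresholds):
        raise ValueError("one threshold is needed per loop")
    return tuple(1 if length >= threshold else 0
                 for length, threshold in zip(loop_lengths, thresholds))


# Loop lengths and thresholds quoted in the text for CX32_MICUN.
l = [18, 36, 55, 20, 71]
lt = [47, 30, 28, 80, 42]
print(binary_loop_lengths(l, lt))  # (0, 1, 1, 0, 1)
```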
Next, we calculate the average binary loop length, a, for every functional group by averaging b across all the entries, i.e.

$$
a_i = \frac{1}{N} \sum_{k=1}^{N} b_i^{(k)}
$$

where N is the number of entries contained in the functional group and b_i^{(k)} is the binary loop length of the ith loop in the kth entry. The lengths of the individual loops vary from sequence to sequence even within a single functional group, although the degree of variation is different from loop to loop, as seen in Figure 2. The average binary loop lengths of the first loop (N-tail loop) and the second loop (1–2 loop) change rapidly from 1.0 to 0.0 with an increase of lti in narrow ranges around 20 and 35 residues, respectively, indicating that the loop lengths are quite close to each other. On the contrary, the lengths of the third loop (2–3 loop), fourth loop (3–4 loop) and fifth loop (C-tail) are much more divergent, the fifth loop in particular.
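As a small illustration of the averaging step, the sketch below computes a for one group. The function name is hypothetical, and all entries except the first (the CX32_MICUN lengths quoted earlier) are invented purely for demonstration; a loop equal to its threshold is again assumed to count as long.

```python
def average_binary_loop_lengths(group_loop_lengths, thresholds):
    """Average the per-entry binary loop lengths over a functional group."""
    n_entries = len(group_loop_lengths)
    sums = [0] * len(thresholds)
    for lengths in group_loop_lengths:
        for i, (length, threshold) in enumerate(zip(lengths, thresholds)):
            # Assumption: length == threshold is treated as long (1).
            sums[i] += 1 if length >= threshold else 0
    return [total / n_entries for total in sums]


lt = [47, 30, 28, 80, 42]
group = [
    [18, 36, 55, 20, 71],   # CX32_MICUN, from the text
    [20, 35, 60, 22, 40],   # hypothetical entry
    [17, 36, 52, 19, 90],   # hypothetical entry
    [19, 37, 58, 21, 120],  # hypothetical entry
]
print(average_binary_loop_lengths(group, lt))  # [0.0, 1.0, 1.0, 0.0, 0.75]
```

Only the C-tail loop of the second invented entry falls below its threshold, so the last component becomes 3/4 while the others stay at exactly 0 or 1.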
Then, we calculate the root mean square (r.m.s.) difference, di, of the ith average binary loop length among the functional groups, by using the following equation:
where m is the number of the functional groups with n-tms, and api and aqi are the ith average binary loop lengths of functional groups p and q, respectively. In Figure 3, the relationships of the r.m.s. difference, di, versus the threshold length, lti, are shown for individual loops. For each loop, the threshold length giving the maximum value of the r.m.s. difference is considered to be the optimum threshold length, at which the average binary loop lengths calculated for the respective groups are most clearly discriminated from each other. The threshold lengths were obtained, in this example, as 44–50, 29–31, 27–29, 80 and 42 residues for the first, second, third, fourth and fifth loops, respectively. For the first, second and third loops, whose threshold lengths were not determined uniquely, we adopted the average value of each range as the optimum threshold length. This seems to be a proper treatment, since a did not change when the threshold length was varied within these ranges (44–50, 29–31 and 27–29 residues). This is not the case for the fourth and fifth loops, which have uniquely determined threshold lengths: only a small deviation (even one residue) from the obtained threshold lengths (i.e. 80 and 42 residues, respectively) alters the average binary loop lengths, a. When we take 39 or 43 residues (instead of 42) as the threshold length for the fifth loop, for example, a5 becomes 0.99 or 0.92 (instead of 0.94 for 42 residues). Thus, 47, 30, 28, 80 and 42 are obtained as the optimum threshold lengths for individual loops in the ensemble of the four-tms functional groups, and a for the gap junction group, for example, is calculated as (0.00, 1.00, 1.00, 0.00, 0.94) with lt = {47, 30, 28, 80, 42}.
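The threshold search for a single loop might look like the sketch below. Several points are assumptions rather than details taken from the paper: the r.m.s. is taken as the root of the mean squared difference over all unordered pairs of groups, tied thresholds are averaged as the text describes for non-unique optima, and the loop lengths are invented. All function names are hypothetical.

```python
from itertools import combinations
from math import isclose, sqrt


def average_bit(lengths, threshold):
    """Fraction of entries whose loop is at least `threshold` residues long."""
    return sum(1 for length in lengths if length >= threshold) / len(lengths)


def rms_difference(groups, threshold):
    """R.m.s. difference of the average bit between functional groups.

    Assumption: the mean runs over the m(m-1)/2 unordered group pairs.
    """
    averages = [average_bit(lengths, threshold) for lengths in groups]
    pairs = list(combinations(averages, 2))
    return sqrt(sum((p - q) ** 2 for p, q in pairs) / len(pairs))


def optimum_threshold(groups, candidates):
    """Return the mean of the thresholds that maximise the r.m.s. difference."""
    scores = [(t, rms_difference(groups, t)) for t in candidates]
    best = max(score for _, score in scores)
    tied = [t for t, score in scores if isclose(score, best)]
    return sum(tied) / len(tied), best


# Hypothetical lengths of one loop in three functional groups.
groups = [[18, 20, 17], [60, 55, 70], [25, 90, 30]]
threshold, score = optimum_threshold(groups, range(1, 101))
print(threshold, round(score, 3))
```

With these invented data every threshold from 21 to 25 separates the first group from the other two equally well, so the sketch returns their mean, mirroring how a range such as 44–50 is reduced to 47 in the text. Averaging tied candidates only makes sense when they form one contiguous range.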
As seen in the example of a above, each ai value is not necessarily just 0.0 or 1.0, i.e. it can be larger than 0.0 and smaller than 1.0. We therefore define the permission (a non-negative value smaller than 0.5), with which the average binary loop lengths, a, are binarized to obtain the BTP, p, by applying the following criteria: pi is 1 when ai lies within the permission of 1.0, pi is 0 when ai lies within the permission of 0.0, and pi is * otherwise, where * is the wild card meaning that the binary loop length is not defined for the ith loop. When we set the permission to 0.01, for example, the BTP, p, for the gap junction group becomes 0110*.
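The binarization step can be sketched as below. The function name `binarize` is hypothetical, and the treatment of values lying exactly on a boundary (taken here as inside the permitted band) is an assumption, since the criteria are not fully recoverable from the text.

```python
def binarize(average_lengths, permission):
    """Turn average binary loop lengths into a BTP string of 0, 1 and *.

    Assumption: values exactly `permission` away from 0.0 or 1.0 still
    count as 0 or 1, respectively.
    """
    if not 0 <= permission < 0.5:
        raise ValueError("permission must be non-negative and below 0.5")
    symbols = []
    for a in average_lengths:
        if a >= 1 - permission:
            symbols.append("1")
        elif a <= permission:
            symbols.append("0")
        else:
            symbols.append("*")  # wild card: bit left undefined
    return "".join(symbols)


# Average binary loop lengths of the gap junction group quoted in the text.
a = [0.00, 1.00, 1.00, 0.00, 0.94]
print(binarize(a, 0.01))  # 0110*
```

A larger permission leaves fewer wild cards: with a permission of 0.1 the same averages would give 01101, because 0.94 then lies within the band around 1.0.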
An appropriate permission value should be assigned to each functional group so that the obtained BTP has the maximal self-consistency in identifying its relevant function. The self-consistency of the functional identification by the BTP, Sc, is defined as the geometric mean of the sensitivity, Sn, and the specificity, Sp:

$$
S_c = \sqrt{S_n \times S_p}
$$

Here, the sensitivity and the specificity are the ratios of the correctly identified entries to the total entries in the group and to the total predicted entries across the functional groups with the same n-tms, respectively.
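The three accuracy measures are simple to compute. The sketch below uses hypothetical function names and checks the geometric mean against two sensitivity and specificity pairs reported in the Results (the two-tms others group and the four-tms receptor group).

```python
from math import sqrt


def sensitivity(correct, group_size):
    """Correctly identified entries over all entries in the group."""
    return correct / group_size


def specificity(correct, predicted_total):
    """Correctly identified entries over all entries the pattern picked up."""
    return correct / predicted_total


def self_consistency(sn, sp):
    """Geometric mean of sensitivity and specificity."""
    return sqrt(sn * sp)


print(round(self_consistency(0.514, 0.957), 3))  # 0.701
print(round(self_consistency(0.954, 1.000), 3))  # 0.977
```

Note that the specificity defined here is what is often called precision elsewhere; the sketch follows the definition given in the text.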
Figure 4 shows how Sc varies with the permission for the case of the four-tms ensemble. The self-consistencies increase at first in a range of small permission values, and then decrease with increasing permission, except for the receptor group. It is reasonable to employ the smallest permission value that gives the maximum Sc as the appropriate one for each functional group. Thus, the permission values determined for receptor, gap junction and others groups are 0.04, 0.01 and 0.16, respectively, which give the maximum values of Sc to their corresponding BTPs: 10010, 0110* and 0*0**, respectively. We note that all the patterns thus obtained are exclusive of one another: the binary digit is discrepant in four positions (all except the last position) between receptor and gap junction, in the first position between receptor and others, and in the third position between gap junction and others. The BTPs determined with these lt and permission values are expected to be exclusive of each other, so that the individual patterns can identify the corresponding functional groups discriminatively from each other. This means that the appropriate BTPs are determined successfully with these parameter values.
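Exclusivity between two BTPs can be checked mechanically: two patterns are exclusive when at least one position carries a defined digit in both and the digits differ. The sketch below, with hypothetical function names, applies this to the three four-tms patterns quoted in the text.

```python
from itertools import combinations


def discrepant_positions(p, q):
    """1-based positions where both patterns are defined and disagree."""
    return [i + 1 for i, (x, y) in enumerate(zip(p, q))
            if x != y and "*" not in (x, y)]


def are_exclusive(p, q):
    """True when no sequence can match both patterns."""
    return bool(discrepant_positions(p, q))


patterns = {"receptor": "10010", "gap junction": "0110*", "others": "0*0**"}
for (name_p, p), (name_q, q) in combinations(patterns.items(), 2):
    print(name_p, "vs", name_q, discrepant_positions(p, q))
```

The reported positions agree with the description above: four positions for receptor versus gap junction, the first for receptor versus others, and the third for gap junction versus others.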
Since the number of members contained in each functional group differs from group to group within each ensemble of the same n-tms, in some cases largely (e.g. in the seven-tms ensemble), it might be unfair to use the original group sizes themselves in the calculation of Sp and Sc. The accuracies should therefore be calculated on an equal-membership basis (i.e. percentage basis).
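One way to read the equal-membership treatment is to replace raw hit counts by per-group hit fractions before computing the specificity, so that a large group cannot dominate the denominator. The sketch below follows that reading; it is an interpretation rather than a confirmed detail of the method, and the group names, sizes and hit counts are invented.

```python
def specificity_raw(hits, target):
    """Specificity from raw counts: correct hits over all hits."""
    return hits[target] / sum(hits.values())


def specificity_equal_membership(hits, sizes, target):
    """Specificity after scaling each group's hits by its group size.

    Assumption: 'percentage basis' means every group is weighted as if it
    had the same number of members.
    """
    rates = {group: hits.get(group, 0) / size for group, size in sizes.items()}
    return rates[target] / sum(rates.values())


# Hypothetical ensemble: a large group A and a small group B.
sizes = {"A": 200, "B": 20}
hits_by_pattern_b = {"A": 10, "B": 18}  # entries picked up by B's pattern

print(round(specificity_raw(hits_by_pattern_b, "B"), 3))
print(round(specificity_equal_membership(hits_by_pattern_b, sizes, "B"), 3))
```

In this invented case the ten false hits from the large group weigh heavily on the raw specificity, although they amount to only 5% of that group; the scaled version reflects this.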
Results and discussion
The BTPs obtained for the individual functional groups and the identification accuracies are summarized in Tables II–XI, together with the determined values of the parameters, lt and the permission.
Two-tms TM proteins
As seen in Table II, the five functional groups, including others, are discriminated from each other with high accuracies (0.938, 0.929, 0.779, 0.790 and 0.701 for potassium channel, sodium channel, receptor, sensor protein and others, respectively) by using the obtained BTPs. The obtained BTPs, even for the others group, are exclusive of one another in at least one digit position. The first position distinguishes the two channel groups from receptor and sensor protein, indicating that the first loops are long (i.e. ≥41 residues) for both the channels and short (<41 residues) for receptor and sensor protein. With the channels, the second and third loops characterize the two types complementarily: short (<151 residues, 0) and long (≥209 residues, 1) for the potassium channel, and long (≥151, 1) and short (<209, 0) for the sodium channel. Similarly, the second loop makes a distinction between receptor and sensor protein: long (≥151 residues, 1) for the former and short (<151, 0) for the latter.
The pattern of the others group has a rather lower accuracy of 0.701 (sensitivity, 0.514 and specificity, 0.957) compared with the other patterns. We do not, however, need to use the others pattern for its function identification in actual applications, since we can put the two-tms TM protein sequences, not identified by any of the patterns of potassium channel, sodium channel, receptor or sensor protein into the others group, without applying the others pattern itself.
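The fallback strategy described above can be sketched as a small classifier. The thresholds (41, 151 and 209 residues) and the defined digits come from the two-tms discussion in the text; the wild cards in the third position of the receptor and sensor protein patterns are placeholders, because the text does not state those digits. Function names and the example loop lengths are hypothetical.

```python
def to_bits(loop_lengths, thresholds):
    """Binary loop lengths as a string; length == threshold counts as long."""
    return "".join("1" if length >= threshold else "0"
                   for length, threshold in zip(loop_lengths, thresholds))


def matches(bits, pattern):
    """A pattern matches when every defined digit agrees with the bits."""
    return all(p == "*" or p == b for p, b in zip(pattern, bits))


def classify(loop_lengths, thresholds, patterns, fallback="others"):
    """Return the single matching group, or the fallback group otherwise."""
    bits = to_bits(loop_lengths, thresholds)
    hits = [name for name, pattern in patterns.items() if matches(bits, pattern)]
    return hits[0] if len(hits) == 1 else fallback


lt_two_tms = [41, 151, 209]
patterns_two_tms = {
    "potassium channel": "101",
    "sodium channel": "110",
    "receptor": "01*",        # third digit not given in the text
    "sensor protein": "00*",  # third digit not given in the text
}

print(classify([60, 100, 250], lt_two_tms, patterns_two_tms))
print(classify([50, 200, 300], lt_two_tms, patterns_two_tms))
```

The first invented sequence has a long, short, long loop profile and is assigned to the potassium channel group; the second is long in all three loops, matches none of the four patterns, and therefore falls into others without the others pattern ever being applied.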
Three-tms TM proteins
The BTPs obtained for the four functional groups, except for the others, are exclusive of each other and identify their respective sequences with quite high self-consistencies: glycoprotein, 0.966; glutamate receptor, 1.000; fumarate reductase, 0.957; kinase, 0.890, as shown in Table III. The exclusive digits are the first position with 1 for glycoprotein and glutamate receptor, and 0 for fumarate reductase, kinase and others. Thus, we could successfully perform functional identification of three-tms TM proteins using the obtained patterns, except the others group. The BTP obtained for glutamate receptor gives the perfect identification accuracy, with all the loops being long, in particular, the third loop, which is distinct from other groups with a short third loop. Similar to the case of two-tms TM proteins, we do not need to use the others pattern to identify others protein sequences in this case.
Four-tms TM proteins
As shown in Table IV, the BTP of receptor identifies only the sequences of the receptor group with high sensitivity, 0.954, and specificity, 1.000 (self-consistency, 0.977). It should be noted that 146 entries identified by the receptor pattern belong to the ligand-gated ionic channels family with the N-out location, while the seven other entries do not. With the gap junction group, the obtained pattern identifies all the gap junction entries correctly, and only one others entry, in error. Furthermore, the identification accuracy of others is also still high enough, 0.904 (0.902), in contrast with the low accuracy in the cases of two-tms and three-tms TM proteins. The obtained patterns are exclusive of each other with these lt and permission values, so that the individual patterns can identify the corresponding functional groups discriminatively from each other.
Five-tms TM proteins
In the five-tms transporter data set, various kinds of transporters are included, such as triose phosphate/phosphate translocator, cytochrome o ubiquinol oxidase subunit III, histidine transport system permease protein, etc., and there is a wide variety in the length of each loop, except for the fifth and sixth loops. This is reflected in the obtained pattern for the transporter group, in that only these two positions have a defined binary loop length and the others do not. Nevertheless, we can classify the five-tms protein sequences into two groups, transporter and others with high enough accuracies, 0.964 and 0.966, respectively, as shown in Table V.
Six-tms TM proteins
The BTPs obtained with lt = {100, 15, 24, 11, 14, 38, 72} for the three six-tms functional groups show that channel, MIP channel and transporter are exclusive of each other, and their self-consistencies are 0.894, 0.934 and 0.849, respectively (Table VI). The MIP channel and transporter patterns each identify only one others entry, even though neither pattern is explicitly exclusive of the others pattern. By comparing the MIP channel and channel patterns, it can be seen that not only the long N-tail but also the long 4–5 and C-tail loops distinguish channel from MIP channel. Since the performance of the others pattern is not high enough, it is not necessary to actually use this pattern in the six-tms case as well.
Seven-tms TM proteins
All the BTPs obtained are exclusive of one another, except for the cases between GPCR class A and others, and rhodopsin pump and others (Table VII). Except for the others pattern, the accuracies of the obtained patterns are quite high, for class C, class E and rhodopsin pump, in particular, which identify themselves perfectly without identifying any entries of other groups. Here, using 22 GPCR sequences which are registered in SwissProt 38.0 but were not used for determining the patterns, we tested the functional identification performance of the obtained patterns. Applying the class A pattern to these sequences, we identified 19 entries as GPCR class A, which are Burkitt's lymphoma receptor, chemokine receptor-like protein, olfactory receptor-like protein, etc. Out of these 19 sequences, we confirmed that 13 belong to GPCR class A. The class B pattern identified two sequences, which are glucagon-like peptide 1 receptor precursor of GPCR class B.
Eight-tms TM proteins
Similar to the five-tms case, the eight-tms transporter group is a mixture of various kinds of transporters, such as calcium-transporting ATPase, potassium-transporting ATPase, renal sodium-dependent phosphate transporting protein, etc. As a result of this, the obtained BTPs are not exclusive of each other and are rather ambiguous. The discrimination ability of the transporter pattern, however, of 0.874 is still high enough, as shown in Table VIII, since 52 transporters out of 68 sequences are picked up by this pattern.
10-tms TM proteins
As illustrated in Table IX, the BTPs for ATPase, transporter, exchanger and others groups have high self-consistencies, 1.000, 0.949, 0.966 and 1.000, respectively. This result means that we can accurately classify 10-tms TM proteins into four functional groups, at least. Even looking at the patterns in Table IX, we can understand that each group has its special features for the lengths of the loops. For example, we observe that almost all the odd number loops of ATPase are long, except for the last one. In particular, the 4–5 loop is longer than 199 residues, and such a long loop is not shown in the other 10-tms TM proteins. The transporter has short N-tail and 2–3 loops, and these characteristics are exclusive of ATPase. For exchanger, we determined the pattern at all positions, except the 6–7 and 8–9 loops, in spite of the small permission value.
11-tms TM proteins
By using the obtained BTPs, 11-tms TM protein sequences can be classified into two functional groups, exchanger and others with perfect accuracies, as seen in Table X. The 11-tms exchanger TM proteins are characterized by an extremely long sixth (5–6) loop and quite short seventh (6–7) and eighth (7–8) loops.
12-tms TM proteins
The BTPs for sodium transporter, sugar transporter and ABC transporter have 0.984, 0.949 and 0.923 sensitivity and 0.867, 0.838 and 1.000 specificity, respectively, as shown in Table XI. The three transporter patterns are exclusive of each other and identified only a few others entries. We note that the sugar transporter and sodium transporter patterns identified a fair number of entries of the others group in error (i.e. 13 and 8 entries, respectively). It seems that a number of transporter sequences are included in SwissProt without being given a functional annotation of the transporter.
Conclusions
Taken together, the obtained BTPs have high accuracies for consistently identifying the entries of individual functions: the sensitivity, specificity and self-consistency are 0.898, 0.897 and 0.893, respectively, averaged over the 37 functional groups including the others group, and 0.940, 0.934 and 0.935, respectively, over the 27 functional groups without the others group.
We did not use the information of the N-tail location in this methodology, as some functional groups contain both entries with different N-tail locations, although it is only a small fraction. Incorporating the N-tail location information into the BTP, after improving the prediction performance of the N-tail location, may help to further improve the ability of BTPs in functional classification/identification.
As seen in Table I, some functional groups, i.e. the four transporter groups and the two-tms receptor group comprise both eukaryotic and prokaryotic sequences. Nevertheless, the individual BTPs determined for these groups exhibit quite high identification accuracies, indicating that the TM topologies with the same function have been well conserved between prokaryotes and eukaryotes.
We did not deal with single-spanning TM proteins in this study. Since a single-spanning TM protein has only two loops, at most four BTPs are available, which is too few to classify all of the single-spanning proteins. This will be overcome, however, by applying this method in a stepwise manner, where classification into a few unified groups is performed at first, followed by subdivision into several lower-level subgroups within the individual upper-level groups. This stepwise approach is also applicable to the functional classification of multi-spanning proteins that have a deep hierarchical class structure, such as GPCR (Y.Inoue and T.Shimizu, manuscript in preparation).
Finally, we would like to point out that the TM topology pattern is available not only for functional classification/identification, but also for picking out the loops that seem to make the functional differences among the groups in the ensemble with the same n-tms, as already mentioned.
Acknowledgements
This research was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas (C) Genome Information Science (No. 15014203) and a Grant-in-Aid for Scientific Research (C) (No. 14580665) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References
Anand,R. (2000) Biochem. Biophys. Res. Commun., 276, 157–161.
Arai,M., Ikeda,M. and Shimizu,T. (2002) Gene, 304, 77–86.
Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45–48.
Bertaccini,E. and Trudell,J.R. (2002) Protein Eng., 15, 443–453.
Chen,C.P., Kernytsky,A. and Rost,B. (2002) Protein Sci., 11, 2774–2791.
Claros,M.G. and von Heijne,G. (1994) Comput. Appl. Biosci., 10, 685–686.
Clements,J.D. and Martin,R.D. (2002) Eur. J. Biochem., 269, 2101–2107.
Hirokawa,T., Boon-Chieng,S. and Mitaku,S. (1998) Bioinformatics, 14, 378–379.
Hollmann,M., Maron,C. and Heinemann,S. (1994) Neuron, 13, 1331–1343.
Ikeda,M., Arai,M., Lao,D.M. and Shimizu,T. (2002) In Silico Biol., 2, 19–33.
Ikeda,M., Arai,M., Okuno,T. and Shimizu,T. (2003) Nucleic Acids Res., 31, 406–409.
Jones,D.T. (1998) FEBS Lett., 423, 281–285.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1994) Biochemistry, 33, 3038–3049.
Kall,L. and Sonnhammer,E.L.L. (2002) FEBS Lett., 532, 415–418.
Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L.L. (2001) J. Mol. Biol., 305, 567–580.
Liu,J. and Rost,B. (2001) Protein Sci., 10, 1970–1979.
Mitaku,S., Ono,M., Hirokawa,T., Boon-Chieng,S. and Sonoyama,M. (1999) Biophys. Chem., 82, 165–171.
Moeller,S., Croning,M.D.R. and Apweiler,R. (2001) Bioinformatics, 17, 646–653.
Nilsson,J., Persson,B. and von Heijne,G. (2000) FEBS Lett., 486, 267–269.
Nilsson,J., Persson,B. and von Heijne,G. (2002) Protein Sci., 11, 2974–2980.
Otaki,J.M. and Firestein,S. (2001) J. Theor. Biol., 211, 77–100.
Promponas,V.J., Palaios,G.A., Pasquier,C.M., Hamodrakas,J.S. and Hamodrakas,S.J. (1999) In Silico Biol., 1, 159–162.
Rost,B., Casadio,R. and Fariselli,P. (1996) In States,D.T., Agarwal,P., Gaasterland,T., Hunter,L. and Smith,R.F. (eds), Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 192–200.
Serres,M.H., Gopal,S., Nahum,L.A., Liang,P., Gaasterland,T. and Riley,M. (2001) Genome Biol., 2, research0035.1–0035.7.
Sonnhammer,E.L., von Heijne,G. and Krogh,A. (1998) In Glasgow,J., Littlejohn,T., Major,F., Lathrop,R., Sankoff,D. and Sensen,C. (eds), Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 175–182.
Stevens,T.J. and Arkin,I.T. (2000) Proteins: Struct. Funct. Genet., 39, 417–420.
Tusnady,G.E. and Simon,I. (1998) J. Mol. Biol., 283, 489–506.
Tusnady,G.E., Bakos,E., Varadi,A. and Sarkadi,B. (1997) FEBS Lett., 402, 1–3.
Wallin,E. and von Heijne,G. (1998) Protein Sci., 7, 1029–1038.
Yoshizaki,G., Patino,P. and Thomas,P. (1994) Biol. Reprod., 51, 493–503.
Received December 28, 2002; revised May 31, 2003; accepted June 8, 2003.