Protein Engineering, Vol. 12, No. 1, 3-9,
January 1999
© 1999 Oxford University Press
REVIEW
Machine learning approaches for the prediction of signal peptides and other protein sorting signals
Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, DK-2800 Lyngby, Denmark and Department of Biochemistry, Arrhenius Laboratory, Stockholm University, S-106 91 Stockholm, Sweden
| Abstract |
|---|
Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently, the growth of protein databases, combined with machine learning approaches such as neural networks and hidden Markov models, has made it possible to achieve a level of reliability where practical use in, for example, automatic database annotation is feasible. In this review, we concentrate on the present status and future perspectives of SignalP, our neural network-based method for prediction of the most well-known sorting signal: the secretory signal peptide. We discuss the problems associated with the use of SignalP on genomic sequences, showing that signal peptide prediction will improve further if integrated with predictions of start codons and transmembrane helices. As a step towards this goal, a hidden Markov model version of SignalP has been developed, making it possible to discriminate between cleaved signal peptides and uncleaved signal anchors. Furthermore, we show how SignalP can be used to characterize putative signal peptides from an archaeon, Methanococcus jannaschii. Finally, we briefly review a few methods for predicting other protein sorting signals and discuss the future of protein sorting prediction in general.
| Introduction |
|---|
Subcellular protein sorting, i.e. the processes through which proteins are routed to their proper final destination within a cell, is a fundamental aspect of cellular life. In many cases, sorting depends on `signals' that can already be identified by looking at the primary structure of a protein. Thus, targeting to the secretory pathway, to mitochondria and to chloroplasts normally depends on an N-terminal presequence or targeting peptide that can be recognized by receptors on the surface of the appropriate organelle. After targeting, membrane-embedded translocation machineries ensure the delivery of the protein to the interior of the organelle.
By definition, the cell can recognize all kinds of protein sorting signals with almost 100% selectivity and specificity (the level of mis-sorting in vivo appears to be very low, although this aspect of the problem has not been studied in detail). Given that the sorting signals mentioned above seem to be, at least to a good approximation, defined by a linear, N-terminal stretch of the polypeptide, it would appear that we should be able to devise sequence-based methods that can recognize these signals with an efficiency approaching that of the cell itself. If such methods can be developed, they will clearly be of major use for genome analysis and automatic database annotation; at the same time, these massive data analysis tasks necessitate very accurate prediction methods.
While prediction of sorting signals has a long history, started by the early work on secretory signal peptides (von Heijne, 1983; McGeoch, 1985; von Heijne, 1986b), it is only with the application of modern machine learning techniques, such as neural networks (NNs) and hidden Markov models (HMMs), that we seem to be approaching the necessary levels of accuracy (Baldi and Brunak, 1998; Durbin et al., 1998). Machine-learning techniques are ideally suited for pattern recognition tasks where relatively large amounts of data are present and where the patterns are `noisy' and not easily described by a compact set of rules. The fundamental idea behind these approaches is to learn to discriminate automatically from the data, using experimentally verified examples, which most often are extracted from large public sequence and structure databases. While HMMs are best at recognizing, in an `elastic' fashion, the sequential pattern in the amino acids or nucleotides, the NN algorithms are better at handling sequence features correlated over a longer range, especially if there is some degree of conservation in the positioning of the relevant features. Together, the NN and HMM methods can therefore handle a very substantial part of the sequence diversity created by evolution that is characteristic for many complex biological mechanisms. Thus, there now exist quite reliable machine learning-based methods for the identification of secretory signal peptides (SPs), mitochondrial targeting peptides (mTPs) and chloroplast transit peptides (cTPs).
In this review, we will concentrate on the present status and future perspectives of SP prediction, in particular the developments and applications of our own method, SignalP, since it was published in Protein Engineering two years ago (Nielsen et al., 1997a). Several NN-based methods for prediction of SPs have been developed (Ladunga et al., 1991; Schneider and Wrede, 1993), but only SignalP is publicly available. SignalP has been used extensively since it was made available over the internet, but the first version has some important shortcomings that necessitate further development and integration with other prediction methods. In addition, we will review a couple of methods for predicting other protein sorting signals, and discuss some general aspects of sorting signal prediction.
| Constructing the training set for machine learning methods |
|---|
While different algorithms within the broad range of machine learning methods available will have different advantages in terms of their pattern recognition abilities, they are all driven by the data used to train them. The selection of the training set is arguably the most important part in the construction of a prediction method. No matter how sophisticated the algorithm, with poor training data one will get poor results. In the cases discussed here, SWISS-PROT (Bairoch and Apweiler, 1997
Another problem is that a sequence database always contains numerous examples of genes belonging to gene families and homologous genes from various organisms. This can lead to statistical results that are biased for the over-represented sequences, and the performance of prediction methods will be overestimated if the test set contains sequences closely related to those used in the training. Thus, after selecting an initial set of sequences from SWISS-PROT, one has to remove homologous sequences (unless the training algorithm can deal with redundant data sets) using, for example, the Hobohm redundancy reduction method (Hobohm et al., 1992). The question of when two sequences are `too closely related' to be kept within the reduced data set is far from trivial. For the SignalP data set, the similarity threshold is found from the principle that if it is possible to infer the position of the cleavage site in one SP by alignment to another SP, the sequences are too similar. Another approach, which uses the statistical theory of local alignments (Altschul and Gish, 1996), is to fit the alignment scores to an extreme value distribution and choose a threshold value above which there are more observations than expected from the distribution (Pedersen and Nielsen, 1997).
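The redundancy reduction step can be illustrated with a minimal sketch: a greedy selection that keeps a sequence only if it is not too similar to any sequence already kept. This is written in the spirit of list-based redundancy reduction, not as a reproduction of the published procedure. The functions `reduce_redundancy`, `identity_fraction` and `similar_at_80_percent`, the toy sequences and the 0.8 threshold are all assumptions made for illustration; a real pipeline would derive the similarity decision from alignment scores, as described above.

```python
from typing import Callable, List, Sequence


def reduce_redundancy(
    sequences: Sequence[str],
    too_similar: Callable[[str, str], bool],
) -> List[str]:
    """Greedy selection: keep a sequence only if it is not too similar
    to any sequence that has already been kept."""
    kept: List[str] = []
    for seq in sequences:
        if not any(too_similar(seq, other) for other in kept):
            kept.append(seq)
    return kept


def identity_fraction(a: str, b: str) -> float:
    """Toy similarity: fraction of identical residues over the shared
    ungapped prefix. A real pipeline would use alignment scores instead."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def similar_at_80_percent(a: str, b: str) -> bool:
    # The 0.8 threshold is an arbitrary illustration, not the SignalP criterion.
    return identity_fraction(a, b) >= 0.8


# Hypothetical N-terminal sequences, invented for this example.
candidates = [
    "MKKTAIAIAVALAGFATVAQA",
    "MKKTAIAIAVALAGFATVAQS",  # near-duplicate of the first entry
    "MRSLLILVLCFLPLAALGKVF",
]
reduced = reduce_redundancy(candidates, similar_at_80_percent)
print(reduced)
```

The result of a greedy pass depends on the order of the input list, which is one reason the choice of threshold and ordering deserves care.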
Unless the remaining set at this point is prohibitively large, it should be checked by hand against the primary publications. In our experience, features like cleavage sites for sorting signals are not always correctly annotated: sites not listed as `putative' may in fact be based only on an informed guess (or even an existing prediction method), and experimentally verified sites are sometimes incorrectly entered into the database (database `typos'). In a recent study of chloroplast transit peptides (O.Emanuelsson, H.Nielsen and G.von Heijne, manuscript submitted), we had to remove around 10% of the sequences in our homology-reduced data set for such reasons. Even experimentally verified data may be wrong if the interpretation of the results has been faulty. The most relevant example in this context is that an N-terminus of a mature protein, confirmed by amino acid sequencing, might derive not from cleavage by the signal peptidase but from a subsequent cleavage by another protease in the secretory pathway.
If the data set is too large to allow for manual inspection of all entries, some suspicious looking examples may be identified by automated methods. One possibility is to use alignments of the unreduced set to single out pairs of sequences that show a very high similarity but discrepancies in assignment of subcellular location or cleavage site position (Nielsen et al., 1996). Another method is to use the training algorithm itself to pick out cases which are more difficult to learn than others (Brunak, 1993). Both these approaches are necessarily biased; the first will never be able to pick up errors in sequences with no matching homologues, and both can fail to recognize systematic errors that occur in several entries. Still, experience has shown that machine learning methods can serve as extremely useful tools for data set validation; in several cases, NNs have been able to detect errors caused both by simple misprints and by incorrect interpretation of experiments (Brunak et al., 1990a,b).
Another aspect of the choice of training set is whether sequences from all, some subset of, or only a single organism should be included. If there is enough data, organism-specific methods should be expected to perform better than more general ones, but in most cases it is not possible to be this restrictive.
In the SignalP work, we trained two species-specific versions on human and Escherichia coli SPs, and concluded that there was no significant gain in performance when testing with networks trained on a single-species data set relative to networks trained on larger groups (Nielsen et al., 1997a). This result is not definitive, however. The reason why the E.coli-specific network did not show an improvement compared with one trained on a larger set of Gram-negative SPs might simply be that the E.coli set at that time was too small to achieve the same relative performance. Regarding the human-specific network, one should note that the eukaryotic set is dominated by mammals, i.e. rather close relatives to humans; and we cannot exclude the possibility that signal peptides from, for example, yeast (which are relatively underrepresented in the data set), are significantly different from those of mammals. Nevertheless, genomic sequencing opens up the possibility of constructing species-specific versions of the basic algorithm, perhaps by a bootstrapping procedure where a more general version trained on, for example, all eukaryotic sequences, is used to extract an initial set of reliably predicted sequences from, for example, yeast, which is then used to iteratively train a species-specific version.
| Current status of the SignalP method |
|---|
SignalP is a typical example of a NN-based method, and three versions trained on different data sets (eukaryotes, Gram-negative and Gram-positive bacteria) are available. These three versions reflect significant differences in the characteristics of signal peptides from these groups of organisms, and each gives a better performance than a method trained on all groups together. They also provide the opportunity to test the efficiency of a given signal peptide sequence in a non-native host. For example, a human sequence can be analysed by the Gram-positive version of the method and thus give an indication of how effective the sequence will appear in a production organism, say, Bacillus subtilis. If it appears to have a low degree of `signal peptide-ness' in the new host, it can subsequently be engineered such that the SP sequence will optimally match the N-terminus of the mature protein.
SignalP combines two different NNs, one that has been trained to classify each residue in the sequence as either belonging or not belonging to a SP (S-score), and one that has been trained only to recognize the site at the C-terminal end of the SP that is cleaved by the signal peptidase enzyme after targeting (C-score). Cleavage-site prediction performance is significantly enhanced by penalizing C-score peaks that are far away from the transition region between the SP and the mature polypeptide identified by the S-score. This is formalized by using the `Y-score', a geometric average of the C-score and a numerical derivative of the S-score. In the example shown in Figure 1, the C-score has two peaks, where the upstream one is slightly higher but the downstream one occurs in the transition zone of the S-score and therefore has a higher Y-score.
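A minimal sketch of the Y-score idea follows. It assumes that the numerical derivative is taken as the mean S-score over a window of d positions before a residue minus the mean over d positions from that residue onwards. The function name `y_scores`, the window size d = 3, the clipping of negative slopes to zero and the score values are assumptions for illustration; the exact definition is given in the SignalP papers.

```python
import math
from typing import List


def y_scores(c: List[float], s: List[float], d: int = 3) -> List[float]:
    """Combine per-residue C- and S-scores into a Y-score.

    The slope term is the mean S-score over the d positions before i minus
    the mean over the d positions from i onwards; negative slopes are
    clipped to zero so that the square root is defined.
    """
    if len(c) != len(s):
        raise ValueError("C- and S-score lists must have equal length")
    n = len(s)
    y = [0.0] * n
    for i in range(d, n - d + 1):
        before = sum(s[i - d:i]) / d
        after = sum(s[i:i + d]) / d
        slope = max(before - after, 0.0)
        y[i] = math.sqrt(c[i] * slope)  # geometric average of the two terms
    return y


# Invented scores: the S-score drops from 0.9 to 0.1 at position 6, and the
# C-score has a higher upstream peak (position 3) and a lower one at 6.
s = [0.9] * 6 + [0.1] * 6
c = [0.05] * 12
c[3] = 0.8
c[6] = 0.7

y = y_scores(c, s)
cleavage = y.index(max(y))               # position with the maximal Y-score
mean_s = sum(s[:cleavage]) / cleavage    # mean S-score up to that position
print(cleavage, round(max(y), 3), round(mean_s, 3))
```

In this toy case the upstream C-score peak is suppressed because the S-score is flat around it, while the peak in the transition zone is retained, which mirrors the situation described for Figure 1.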
A prediction for the existence of a SP can be made by the maximal value of the C-, S- and Y-scores, or the mean S-score between the N-terminus and the predicted cleavage site. Of these, the maximal Y-score or the mean S-score give the best discrimination performance, but all four values are reported in the output. A more thorough description of the SignalP architecture and the definition of the various measures can be found elsewhere (Nielsen et al., 1997b
The performance values of SignalP are shown in Table I, both for the original version and for a version retrained on a new data set, based on SWISS-PROT release 35 instead of 29. Note that the performance for cleavage site location has improved. Since the old and new data sets are extracted by the same method, and the sizes have changed only slightly, the most probable explanation for the improvement is that the quality of SWISS-PROT annotations concerning SPs is better in the newer version.
There are two important points to be made about the performance values. On the one hand, they should be regarded as minimal, because they are test set performances (averaged over five cross-validation sets), where the homology reduction of the data has assured that the similarity between training and test sets is so low that the correct cleavage sites cannot be found by alignment (Nielsen et al., 1996
On the other hand, the performance values given in Table I are calculated under two limiting assumptions: that the correct N-terminus of the protein in question is known, and that the sequence does not contain an N-terminal transmembrane helix. The data sets on which SignalP is trained and tested contain only the N-terminal part (up to 70 amino acids) of each protein, and transmembrane proteins were not included in the negative set. The decision to use only the N-terminal part of each protein was based on the idea that SignalP should reproduce the recognition task met by the cell in vivo, where SP cleavage takes place only within a certain range from the N-terminus. The reason for the lack of transmembrane helices in the negative set is more practical: it is very hard to ensure that there is experimental evidence for absence of cleavage of a transmembrane protein. For a subset of transmembrane proteins, however, we have a reliable set: eukaryotic signal anchors (see below).
These two points constitute a problem for the application of SignalP to genome and EST data. As an illustration of this, the scanning of the Haemophilus influenzae genome which we reported in the SignalP paper (Nielsen et al., 1997a) produced a remarkably large variation in the estimate of the proportion of proteins with SPs: from 14% if using the maximal Y-score as discriminator, to 28% when using the maximal S-score, even though all these measures give high discrimination performances when used on the SignalP data set. This means that the performance of (at least) one of these measures is considerably lower when applied to genome data; and that SignalP, when used for this purpose, should ideally be combined with a transmembrane helix prediction and a start codon prediction.
| SignalP-HMM: distinguishing signal peptides from signal anchors |
|---|
Some proteins have sequences that initiate translocation in the same way as SPs do, but are not cleaved by signal peptidase (von Heijne, 1988
-helix, and the region N-terminal of the hydrophobic stretch can also be much longer. Interestingly, experiments have shown that it is possible to convert a cleaved SP into an uncleaved SA merely by lengthening the hydrophobic region (Chou and Kendall, 1990
The discrimination between SAs and SPs has proved to be very difficult for the neural network: approximately 50% of the SAs are predicted as SPs according to the mean S-score. Since both the C-score and the S-score are calculated from sequence windows of a limited width, a feature such as region length is difficult to represent in the input. To solve this problem, we have developed SignalP-HMM, a HMM architecture for SPs and SAs (Figure 2).
The advantage of the HMM method in this context is that it does not use windows of a fixed width, but threads an entire sequence through a trained model. An HMM is a chain of `states', each with a characteristic amino acid distribution, with transitions that specify possible orders of states. Thus, a HMM can model sequences of varying length by transitions that skip or repeat states. By assigning states to known regions of the signal to be modeled, biological knowledge can be built into the HMM.
Secretory signal peptides have three distinct regions: an N-terminal positively charged n-region, a central hydrophobic h-region, and a C-terminal c-region encompassing the signal peptidase cleavage site (von Heijne, 1985). Each of these is represented by a separate part of the model: the n- and h-regions are modeled in a simple way, with all states having the same amino acid frequencies, while the region around the cleavage sites is modeled in more detail (essentially like a weight matrix). Signal anchors have both an n- and an h-region, and no cleavage site. By having two parallel submodels of the HMM, it is possible to represent differences in both length distribution and amino acid frequencies between the n- and h-regions of SPs and SAs. A third branch (actually, just a shortcut) is added to represent those sequences that are neither SPs nor SAs. When threading a sequence through this model, one of the three branches is chosen, and this serves as the prediction of protein type. Additionally, this method provides an objective way to delineate the n-, h- and c-regions in a SP, and it may thus be used to compare the overall design of SPs from different organisms.
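To make the idea of threading a sequence through a chain of states concrete, the following toy sketch runs Viterbi decoding on a single left-to-right branch with n-, h-, c- and mature-protein states. It does not reproduce the SignalP-HMM architecture: the three-branch layout is omitted, residues are collapsed into three classes, and every probability in `TRANS` and `EMIT` is invented for illustration. The input sequence is hypothetical and assumed to be non-empty.

```python
import math
from typing import Dict, List

STATES = ["n", "h", "c", "m"]  # n-region, h-region, c-region, mature protein

# Allowed transitions; self-loops let a region absorb any number of residues.
TRANS: Dict[str, Dict[str, float]] = {
    "n": {"n": 0.7, "h": 0.3},
    "h": {"h": 0.85, "c": 0.15},
    "c": {"c": 0.6, "m": 0.4},
    "m": {"m": 1.0},
}

# Emission probabilities over three residue classes (hand-picked toy values).
EMIT: Dict[str, Dict[str, float]] = {
    "n": {"+": 0.5, "h": 0.2, "o": 0.3},
    "h": {"+": 0.02, "h": 0.9, "o": 0.08},
    "c": {"+": 0.05, "h": 0.35, "o": 0.6},
    "m": {"+": 0.2, "h": 0.4, "o": 0.4},
}


def residue_class(aa: str) -> str:
    """Collapse the 20 amino acids into positive, hydrophobic or other."""
    if aa in "KR":
        return "+"
    if aa in "AILMFVWC":
        return "h"
    return "o"


def viterbi(seq: str) -> str:
    """Return the most probable state path, forced to start in the n-region."""
    obs = [residue_class(aa) for aa in seq]
    neg_inf = float("-inf")
    score = {st: neg_inf for st in STATES}
    score["n"] = math.log(EMIT["n"][obs[0]])
    back: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        new_score = {st: neg_inf for st in STATES}
        pointers: Dict[str, str] = {}
        for prev, targets in TRANS.items():
            if score[prev] == neg_inf:
                continue  # state not reachable yet
            for nxt, p in targets.items():
                cand = score[prev] + math.log(p) + math.log(EMIT[nxt][symbol])
                if cand > new_score[nxt]:
                    new_score[nxt] = cand
                    pointers[nxt] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda st: score[st])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


seq = "MKKTAIAIAVALAGFATVAQAAPKDNT"  # hypothetical sequence
path = viterbi(seq)
print(seq)
print(path)
```

Because the self-transitions let each state emit a variable number of residues, the same small model can label regions of different lengths, and the boundaries between the letters in the printed path give a delineation of the regions.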
SignalP-HMM is able to discriminate between SPs and SAs with a correlation coefficient of 0.74 (see Table I), far from perfect, but much better than with the NNs. In a sense, this comparison is not quite fair, because the SAs were not used explicitly as negative examples during training of the NN, but this would have been problematic given the small size of the SA set. With the HMM, it is easy to take this limitation into account by using a simpler submodel (with a smaller number of free parameters) in the SA branch than in the SP branch. Regarding the identification of SPs versus soluble non-secretory proteins, the HMMs perform on a par with the NNs (and for Gram-negative bacteria even better), but they are less accurate for cleavage site prediction (see Table I).
Type II membrane proteins constitute only a minor fraction of transmembrane proteins. When scanning genome data, it is desirable to distinguish SPs not only from SAs, but also from other types of transmembrane helices. It is advisable to combine SignalP with one of the available prediction methods for transmembrane helices, e.g. PHDhtm (Rost et al., 1996) or TopPred (von Heijne, 1992). Of course, it would be preferable, both for usage on large data sets and from a theoretical point of view, to obtain one prediction of the presence and location of both SPs and transmembrane helices in the sequence. To this end, we plan to build an integrated HMM architecture based on SignalP-HMM and an HMM-based transmembrane helix prediction method, TMHMM (Sonnhammer et al., 1998).
| Start codon prediction |
|---|
A difficulty for prediction of SPs, or any other N-terminal sorting signals, is that the position of the N-terminus in the preprotein is rarely known experimentally. This is particularly troublesome when using genomic data, where protein coding regions are predicted by gene finding algorithms containing numerous potential sources of error. Wrong start codon assignments can produce false negatives, since the resulting sequence may either contain only a partial SP sequence, or a SP plus a stretch of irrelevant amino acid sequence (derived from DNA which is untranslated in vivo) without SP characteristics.
For expressed sequence tags (ESTs) the problem can be even worse, since it is very difficult to decide whether a given sequence includes the start codon at all: it might be entirely untranslated, or correspond to an internal stretch of a protein. The last case can also produce false positive predictions, since non-cytoplasmic ends of transmembrane helices are often rather similar to SP cleavage sites, and the SignalP networks have never been trained to avoid SPs here.
Therefore, it would be desirable to have a method which, given a nucleotide sequence, would provide a prediction of both ends of a SP, i.e. the start codon and the cleavage site. Such a method does not exist yet, but a partial solution would be a score describing the probability that any given triplet is the start codon. To this end, we have developed a NN-based method for start codon prediction in eukaryotes, NetStart (Pedersen and Nielsen, 1997). It is trained to recognize the start codon AUG against all other AUG triplets in the mRNA sequence. It performs this task by using both local context (the Kozak box; Kozak, 1984) and long-range context in the form of implicit reading frame detection. NetStart is designed to work with EST or cDNA data; for use with genomic DNA, the possible occurrence of introns shortly downstream of the start codon could be detrimental to the prediction.
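The classification task, the true start codon against all other AUG triplets, presupposes a step that enumerates every AUG together with its sequence context. The sketch below shows only that candidate-generation step and contains no trained predictor. The function name `aug_candidates`, the flank size of six bases and the example sequence are assumptions for illustration and are not taken from NetStart.

```python
from typing import List, Tuple


def aug_candidates(mrna: str, flank: int = 6) -> List[Tuple[int, str]]:
    """List every AUG triplet with its local sequence context.

    Returns (position, context) pairs, where position is the 0-based index
    of the A and context spans up to `flank` bases on either side of the
    triplet (shorter at the sequence ends).
    """
    mrna = mrna.upper().replace("T", "U")  # accept cDNA-style input as well
    hits: List[Tuple[int, str]] = []
    start = mrna.find("AUG")
    while start != -1:
        left = max(start - flank, 0)
        context = mrna[left:start + 3 + flank]
        hits.append((start, context))
        start = mrna.find("AUG", start + 1)
    return hits


# Invented example sequence with two AUG triplets.
hits = aug_candidates("GCCACCAUGGCGAUGUAA")
for position, context in hits:
    print(position, context)
```

Each candidate window could then be handed to a scoring function; a longer window would be needed for any model that is meant to pick up reading frame information downstream of the triplet.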
Statistical analyses (A.G.Pedersen et al., manuscript in preparation) have shown that the local start codon context varies widely between different systematic groups of eukaryotes. The current NetStart 1.0 contains only two organism-specific versions, for vertebrates and Arabidopsis thaliana, but more will be added in future releases. Although NetStart 1.0 should be regarded as a `first attempt' at this problem, it does show test set performances, measured by correlation coefficient, of 0.62 for vertebrates and 0.71 for A.thaliana.
| Signal peptides of Archaea |
|---|
Secretory SPs from eukaryotes and bacteria are well described, but only very few experimental examples are known from the third domain of life, the archaea (formerly known as archaebacteria). Although being prokaryotic, they show greater similarity in many respects to eukaryotes than to bacteria, especially concerning informational cellular processes such as replication and translation (Olsen and Woese, 1997
We used a `consensus' between the three SignalP versions in a first attempt at characterizing the SPs of Methanococcus jannaschii, the first archaeon to be completely sequenced (Bult et al., 1996). SPs should indeed be expected in this organism: a signal peptidase has been identified by homology in the genome, and it shows greater homology to its eukaryotic than to its bacterial counterpart. The underlying idea is that if we are able to find sequences in the genome which could function as SPs in all other domains of life (i.e. in eukaryotes and both groups of bacteria), they would presumably function as signal peptides in M.jannaschii as well.
Methanococcus jannaschii SPs might have been predicted by alignment to known SPs from other organisms, if significant matches to experimentally verified secretory proteins including the SP region could be found. We made local pairwise alignments between all the predicted M.jannaschii protein sequences and all sequences in the SignalP data set, but found only insignificant matches. Even the best pairwise alignment scores were considerably lower than the threshold required for using a local alignment of two SP sequences to predict the location of the cleavage site (Nielsen et al., 1996). This shows that we cannot expect to find M.jannaschii SPs by alignment; a prediction method is indeed necessary for this task.
We selected sequences where both the maximal Y-score and the mean S-score were above their cut-off values for all three SignalP versions (eukaryotic, Gram-positive and Gram-negative). This is a very conservative criterion: when tested on the SignalP data sets, it accepts 75% of the Gram-negative, 66% of the Gram-positive and only 39% of the eukaryotic SPs. Used on the M.jannaschii genome, it yielded 34 putative SPs, none of which had a known subcellular location. This number is too small to train a species-specific neural network (it might be used for an HMM but this has not yet been implemented), but it is enough to draw a few tentative conclusions about M.jannaschii SPs.
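The consensus criterion amounts to a logical AND over six comparisons, which the following sketch spells out. The cut-off values in `CUTOFFS`, the protein identifiers and all scores are placeholders invented for illustration; they are not the SignalP cut-offs or real M.jannaschii results.

```python
from typing import Dict, List

# Placeholder cut-offs per SignalP version: (maximal Y-score, mean S-score).
# These numbers are hypothetical and only serve to make the example run.
CUTOFFS = {
    "eukaryotic": (0.5, 0.5),
    "gram_negative": (0.5, 0.5),
    "gram_positive": (0.5, 0.5),
}

Scores = Dict[str, Dict[str, float]]  # version -> {"max_y": ..., "mean_s": ...}


def is_consensus_sp(scores: Scores) -> bool:
    """Accept only if both measures exceed their cut-offs in every version."""
    return all(
        scores[version]["max_y"] > y_cut and scores[version]["mean_s"] > s_cut
        for version, (y_cut, s_cut) in CUTOFFS.items()
    )


def select_consensus(predictions: Dict[str, Scores]) -> List[str]:
    """Return the identifiers of all proteins passing the consensus test."""
    return [name for name, scores in predictions.items() if is_consensus_sp(scores)]


# Invented predictions for two hypothetical proteins.
predictions = {
    "protein_A": {
        "eukaryotic": {"max_y": 0.71, "mean_s": 0.82},
        "gram_negative": {"max_y": 0.66, "mean_s": 0.79},
        "gram_positive": {"max_y": 0.58, "mean_s": 0.64},
    },
    "protein_B": {
        "eukaryotic": {"max_y": 0.74, "mean_s": 0.85},
        "gram_negative": {"max_y": 0.61, "mean_s": 0.70},
        "gram_positive": {"max_y": 0.41, "mean_s": 0.66},  # fails one cut-off
    },
}
selected = select_consensus(predictions)
print(selected)
```

A single failed comparison is enough to reject a sequence, which is why the criterion accepts only a fraction of known SPs.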
The 34 sequences were divided into n-, h- and c-regions, and the amino acid content compared with that of eukaryotes and bacteria. The H.influenzae genome (Fleischmann et al., 1995) served as a reference example of a Gram-negative bacterium. In Figure 3, the 34 putative M.jannaschii SPs are represented as a sequence logo, i.e. a sequence of stacked letters, where the total height of the stack at each position shows the amount of information (conservation), while the relative height of each letter shows the relative abundance of the corresponding amino acid (Schneider and Stephens, 1990). When compared with logos of eukaryotic or bacterial SPs (Nielsen et al., 1997a), the following characteristics are observed.
In the n-region, the content of Lys is very high, while Arg is relatively rare. A positively charged n-region is also found in bacterial SPs, but in these Arg and Lys are present in more equal proportions. The Lys content of M.jannaschii n-regions is approximately 30% compared with 20% in H.influenzae. A very characteristic feature is the high content of Ile in the h-region. This is not limited to signal peptides, as Ile is strongly over-represented in M.jannaschii as compared with H.influenzae also in transmembrane regions (16 versus 12%) and in the genome as a whole (10.5 versus 7.1%). However, the difference is more drastic for the h-regions (22 versus 11%).
In the c-region, the dominance of Ala at position -1 is typical for both bacterial and eukaryotic signal peptide cleavage sites, whereas the tolerance of other uncharged residues, such as Val, Leu and Ile, at -3 and the short length of the c-region clearly suggest a eukaryotic type of cleavage site. Around the cleavage site, a unique feature is also found: a high occurrence of Tyr (8% of the c-regions as opposed to 2% in H.influenzae), particularly visible at positions +1 and 2. This seems to be specific for SPs, since the general Tyr content is only slightly higher in M.jannaschii than in H.influenzae (4.3 versus 3.3%). Finally, the occurrence of negatively charged residues in the first few positions of the mature protein has previously been noted for bacterial but not for eukaryotic signal peptides (von Heijne, 1986a).
In conclusion, our analysis suggests that SPs from an archaeon have a eukaryotic-looking cleavage site, a bacterial-looking charge distribution and a unique composition of the hydrophobic region. The statistical description is of course to some extent affected by the fact that we use a consensus method, which only finds signal peptides and cleavage sites that would be acceptable in both eukaryotes and bacteria; chances are that signal peptides peculiar to archaea have gone undiscovered. In other words, we have if anything underestimated the unique characteristics of the M.jannaschii signal peptides.
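The stack heights in a sequence logo such as the one in Figure 3 correspond to the information content of each aligned position. A minimal sketch, assuming a 20-letter alphabet, equiprobable background frequencies and no small-sample correction (simplifications that a real logo program may not make), is given below; the function name `column_information` and the aligned fragments are invented for illustration.

```python
import math
from collections import Counter
from typing import List


def column_information(aligned: List[str], alphabet_size: int = 20) -> List[float]:
    """Information content (bits) per column: log2(alphabet size) minus the
    Shannon entropy of the observed residue frequencies."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    info = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info


# Invented aligned fragments: fully conserved, two-residue and variable columns.
aligned = ["AKA", "AKL", "ARI", "ARV"]
info = column_information(aligned)
print([round(bits, 3) for bits in info])
```

The height of an individual letter in the stack would then be its relative frequency at that position multiplied by the information content of the column.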
| Other protein sorting prediction methods |
|---|
ChloroP is the equivalent of SignalP for predicting chloroplast transit peptides (cTPs), and has been constructed in much the same way (O.Emanuelsson, H.Nielsen and G.von Heijne, manuscript submitted). Two novel aspects are that the yes/no cTP prediction is based on a NN trained on the S-score outputs from the basic NN, and that the cleavage site prediction is not done using a NN but by a simple weight matrix. The weight matrix approach was chosen since a recent experimental study of the cTP processing enzyme stromal processing peptidase (SPP) suggested that the mature N-terminus of chloroplast proteins is often generated by an ill-defined proteolytic removal of one or a few extra residues after the initial SPP cleavage (Richter and Lamppa, 1998
The currently most developed method to predict mTPs is based on a linear combination of a number of sequence characteristics such as amino acid abundance, maximum hydrophobicity and maximum hydrophobic moment that are combined into an overall score (Claros and Vincens, 1996). Preliminary work using the same NN approach as for ChloroP suggests that similar performance levels can be reached using machine learning (our unpublished data).
In addition to the recognition of the sorting signals, prediction of protein sorting can exploit the fact that proteins of different subcellular compartments differ in global properties such as amino acid composition and residue-pair frequencies. While the signal prediction methods are probably closer to mimicking the information processing in the cell, methods based on global properties can complement imperfect signal-based methods, especially on incomplete sequences. Specifically, a composition-based method for recognizing extracellular proteins can be used without knowledge of the N-terminus, and could, for example, give correct predictions for EST-derived protein fragments where the signal peptide has not even been sequenced. The drawback is that such methods will not be able to distinguish between very closely related proteins that differ in the presence or absence of a SP. Most of the work on such methods has been based on traditional statistics (Nakashima and Nishikawa, 1994; Cedano et al., 1997), but machine learning has been employed in the NNPSL method, which uses NNs trained on overall amino acid composition to predict location to three (bacteria) or four (eukaryotes) possible subcellular compartments (Reinhardt and Hubbard, 1998).
The PSORT program (Nakai and Kanehisa, 1992; Horton and Nakai, 1997) is an integrated system of several prediction methods, using both sorting signals and global properties. Some of the components are developed within the PSORT group, others are implementations of methods published elsewhere. PSORT is the only publicly available system that shows this degree of integration, and it includes sorting predictions that are not found elsewhere (e.g. nuclear or peroxisomal targeting). However, it does not include the newest machine-learning methods, which means that PSORT prediction of the more extensively studied protein sorting problems, e.g. SPs or transmembrane helices, is in many cases not the best available.
| The future |
|---|
With the recent advances in prediction methods for protein sorting, the vision of a computer program that is able to predict the subcellular location of almost any given protein with high confidence seems not entirely unrealistic. This would be an integrated system of sorting signal predictors and methods based on overall amino acid composition, and as described above, start codon prediction and transmembrane helix prediction should be included. A major use of such a program would be automatic annotation of sequence databases, including complete genomes.
On the other hand, one big integrated system of all methods may not be the most desirable solution for all users. For automated annotation of very large data sets, integrated prediction systems are of course preferable, but the biologist working on one specific gene might be better off considering comprehensive graphical output from several prediction methods separately, and then deciding which conclusion should be drawn from the possibly conflicting predictions. In some cases (rare but interesting), the biologically correct answer will be something not anticipated by the method builders (e.g. dual targeting, double cleavage, non-standard use of sorting machineries), and uncritical use of a totally integrated prediction system could actually block new discoveries instead of promoting them.
Finally, any given application will require careful consideration of how to strike the best balance between sensitivity and specificity. For gene hunting, one may want high sensitivity (i.e. few false negatives) in order not to miss interesting candidate genes, whereas for database annotation it may be more prudent to ask for high specificity (i.e. few false positives) even if this will leave many sequences unannotated.
The trade-off between sensitivity and specificity illustrates a common aspect in the evaluation of prediction methods. Performances are given as percent correct, correlation coefficients etc., but these depend on the choice of cut-off and the definition of positive and negative data sets. In the signal peptide case, it is quite clear what the positive data sets should be, although it may be argued whether, for example, bacterial lipoproteins should be considered as positive examples. On the other hand, there are many questions to be asked about negative examples: should they comprise only soluble cytoplasmic and nuclear proteins, or include transmembrane and membrane-associated proteins? Should they be limited to N-terminal parts or include entire protein chains? There is no single correct answer to questions like these, which makes comparison of performances of different methods a very tricky business.
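The dependence of these measures on the cut-off can be made concrete with a small sketch. The scores and labels below are invented toy values, not results from any of the methods discussed. Specificity is assumed here to mean TN/(TN+FP); some authors instead use TP/(TP+FP), which would give different numbers. The correlation coefficient is assumed to be the one of Mathews (1975).

```python
from math import sqrt

def confusion_counts(scores, labels, cutoff):
    """Return (TP, FP, TN, FN) when every score >= cutoff is called positive."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive and not is_positive:
            fp += 1
        elif not predicted_positive and is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fp, tn, fn):
    """Fraction of true positives recovered (few false negatives -> high value)."""
    return tp / (tp + fn) if tp + fn else 0.0

def specificity(tp, fp, tn, fn):
    """Fraction of true negatives recovered (few false positives -> high value).
    Assumption: TN/(TN+FP); other definitions exist."""
    return tn / (tn + fp) if tn + fp else 0.0

def correlation_coefficient(tp, fp, tn, fn):
    """Mathews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical predictor outputs and true classes (1 = has signal peptide).
toy_scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]

for cutoff in (0.25, 0.50, 0.70):
    counts = confusion_counts(toy_scores, toy_labels, cutoff)
    print(cutoff,
          round(sensitivity(*counts), 2),
          round(specificity(*counts), 2),
          round(correlation_coefficient(*counts), 2))
```

On this toy set, lowering the cut-off raises sensitivity at the expense of specificity and raising it does the opposite, so a single reported figure is only meaningful together with the cut-off and the negative set it was computed on.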
Since numerical performance measures are mandatory for deciding whether methods have improved, the task of defining such measures is very important, and much more work is needed within the bioinformatics field in order to arrive at common testing standards for method comparison (Nielsen et al., 1996). However, we feel that the most informative test of the performance and applicability of a sequence-based prediction method is carried out by making it available to the biological community, both in academia and in industry, e.g. by implementing it as a server or a portable program. The feedback from users, either directly, or implicitly via usage and citation statistics, can tell us more about the quality of our bioinformatics work than percentages and correlation coefficients will ever be able to.
| Availability of methods |
|---|
SignalP, TMHMM, NetStart and ChloroP are all available from the prediction server page of the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/). For transmembrane helix prediction, two possibilities in addition to TMHMM (our apologies to several others not mentioned here) are PHDhtm (http://www.embl-heidelberg.de/predictprotein/) and TopPred (http://www.biokemi.su.se/server/toppred2/). PSORT is found at http://psort.nibb.ac.jp/, and NNPSL at http://predict.sanger.ac.uk/nnpsl/.
| Acknowledgments |
|---|
We would like to thank our co-workers in the protein sorting field: Olof Emanuelsson (ChloroP), Erik Sonnhammer (TMHMM), Anders Krogh (TMHMM, SignalP-HMM) and Anders Gorm Pedersen (NetStart). Figure 2
| Notes |
|---|
1 To whom correspondence should be addressed
| References |
|---|
Altschul,S. and Gish,W. (1996) Methods Enzymol., 266, 460–480.
Bailey,T. and Elkan,C. (1994) ISMB, 2, 28–36.
Bairoch,A. and Apweiler,R. (1997) Nucleic Acids Res., 25, 31–36.
Baldi,P. and Brunak,S. (1998) Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge.
Brunak,S. (1993) In Soumpasis,D. and Jovin,T. (eds) Computation of Biomolecular Structures: Achievements, Problems and Perspectives. Springer-Verlag, Berlin, pp. 43–54.
Brunak,S., Engelbrecht,J. and Knudsen,S. (1990a) Nature, 343, 123.
Brunak,S., Engelbrecht,J. and Knudsen,S. (1990b) Nucleic Acids Res., 18, 4797–4801.
Bult,C.J., White,O., Olsen,G.J. et al. (1996) Science, 273, 1058–1073.
Cedano,J., Aloy,P., Pérez-Pons,J. and Querol,E. (1997) J. Mol. Biol., 266, 594–600.
Chou,M.M. and Kendall,D.A. (1990) J. Biol. Chem., 265, 2873–2880.
Claros,M.G. and Vincens,P. (1996) Eur. J. Biochem., 241, 779–786.
Durbin,R.M., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Fleischmann,R.D., Adams,M.D., White,O. et al. (1995) Science, 269, 496–512.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.
Horton,P. and Nakai,K. (1997) ISMB, 5, 147–152.
Kozak,M. (1984) Nucleic Acids Res., 12, 857–872.
Ladunga,I., Czakó,F., Csabai,I. and Geszti,T. (1991) CABIOS, 7, 485–487.
Mathews,B. (1975) Biochim. Biophys. Acta, 405, 442–451.
McGeoch,D.J. (1985) Virus Res., 3, 271–286.
Nakai,K. and Kanehisa,M. (1992) Genomics, 14, 897–911.
Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54–61.
Nielsen,H., Brunak,S., Engelbrecht,J. and von Heijne,G. (1997a) Protein Engng, 10, 1–6.
Nielsen,H., Brunak,S., Engelbrecht,J. and von Heijne,G. (1997b) Int. J. Neural Sys., 8, in press.
Nielsen,H., Engelbrecht,J., von Heijne,G. and Brunak,S. (1996) Proteins, 24, 165–177.
Nilsson,I., Whitley,P. and von Heijne,G. (1994) J. Cell Biol., 126, 1127–1132.
Olsen,G. and Woese,C. (1997) Cell, 89, 991–994.
Pedersen,A.G. and Nielsen,H. (1997) ISMB, 5, 226–233.
Reinhardt,A. and Hubbard,T. (1998) Nucleic Acids Res., 26, 2230–2236.
Richter,S. and Lamppa,G. (1998) Proc. Natl Acad. Sci. USA, 95, 7463–7468.
Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 1704–1718.
Schneider,G. and Wrede,P. (1993) J. Mol. Evol., 36, 586–595.
Schneider,T.D. and Stephens,R.M. (1990) Nucleic Acids Res., 18, 6097–6100.
Sonnhammer,E.L., von Heijne,G. and Krogh,A. (1998) ISMB, 6, 175–182.
von Heijne,G. (1983) Eur. J. Biochem., 133, 17–21.
von Heijne,G. (1985) J. Mol. Biol., 184, 99–105.
von Heijne,G. (1986a) J. Mol. Biol., 192, 287–290.
von Heijne,G. (1986b) Nucleic Acids Res., 14, 4683–4690.
von Heijne,G. (1988) Biochim. Biophys. Acta, 947, 307–333.
von Heijne,G. (1992) J. Mol. Biol., 225, 487–494.
Received November 23, 1998; accepted November 24, 1998.

dSi) are shown for each position in the sequence, and the true cleavage site is marked with an arrow. In this example with two C-score peaks, the cleavage site would be incorrectly predicted when relying on the C-score alone, but the combined Y-score is able to predict it correctly. (Note: the C-score is defined to be high for the position immediately after the cleavage site, i.e. the first position in the mature protein.)
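How a combined score can rescue a cleavage site that the C-score alone would miss can be sketched as follows. The sketch assumes that the Y-score is the geometric average of the C-score and a smoothed drop in the S-score over a window of d positions; the window size, the clamping of negative drops to zero, the function names and all score values are illustrative assumptions, not the actual SignalP implementation or output.

```python
from math import sqrt

def s_score_drop(s, i, d):
    """Mean S-score of the d positions before i minus the mean of the
    d positions starting at i (truncated at the sequence ends)."""
    before = s[max(0, i - d):i]
    after = s[i:i + d]
    if not before or not after:
        return 0.0
    return sum(before) / len(before) - sum(after) / len(after)

def y_scores(c, s, d=3):
    """Combine C-score and S-score drop per position (assumed form:
    geometric average, with negative drops clamped to zero)."""
    return [sqrt(c[i] * max(0.0, s_score_drop(s, i, d)))
            for i in range(len(c))]

def predicted_site(scores):
    """0-based index of the highest score, i.e. the predicted first
    position of the mature protein."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy profile: the S-score is high over the first 8 positions and then
# falls; the C-score has a spurious peak at index 4 and a slightly lower
# peak at the true site, index 8.
toy_s = [0.9] * 8 + [0.1] * 6
toy_c = [0.1] * 14
toy_c[4] = 0.9
toy_c[8] = 0.8

print("C-score alone:", predicted_site(toy_c))
print("Y-score:", predicted_site(y_scores(toy_c, toy_s)))
```

In this toy profile the spurious C-score peak lies inside the region of uniformly high S-score, so its combined score vanishes, whereas the peak that coincides with the fall in S-score is retained.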

































