PEDS Advance Access published online on November 6, 2008

Protein Engineering Design and Selection, doi:10.1093/protein/gzn064

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

Prediction of signal peptides in archaea

P.G. Bagos1,2,3, K.D. Tsirigos1, S.K. Plessas1, T.D. Liakopoulos1 and S.J. Hamodrakas1

1Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, Athens 15701
2Department of Informatics with Applications in Biomedicine, University of Central Greece, Papasiopoulou 2–4, Lamia 35100, Greece

3To whom correspondence should be addressed. E-mail: pbagos@biol.uoa.gr, pbagos@ucg.gr


    Abstract
 
Computational prediction of signal peptides (SPs) and their cleavage sites is of great importance in computational biology; however, there is currently no method capable of reliably predicting the SPs of archaea, owing to the limited number of experimentally verified proteins with SPs. We performed an extensive literature search to identify archaeal proteins with experimentally verified SPs and found 69 such proteins, the largest number reported to date. A detailed analysis of these sequences revealed some unique features of archaeal SPs, such as the distinctive amino acid composition of the hydrophobic region, with a higher than expected occurrence of isoleucine, and a cleavage site that more closely resembles that of gram-positive bacteria, with almost equal amounts of alanine and valine at position -3 before the cleavage site and a dominant alanine at position -1, followed in abundance by serine and glycine. Using these proteins as a training set, we trained a hidden Markov model method that predicts the presence of SPs and their cleavage sites and also discriminates such proteins from cytoplasmic and transmembrane ones. The method performs satisfactorily, yielding, in a 35-fold cross-validation procedure, a sensitivity of 100% and a specificity of 98.41%, with a Matthews' correlation coefficient of 0.964. This method is currently the only one available for the prediction of secretory SPs in archaea, and it performs consistently and significantly better than other available predictors that were trained on sequences of eukaryotic or bacterial origin. Searching 48 completely sequenced archaeal genomes, we identified 9437 putative SPs. The method, PRED-SIGNAL, and the results are freely available for academic users at http://bioinformatics.biol.uoa.gr/PRED-SIGNAL/ and we anticipate that it will be a valuable tool for the computational analysis of archaeal genomes.

Keywords: archaea/hidden Markov model/prediction/secreted proteins/signal peptide


    Introduction
 
In all three domains of life (bacteria, eukarya and archaea), proteins that are destined to be exported from the cytoplasm are generally (but not exclusively) synthesized as precursor proteins bearing a cleavable N-terminal signal sequence. In all three domains, the signal peptide (SP) is composed of a positively charged region at the N-terminus (n-region), a hydrophobic region (h-region) that spans the membrane and a c-region of mostly small and uncharged residues ending at the characteristic cleavage site (von Heijne, 1990). The SP is necessary for targeting the protein to the membrane-embedded export machinery in bacteria (Driessen and Nouwen, 2008), eukaryotes (Rapoport et al., 1999) and archaea (Pohlschroder et al., 2005). Upon translocation across the membrane, the SP is cleaved from the precursor by a membrane-bound signal peptidase (van Roosmalen et al., 2004; Tuteja, 2005). The enzyme is called Spase I in bacteria, and orthologues are found in archaea as well as in eukaryotes. In eukaryotes, proteins targeted to the organelles of bacterial origin (mitochondria and chloroplasts) also contain cleavable N-terminal targeting sequences, although these are in general very different from those found in eukaryotic or bacterial secreted proteins (von Heijne et al., 1989; Habib et al., 2007). In addition, in bacteria (as well as in chloroplasts), another major pathway has been discovered, utilizing the twin-arginine (Tat) translocase, which recognizes longer and less hydrophobic SPs that carry a distinctive pattern of two consecutive arginines (R-R) in the n-region (Teter and Klionsky, 1999; Berks et al., 2005; Lee et al., 2006).
A major functional difference between the Sec and Tat export pathways is that the former translocates secreted proteins in an unfolded state through a protein-conducting channel, whereas the latter translocates completely folded proteins by a mechanism that remains unknown (Teter and Klionsky, 1999).

In bacteria, a second signal peptidase (Spase II or Lsp) has been discovered that acts on membrane-bound lipoproteins (Sankaran and Wu, 1995) and cleaves shorter SPs carrying a distinctive c-region containing a conserved cysteine (von Heijne, 1989). The conserved cysteine is indispensable in both gram-positive and gram-negative bacteria, and is necessary for membrane anchoring. The post-translational lipid modification involves three enzymes that act sequentially: the prolipoprotein diacylglyceryl transferase (Lgt), which transfers a diacylglyceride to the cysteine sulfhydryl group; the signal peptidase II (Spase II or Lsp), which cleaves the SP at the residue before the cysteine, forming an apolipoprotein; and the apolipoprotein N-acyltransferase (Lnt), which acylates the α-amino group of the N-terminal cysteine of the apolipoprotein, forming the mature lipoprotein (Sankaran and Wu, 1994; Sankaran et al., 1995). Although dozens of putative lipoproteins have been identified in archaeal genomes, the absence of Spase II orthologues in archaea, as well as the different post-translational modification of cysteine, has resulted in limited knowledge concerning archaeal lipoproteins and a lack of experimentally verified proteins of that type. Translocation of lipoproteins through the Tat pathway has been postulated based on sequence analysis, but has only recently been demonstrated for the bacterium Desulfovibrio vulgaris (Valente et al., 2007) and the archaeon Haloferax volcanii (Gimenez et al., 2007). Interestingly, in halophilic archaea, the components of the Tat pathway are essential for viability (Dilks et al., 2005; Thomas and Bolhuis, 2006) and there is evidence that Tat-dependent translocation is widely used as part of a mechanism for adaptation to extreme saline environments (Rose et al., 2002).

Computational prediction of secretory SPs was performed initially using weight matrices (von Heijne, 1986). However, Neural Networks (Nielsen et al., 1997; Nielsen et al., 1999) as well as hidden Markov models (HMMs) (Nielsen and Krogh, 1998), both introduced by the SignalP method, have proven to be the most successful methods currently available (Menne et al., 2000). Recently, SignalP was retrained and, mainly due to better annotation and selection of the training set, yielded even better accuracy (Bendtsen et al., 2004), whereas the program TatP has been presented as offering the most accurate classification of Tat SPs (Bendtsen et al., 2005). A different approach has been followed in the Phobius method (Kall et al., 2004; Kall et al., 2007), where an HMM was used to predict simultaneously the presence of a secretory SP and the transmembrane (TM) topology of a given protein. Following this approach, the authors showed that they can minimize the number of SPs predicted as TM segments and vice versa. Concerning lipoproteins, for years regular expression patterns based on the von Heijne rule (von Heijne, 1989) were used, with various modifications (Madan Babu and Sankaran, 2002; Sutcliffe and Harrington, 2002; Madan Babu et al., 2006; Setubal et al., 2006). Recently, a method called LipoP was developed, which is based on HMMs and was trained exclusively on lipoproteins of gram-negative bacteria (Juncker et al., 2003). However, all of the prediction methods mentioned above have been trained on bacterial and/or eukaryal sequences, and in most cases there are different versions of the predictors that aim to capture the distinct sequence features of the SPs of particular groups of organisms.
Since very few experimentally verified SPs from archaea have been characterized, little is known about the precise characteristics of these sequences, even though there is some evidence suggesting that archaeal SPs exhibit a mixture of characteristics found in eukarya and bacteria. The first computational work on archaea was performed by Nielsen et al. (1999), who applied SignalP to the genome of Methanococcus jannaschii (M. jannaschii). They used the three versions of SignalP (trained on gram-positive bacteria, gram-negative bacteria and eukarya), and identified 34 proteins for which the predictions concerning the existence of the SP coincided. A more systematic evaluation was performed later by Bardy et al. (2003), who applied a similar procedure to 15 completely sequenced archaeal genomes, with the additional requirement that all three methods predict the same cleavage site. Although this procedure may be biased towards selecting only proteins that share common features with the sequences found in other domains of life, the general conclusions of these studies suggested that archaeal SPs exhibit a more eukaryotic-like cleavage site (c-region), and a unique h-region resembling the bacterial ones, with a slight over-representation of leucine and isoleucine; in eukaryotes, leucine is by far the dominant residue. These observations indicate that SP predictors trained on eukaryal or bacterial proteins cannot be relied upon for archaeal sequences, and that a dedicated prediction method, trained exclusively on archaeal SPs, is needed. The major problem in this respect is the lack of a large number of experimentally verified signal sequences of archaeal origin. In particular, the UniProt database (Wu et al., 2006) lists only 12 archaeal sequences with experimentally verified, precise locations of the cleavage site, and the specialized database of SPs, SPDB (Choo et al., 2005), lists only nine such proteins.


    Materials and methods
 
Hidden Markov model

The HMM that we used is similar to the one used in SignalP (Nielsen and Krogh, 1998). It consists of three sub-models: the SP sub-model, corresponding to secretory SPs; the N-terminal TM sub-model, corresponding to an N-terminal TM segment; and a globular sub-model, used to model the globular N-terminal domains of cytoplasmic or membrane proteins. The central core of the model is the SP sub-model (Fig. 1). It captures the modular nature of SPs by modeling the positively charged n-region, the hydrophobic h-region that spans the membrane and the c-region of mostly small and uncharged residues ending at the characteristic cleavage site (A-X-A) (von Heijne, 1990). The TM sub-model is identical to the one used by the HMM-TM predictor for alpha-helical membrane proteins (Bagos et al., 2006), whereas the globular sub-model consists simply of a self-transitioning state.


Fig. 1. Architecture of the HMM used to model the secretory SP sequences. Each line (top to bottom) corresponds to the n-, h- and c-region, respectively. States in the n- and h-region that share the same emission probabilities (amino acid frequencies) are depicted using the same symbol. The cleavage site is shown using a dashed vertical line between A and 1 (first amino acid of the mature protein). Allowed transitions are depicted with arrows. B and E correspond to the Begin and End states, respectively, whereas states after the cleavage site (1–5 and M) are used to model the first residues of the mature protein.

 
The model was trained using the Baum–Welch algorithm for labeled sequences (Krogh, 1994) and decoding was performed using the standard Viterbi algorithm (Durbin et al., 1998), although more advanced techniques such as Posterior-Viterbi decoding (Fariselli et al., 2005) and the Optimal Accuracy Posterior Decoder (Kall et al., 2005) yield nearly identical results. Viterbi decoding produces the optimal path of states through the model, and hence predicts simultaneously the type of the sequence (SP, TM or globular) and the cleavage site (if any). In addition, we report the S1 reliability index (Melen et al., 2003), which takes values in the range [0–1] and provides a useful measure of the reliability of the prediction. Given that the majority of the SPs used (discussed later) lacked information concerning the precise location of the cleavage site, an 'imputation' or 're-labeling' method had to be used. Although the location of the cleavage site in proteins with non-verified cleavage sites could be predicted by other means, we chose to train an initial model using the verified proteins, and afterwards to apply the method to the non-verified ones, performing a constrained prediction by removing the labels in the area of the cleavage site (c-region) as described earlier (Krogh et al., 2001; Bagos et al., 2006).
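The decoding step can be illustrated with a minimal Viterbi sketch in Python. This is not the PRED-SIGNAL model: the four-state left-to-right layout (n, h, c and a mature-protein state m), the three-letter residue alphabet and every probability below are assumptions chosen only to make the example small and runnable.

```python
import math

# Hypothetical toy model: n-region -> h-region -> c-region -> mature protein.
# All states, residue classes and probabilities are illustrative assumptions.
STATES = ["n", "h", "c", "m"]
START = {"n": 1.0, "h": 0.0, "c": 0.0, "m": 0.0}
TRANS = {
    "n": {"n": 0.7, "h": 0.3},
    "h": {"h": 0.85, "c": 0.15},
    "c": {"c": 0.7, "m": 0.3},
    "m": {"m": 1.0},
}
EMIT = {
    "n": {"+": 0.50, "H": 0.10, "o": 0.40},
    "h": {"+": 0.02, "H": 0.88, "o": 0.10},
    "c": {"+": 0.05, "H": 0.35, "o": 0.60},
    "m": {"+": 0.20, "H": 0.40, "o": 0.40},
}


def residue_class(aa):
    """Reduce an amino acid to positive (+), hydrophobic (H) or other (o)."""
    if aa in "KR":
        return "+"
    if aa in "AILVFM":
        return "H"
    return "o"


def log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for seq as a string of state names."""
    obs = [residue_class(aa) for aa in seq]
    scores = [{s: log(START[s]) + log(EMIT[s][obs[0]]) for s in STATES}]
    pointers = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda r: scores[-1][r] + log(TRANS[r].get(s, 0.0)))
            row[s] = scores[-1][prev] + log(TRANS[prev].get(s, 0.0)) + log(EMIT[s][o])
            ptr[s] = prev
        scores.append(row)
        pointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    example = "MKKRLLALAIVAVLASGSAAQET"  # made-up sequence, not from the data set
    print(example)
    print(viterbi(example))
```

In such a sketch the predicted cleavage site would be read off as the boundary between the last c state and the first m state of the path; the model described in the text additionally contains the TM and globular sub-models, so the same decoding also assigns the sequence type.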

Data sets

As we noted earlier, the publicly available databases, such as UniProt (Wu et al., 2006) and SPDB (Choo et al., 2005), currently contain annotated information for only a few archaeal sequences with experimentally verified, precise locations of the cleavage site. Thus, we decided to perform an extensive literature search in order to identify archaeal proteins with verified cleavage site locations, as well as proteins with verified SPs whose cleavage sites are not precisely known. The literature search was performed on PubMed using terms such as 'SP' or 'signal sequence', combined with terms such as 'archaeon', 'archaea' or 'archaebacteria'. Since this strategy also yielded a limited number of archaeal peptides, and given that in many known cases the information concerning the presence of the SP was not available in the abstract or the title of the respective papers, we used additional search terms such as 'extracellular', 'extracytoplasmic' or 'secreted'. The full texts of the papers were downloaded and read, and the reference lists were also checked in order to identify additional studies that were missed by the initial search. The identified sequences were retrieved from UniProt (Wu et al., 2006) in almost every case, and were classified according to two criteria: first, whether the protein has a verified SP cleavage site, and second, whether the protein is translocated by the Tat or the Sec system. Lipoprotein SPs were removed since there are only a few such examples (see Results and discussion).

Since the model is also capable of discriminating SPs from globular proteins as well as from proteins with an N-terminal TM helix, we used as negative examples 69 archaeal proteins with an annotated (proven or putative) TM segment within the first 70 amino acids and with the N-terminus located in the cytoplasm, and 183 archaeal cytoplasmic proteins. The sequences were retrieved from UniProt and identical sequences were removed to produce a unique set. Training and testing were performed using a 35-fold cross-validation procedure. The training set was split into 35 parts with approximately the same number of SPs, TM and cytoplasmic proteins in each. In each round, one of the 35 subsets was removed from the training set, the model was trained on the remaining proteins and then tested on the proteins of the subset that was removed. This process was repeated in turn for all the subsets, and the final prediction accuracy summarizes the outcome of all the independent tests. Sequences belonging to different cross-validation subsets did not have more than 18 identical residues within the SP, as advised by previous studies (Nielsen et al., 1997; Nielsen et al., 1999). Finally, the complete proteomes of archaea were downloaded from the NCBI ftp site at ftp://ftp.ncbi.nih.gov/.
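The splitting scheme described above can be sketched as follows. The round-robin assignment, the function names and the protein identifiers are assumptions made for illustration; the text does not state how the 35 parts were actually constructed, and the sketch does not implement the sequence-similarity constraint between subsets.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split protein ids into k folds with roughly equal class composition.

    labels maps a protein id to its class ("SP", "TM" or "CYT").
    Ids are dealt out round-robin, one class after the other, so that
    per-fold class counts differ by at most one.
    """
    by_class = defaultdict(list)
    for pid, cls in labels.items():
        by_class[cls].append(pid)
    folds = [[] for _ in range(k)]
    i = 0
    for cls in sorted(by_class):
        for pid in by_class[cls]:
            folds[i % k].append(pid)
            i += 1
    return folds


def leave_one_fold_out(folds):
    """Yield (train_ids, test_ids) pairs, holding out each fold once."""
    for i, test_ids in enumerate(folds):
        train_ids = [pid for j, fold in enumerate(folds) if j != i for pid in fold]
        yield train_ids, test_ids


if __name__ == "__main__":
    # Hypothetical identifiers; only the class sizes (69, 69, 183) come from the text.
    labels = {f"SP{i}": "SP" for i in range(69)}
    labels.update({f"TM{i}": "TM" for i in range(69)})
    labels.update({f"CYT{i}": "CYT" for i in range(183)})
    folds = stratified_folds(labels, 35)
    for train_ids, test_ids in leave_one_fold_out(folds):
        pass  # train the model on train_ids, then test it on test_ids
    print(sorted({len(f) for f in folds}))
```

A fuller version would also keep similar sequences in the same fold, so that no pair split across folds exceeds the identity threshold mentioned in the text.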

As measures of accuracy in the binary classification problem (SPs versus non-SPs), we used the percentage of correctly classified positive examples (sensitivity), the percentage of correctly classified negative examples (specificity) and the Matthews' correlation coefficient (MCC), which summarizes in a single measure the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) (Baldi et al., 2000).
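For concreteness, these three measures can be computed from the four counts as in the short Python sketch below. The function names are our own; the formulas are the standard definitions, and the example counts are those reported in Results and discussion (all 69 SPs predicted correctly and 248 of the 252 negative examples rejected).

```python
import math


def sensitivity(tp, fn):
    """Fraction of positive examples (SPs) that are classified correctly."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negative examples (non-SPs) that are classified correctly."""
    return tn / (tn + fp)


def mcc(tp, fp, tn, fn):
    """Matthews' correlation coefficient; returns 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    tp, fn, tn, fp = 69, 0, 248, 4  # counts taken from the cross-validation results
    print(f"sensitivity = {100 * sensitivity(tp, fn):.2f}%")
    print(f"specificity = {100 * specificity(tn, fp):.2f}%")
    print(f"MCC = {mcc(tp, fp, tn, fn):.3f}")
```

With these counts the functions reproduce the reported values of 100% sensitivity, 98.41% specificity and an MCC of 0.964.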


    Results and discussion
 
The extensive literature search that we performed identified a total of 69 archaeal proteins with a verified SP (Table I). Among them, 24 proteins have cleavage sites that were defined precisely by direct sequencing of the N-terminus of the mature protein. The 69 proteins listed in Table I include many extracellular secreted enzymes (proteases, chitinases, amylases, etc.), several surface (S-layer) proteins, a few extracellular components of ABC transporter systems, as well as some uncharacterized proteins from the two main kingdoms of archaea (Crenarchaeota and Euryarchaeota). A few sequences were discarded because their SP sequences were identical to those of others in the set (i.e. CSG_METSC, which is identical to CSG_METFE, and Q7LYT7_PYRWO, which is identical to O08452_PYRFU), as was one sequence (Q97X08_SULSO) for which there was evidence suggesting that it is membrane-anchored (Ferrer et al., 2005). Only two pairs of sequences had more than 18 identical residues in a BLAST alignment (CSG_METJA with Q6M088_METMP and HLY_HAL17 with Q5RLZ1_NATMA), although they have different cleavage sites. We therefore decided to keep them in the training set and to place each pair in the same cross-validation subset, so that the two members are tested simultaneously (to avoid overfitting). A number of proteins with a lipoprotein SP that was either proven (Gimenez et al., 2007) or putative (Mattar et al., 1994) were also discarded. We did not specifically try to eliminate Tat SPs (the same was done in SignalP), and in total 18 such sequences are included in the set, of which four have a verified cleavage site.


 
Table I. Data set of 69 experimentally verified SPs identified in this study

 
The alignment of the SPs at their respective cleavage sites (Fig. 2) provides insight into the unique sequence features of archaeal SPs. The sequence logos (Schneider and Stephens, 1990; Crooks et al., 2004) in Fig. 2 reveal the similarities and differences between the experimentally verified SPs of archaea, eukaryotes, gram-positive and gram-negative bacteria [data for eukaryotes and bacteria were taken from the set of SignalP (Nielsen et al., 1997)]. At position -1 (just before the cleavage site), alanine (A) is the dominant amino acid in archaea, although glycine (G) and serine (S) are also present in significant proportions. Alanine is the dominant amino acid at this position in all organism groups, though in eukaryotes other amino acids are more easily tolerated than in bacteria. At position -3, alanine is also the dominant amino acid; in archaea, however, valine (V) is almost equally represented, followed by serine, isoleucine (I) and threonine (T). Taken together, these features suggest that the archaeal cleavage site most closely resembles that of gram-positive bacteria, although some resemblance to the eukaryal one is visible. In the h-region of archaeal SPs, alanine, leucine and isoleucine are almost equally abundant whereas valine is less frequent, a feature that is unique to the archaeal domain. In eukaryal SPs, leucine is clearly the dominant amino acid (followed by equal amounts of alanine and valine), whereas in bacteria alanine and leucine are almost equally present. In both cases isoleucine is under-represented, in contrast with what is seen in archaea. Furthermore, the c-region contains mostly small and uncharged residues (serine, glycine, threonine and proline), whereas in the n-region lysine is slightly more frequent than arginine, despite the presence of 18 Tat SPs in the training set. Some of these observations were touched on in earlier works (Nielsen et al., 1999; Bardy et al., 2003).
Here these patterns are analyzed for the first time on the basis of experimentally verified archaeal SPs rather than solely on predictions. The results suggest that archaeal SPs have a unique composition, and that there is a need for a dedicated prediction method.
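The kind of position-specific tally that underlies such an analysis can be sketched in a few lines of Python. The three sequences and SP lengths below are invented for illustration and are not taken from Table I; the function name and the position convention (-1 is the last residue of the SP, +1 the first residue of the mature protein) are likewise our own assumptions.

```python
from collections import Counter


def residue_frequencies(peptides, position):
    """Relative amino acid frequencies at a position relative to the cleavage site.

    peptides is a list of (sequence, sp_length) pairs.  Negative positions
    count back from the end of the SP (-1 is its last residue); positive
    positions count into the mature protein (+1 is its first residue).
    """
    counts = Counter()
    for seq, sp_len in peptides:
        idx = sp_len + position if position < 0 else sp_len + position - 1
        if 0 <= idx < len(seq):
            counts[seq[idx]] += 1
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Made-up precursor sequences with made-up SP lengths.
    toy = [
        ("MKKLLAIIALVASAQAEE", 16),
        ("MRRILIGLAVSGVSGDT", 15),
        ("MKLIAIALLSVVAASTP", 14),
    ]
    for pos in (-3, -1):
        print(pos, residue_frequencies(toy, pos))
```

Applying the same tally to every position of the aligned SPs gives the frequency matrix from which a sequence logo such as those in Fig. 2 is drawn.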


Fig. 2. Left panel (from top to bottom): the sequence logos of experimentally verified eukaryal, gram-positive, gram-negative and archaeal signal peptides (SPs), respectively, produced by WebLogo (Crooks et al., 2004). The experimentally verified bacterial and eukaryal SPs were retrieved from the data set of SignalP. Right panel (from top to bottom): the sequence logos of SPs found in the analysis of 48 archaeal genomes (see text) as predicted by SignalPv3-NN, SignalPv3-HMM, PrediSi and PRED-SIGNAL (this work), respectively. The predictions of SignalP and PrediSi correspond to proteins predicted to have exactly the same cleavage site by the different modules of the respective predictor (see text for details). Sequences are aligned at the observed or predicted cleavage site, which in all cases is arbitrarily located between the 35th and 36th amino acid of the alignment.

 
The results obtained in the 35-fold cross-validation procedure are listed in Table II. Our method, PRED-SIGNAL, correctly predicts all 69 SPs and correctly rejects 248 out of the 252 cytoplasmic and TM proteins. These results correspond to 100% sensitivity and 98.41% specificity, with an MCC equal to 0.964. Using the same data set, we also evaluated the various versions of the SignalP method (Nielsen et al., 1997; Nielsen and Krogh, 1998; Nielsen et al., 1999; Bendtsen et al., 2004), Phobius (Kall et al., 2004; Kall et al., 2007) and PrediSi (Hiller et al., 2004), which is another popular and accurate SP predictor based on position-specific scoring matrices (PSSMs). On this archaeal data set, the method developed here clearly outperforms all the currently available top-scoring predictors. This was expected, since none of them was trained specifically to recognize archaeal SPs. In absolute numbers, the method is very accurate and is comparable with, if not better than, the currently top-scoring method SignalP. SignalP, when trained and independently tested on gram-positive bacteria, gram-negative bacteria and eukaryotes, respectively, reports sensitivities ranging from 92 to 99%, specificities ranging from 85 to 93% and MCCs ranging from 0.87 to 0.92, when only cytoplasmic proteins are used as negative examples (Nielsen et al., 1997; Bendtsen et al., 2004). When proteins with an N-terminal TM segment are included in the test set, the specificity drops below 90%, as was shown in an earlier evaluation study (Menne et al., 2000). From Table II, it is also clear that among predictors trained on data sets of non-archaeal origin, those trained on gram-positive bacteria perform better in predicting archaeal signal sequences, a fact that can be explained by the composition of the c-region in archaeal SPs discussed earlier.
Of these methods, only SignalPv3-NN trained on gram-positive bacteria is comparable with the method that we developed, having a slightly better specificity but, nevertheless, a lower sensitivity and overall performance (MCC).


 
Table II. Results obtained from PRED-SIGNAL using the cross-validation procedure on the set of 69 experimentally verified SPs and on 69 TM and 183 cytoplasmic archaeal proteins

 
Furthermore, the results obtained by using a combination of different SP predictors (i.e. the SignalP modules trained on eukaryotes, gram-positive and gram-negative bacteria) illustrate the difficulties of such an approach. Although combining predictors increases the specificity of the selection (i.e. fewer FPs), the sensitivity decreases (i.e. more FNs). Thus, this strategy (which was until now the only option) reliably predicts some SPs but at the same time overlooks a large number of true SPs. Some general conclusions can also be drawn from these results, confirming previous studies. As we noted earlier, methods trained on gram-positive bacteria (SignalPv2, SignalPv3 and PrediSi) perform slightly better than their gram-negative counterparts and clearly better than the eukaryotic-based ones. Phobius, which was trained on a mixed set of proteins (gram-positive, gram-negative and eukaryotic), also performs well, but places lower than methods trained on gram-positive bacteria as well as methods trained on gram-negative bacteria. HMM methods that were trained to discriminate N-terminal TM regions from SPs (Phobius, SignalP-HMM) perform better in terms of specificity than Neural Network and PSSM methods (SignalP-NN, PrediSi). On the other hand, Neural Network-based methods (SignalP-NN) are better at predicting the precise cleavage site location (data not shown). Finally, the updated versions of SignalP (SignalPv3) in general perform better than the older versions (SignalPv2).
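The trade-off of the combined strategy follows directly from its decision rule, which can be sketched as below. The representation of a module's output (None for "no SP", an integer for the predicted cleavage position) and the example values are hypothetical and do not reflect the actual output format of SignalP.

```python
def consensus_call(predictions):
    """Return the shared cleavage position if every module predicts an SP
    with the same cleavage site; otherwise return None.

    predictions holds one entry per module: None (no SP predicted) or an
    int (predicted cleavage position).
    """
    if not predictions or any(p is None for p in predictions):
        return None
    return predictions[0] if len(set(predictions)) == 1 else None


def any_call(predictions):
    """Return True if at least one module predicts an SP."""
    return any(p is not None for p in predictions)


if __name__ == "__main__":
    # Hypothetical outputs of three modules for three proteins.
    examples = {
        "protein_a": [24, 24, 24],    # all agree: accepted by the consensus
        "protein_b": [24, 26, 24],    # sites differ: rejected by the consensus
        "protein_c": [None, 31, 31],  # one module sees no SP: rejected
    }
    for name, preds in examples.items():
        print(name, consensus_call(preds), any_call(preds))
```

Because a single dissenting module is enough to reject a protein, the rule can only remove predictions relative to any individual module, which is why it gains specificity at the cost of sensitivity.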

We also analyzed the 48 completely sequenced archaeal genomes that are currently available. The combined prediction of the three HMM predictors of SignalPv3 (gram-positive, gram-negative and eukaryotic) produced a total of 6145 proteins with an SP, of which 2306 have the same predicted cleavage site in all three methods. The combination of the NN predictors of SignalPv3 yielded a total of 5473 predictions, of which 2037 have the same predicted cleavage site. In contrast, the method developed here predicts a much larger number of proteins with signal sequences, 9437 in all. According to their annotation, the largest group among these proteins consisted of 5351 hypothetical proteins (56.7%), followed by 1408 (14.92%) enzymes such as lipases, hydrolases, transferases, proteases, kinases, reductases, etc., of which 127 were probable, putative or predicted. There were also 832 (8.81%) membrane proteins such as permeases, transporters, etc., of which 82 were probable, putative or predicted, and 1024 (10.85%) extracellular proteins (mostly solute-binding components of ABC transport systems, as well as S-layer and flagellar proteins), of which 43 were probable, putative or predicted. Finally, there were 822 proteins that could not be classified (8.71%).
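As a quick consistency check, the category counts quoted above can be summed and converted to percentages with a few lines of Python. The dictionary keys are our own shorthand for the annotation groups; the counts are those given in the text.

```python
# Annotation groups and counts as quoted in the text (keys are shorthand).
counts = {
    "hypothetical proteins": 5351,
    "enzymes": 1408,
    "membrane proteins": 832,
    "extracellular proteins": 1024,
    "unclassified": 822,
}

total = sum(counts.values())
shares = {group: 100 * n / total for group, n in counts.items()}

if __name__ == "__main__":
    print("total predicted SPs:", total)
    for group, share in shares.items():
        print(f"{group}: {share:.2f}%")
```

The five groups add up to the 9437 predicted SPs, and the computed shares agree with the quoted percentages to within a difference in the last digit that depends on whether values are rounded or truncated.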

The detailed results for each genome are available as Supplementary data on our web site (http://bioinformatics.biol.uoa.gr/PRED-SIGNAL/). The per-genome percentage of proteins predicted to carry an SP according to our method ranges from 5 to 14% (average = 8.92%), whereas the same percentage according to the combination of SignalP predictors ranges from 3 to 7%. According to our results, the 15 archaeal genomes belonging to Crenarchaeota do not differ significantly from the 32 genomes belonging to Euryarchaeota (8.54 versus 9.16%, P-value = 0.406 according to a t-test) in the proportion of proteins predicted to contain an SP. The only representative of Nanoarchaeota (Nanoarchaeum equitans) contains a comparable proportion of secreted proteins (7.09%), although in a considerably smaller genome (38 out of the 536 total coding sequences). In an ANOVA, psychrophiles, mesophiles, thermophiles and hyperthermophiles did not differ significantly in the proportion of proteins carrying an SP (range from 8.2 to 10.7%, P-value = 0.087). Only the six thermoacidophiles showed a smaller proportion (6.58%), whereas one haloalkalophile (13.8%) and the three halophiles (12.53%) showed larger proportions. Examination of the amino acid distributions of the SPs of all the groups using sequence logos did not reveal any obvious discrepancies (data not shown). The only detectable difference was the over-representation of alanine and glycine and the under-representation of isoleucine in the h-region of SPs of halophiles and haloalkalophiles. These results need further study, but the large proportion of secreted proteins, as well as the abundance of glycine and alanine, which suggests a lower hydrophobicity in the h-region of SPs of halophiles, is most plausibly attributed to the extensive use of the Tat pathway.
PRED-SIGNAL does not discriminate Tat from Sec SPs, and we expect many of the secreted proteins of halophiles to contain a Tat SP (Rose et al., 2002).

Among the proteins predicted by the combination of the HMM versions of SignalP, only 685 were not predicted by our predictor, and among the proteins predicted by the combination of the NN versions of SignalP, 749 were not predicted as having an SP by PRED-SIGNAL. Thus, the HMM method developed here recovers most of the putative SPs that are considered highly probable (as judged by the stringent criteria applied by the combination of the SignalP predictors). On the other hand, PRED-SIGNAL predicts an additional large number of proteins that were selected by only one or two modules of SignalP, and a remarkably large number of proteins that were not selected by any of the versions of SignalP (1039 for the HMM versions and 1139 for the NN versions). This highlights that, although the stringent criteria applied by combining the different predictors of SignalP can indeed select a large number of archaeal SPs sharing common features with bacterial and eukaryotic SPs, an additional large number of putative SPs exist that possess some unique features not present in SPs of eukaryotic or bacterial origin. As expected from the analysis of the training set, among the individual SignalP-NN modules the largest agreement with PRED-SIGNAL is shown by the gram-positive module (correlation coefficient = 0.646), followed by the gram-negative and eukaryotic modules. Similar, although not identical, results also hold for the SignalP-HMM predictors (data not shown).


    Conclusions
 
In this work, we present the first computational method that specifically predicts SPs of archaeal origin and their cleavage sites. We performed an extensive literature search in order to identify SPs with experimentally verified cleavage sites, as well as verified SPs in which the cleavage site is not precisely located. The analysis confirms previous results that suggested a unique composition of archaeal SPs and justifies our approach of modeling these sequences separately. We used an HMM approach, and trained the model to discriminate secretory SPs from cytoplasmic proteins as well as from proteins with an N-terminal TM segment, as these segments are often confused by predictors. The prediction method was also applied to the currently available completely sequenced genomes of archaea, and the results were compared with those of SignalP, which is considered to be the most accurate predictor for non-archaeal sequences. The new prediction method, PRED-SIGNAL, and the secreted proteins identified in the genome analysis are available online at: http://bioinformatics.biol.uoa.gr/PRED-SIGNAL/. We anticipate that this method will be a useful tool for those studying secreted proteins of archaea, since it could be used in genome annotation, genome-wide analyses and various proteomics applications. Finally, we note that the modular nature of the HMM makes the model easy to extend, e.g. in order to incorporate joint prediction of Tat SPs or lipoprotein SPs. In our data set we have included 18 Tat substrates, and we found no more than 10 archaeal lipoproteins. However, when further experimental data on these classes of SPs become available, the model's architecture could easily be expanded to include them and allow better discrimination capability.


    Funding
 
P.G.B. was supported by a scholarship from the State Scholarships Foundation of Greece (SSF), for post-doctoral research in the Department of Cell Biology and Biophysics of the University of Athens (Machine Learning Algorithms for Bioinformatics).


    Footnotes
 
Edited by Todd Yeates


    Acknowledgements
 
The authors would like to thank the two anonymous reviewers and the editors for their very helpful comments and the constructive criticism that helped in the improvement of the manuscript.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 Conclusions
 Funding
 Acknowledgements
 References
 
Akca E., Claus H., Schultz N., Karbach G., Schlott B., Debaerdemaeker T., Declercq J.P., Konig H. Extremophiles (2002) 6:351–358.[CrossRef][Medline]

Alber B.E., Ferry J.G. Proc. Natl Acad. Sci. USA (1994) 91:6909–6913.[Abstract/Free Full Text]

Albers S.V., Driessen A.M. Arch. Microbiol. (2002) 177:209–216.[CrossRef][Web of Science][Medline]

Bagos P.G., Liakopoulos T.D., Hamodrakas S.J. BMC Bioinformatics (2006) 7:189.[CrossRef][Medline]

Baldi P., Brunak S., Chauvin Y., Andersen C.A., Nielsen H. Bioinformatics (2000) 16:412–424.[Abstract/Free Full Text]

Bardy S.L., Eichler J., Jarrell K.F. Protein Sci. (2003) 12:1833–1843.[CrossRef][Web of Science][Medline]

Bauer M.W., Driskill L.E., Callen W., Snead M.A., Mathur E.J., Kelly R.M. J. Bacteriol. (1999) 181:284–290.[Abstract/Free Full Text]

Bendtsen J.D., Nielsen H., von Heijne G., Brunak S. J. Mol. Biol. (2004) 340:783–795.[CrossRef][Web of Science][Medline]

Bendtsen J.D., Nielsen H., Widdick D., Palmer T., Brunak S. BMC Bioinformatics (2005) 6:167.[CrossRef][Medline]

Berks B.C., Palmer T., Sargent F. Curr. Opin. Microbiol. (2005) 8:174–181.[CrossRef][Web of Science][Medline]

Brockl G., Behr M., Fabry S., Hensel R., Kaudewitz H., Biendl E., Konig H. Eur. J. Biochem. (1991) 199:147–152.[Web of Science][Medline]

Brown S.H., Kelly R.M. Appl. Environ. Microbiol. (1993) 59:2614–2621.[Abstract/Free Full Text]

Bult C.J., et al. Science (1996) 273:1058–1073.[Abstract]

Catara G., Ruggiero G., La Cara F., Digilio F.A., Capasso A., Rossi M. Extremophiles (2003) 7:391–399.[CrossRef][Medline]

Cheung J., Danna K.J., O’Connor E.M., Price L.B., Shand R.F. J Bacteriol. (1997) 179:548–551.[Abstract/Free Full Text]

Chong P.K., Wright P.C. J. Proteome Res. (2005) 4:1789–1798.[CrossRef][Web of Science][Medline]

Choo K.H., Tan T.W., Ranganathan S. BMC Bioinformatics (2005) 6:249.[CrossRef][Medline]

Cohen G.N., et al. Mol. Microbiol. (2003) 47:1495–1512.[CrossRef][Web of Science][Medline]

Comfort D.A., Chou C.J., Conners S.B., VanFossen A.L., Kelly R.M. Appl. Environ. Microbiol. (2008) 74:1281–1283.[Abstract/Free Full Text]

Crooks G.E., Hon G., Chandonia J.M., Brenner S.E. Genome Res. (2004) 14:1188–1190.[Abstract/Free Full Text]

Dharmavaram R., Gillevet P., Konisky J. J. Bacteriol. (1991) 173:2131–2133.[Abstract/Free Full Text]

Dilks K., Gimenez M.I., Pohlschroder M. J. Bacteriol. (2005) 187:8104–8113.[Abstract/Free Full Text]

Driessen A.J., Nouwen N. Annu. Rev. Biochem (2008) 77:643–667.[CrossRef][Web of Science][Medline]

Duffner F., Bertoldo C., Andersen J.T., Wagner K., Antranikian G. J. Bacteriol. (2000) 182:6331–6338.[Abstract/Free Full Text]

Durbin R., Eddy S.R., Krogh A., Mithison G. Biological Sequence Analysis (1998) Cambridge University Press.

Erra-Pujada M., Debeire P., Duchiron F., O’Donohue M.J. J. Bacteriol. (1999) 181:3284–3287.[Abstract/Free Full Text]

Fariselli P., Martelli P.L., Casadio R. BMC Bioinformatics (2005) 6(Suppl. 4):S12.

Ferrer M., Golyshina O.V., Plou F.J., Timmis K.N., Golyshin P.N. Biochem. J. (2005) 391:269–276.[CrossRef][Web of Science][Medline]

Gimenez M.I., Dilks K., Pohlschroder M. Mol. Microbiol. (2007) 66:1597–1606.[Web of Science][Medline]

Goldman S., Hecht K., Eisenberg H., Mevarech M. J. Bacteriol. (1990) 172:7065–7070.[Abstract/Free Full Text]

Habib S.J., Neupert W., Rapaport D. Methods Cell Biol. (2007) 80:761–781.[CrossRef][Web of Science][Medline]

Hashimoto Y., Yamamoto T., Fujiwara S., Takagi M., Imanaka T. J. Bacteriol. (2001) 183:5050–5057.[Abstract/Free Full Text]

Hiller K., Grote A., Scheer M., Munch R., Jahn D. Nucleic Acids Res. (2004) 32:W375–W379.[Abstract/Free Full Text]

Hutcheon G.W., Vasisht N., Bolhuis A. Extremophiles (2005) 9:487–495.[CrossRef][Medline]

Izotova L.S., Strongin A.Y., Chekulaeva L.N., Sterkin V.E., Ostoslavskaya V.I., Lyublinskaya L.A., Timokhina E.A., Stepanov V.M. J. Bacteriol. (1983) 155:826–830.[Abstract/Free Full Text]

Jones R.A., Jermiin L.S., Easteal S., Patel B.K., Beacham I.R. J. Appl. Microbiol. (1999) 86:93–107.[CrossRef][Medline]

Juncker A.S., Willenbrock H., Von Heijne G., Brunak S., Nielsen H., Krogh A. Protein Sci. (2003) 12:1652–1662.[CrossRef][Web of Science][Medline]

Kall L., Krogh A., Sonnhammer E.L. J. Mol. Biol. (2004) 338:1027–1036.[CrossRef][Web of Science][Medline]

Kall L., Krogh A., Sonnhammer E.L. Bioinformatics (2005) 21(Suppl. 1):i251–i257.[Abstract]

Kall L., Krogh A., Sonnhammer E.L. Nucleic Acids Res. (2007) 35:W429–W432.[Abstract/Free Full Text]

Kamekura M., Seno Y., Holmes M.L., Dyall-Smith M.L. J. Bacteriol. (1992) 174:736–742.[Abstract/Free Full Text]

Kamekura M., Seno Y., Dyall-Smith M. Biochim. Biophys. Acta (1996) 1294:159–167.[CrossRef][Medline]

Kannan Y., Koga Y., Inoue Y., Haruki M., Takagi M., Imanaka T., Morikawa M., Kanaya S. Appl. Environ. Microbiol. (2001) 67:2445–2452.[Abstract/Free Full Text]

Kashima Y., Mori K., Fukada H., Ishikawa K. Extremophiles (2005) 9:37–43.[CrossRef][Medline]

Kawarabayasi Y., et al. DNA Res. (1998) 5:55–76.[Abstract]

Kawarabayasi Y., et al. DNA Res. (2001) 8:123–140.[Abstract]

Kim B.K., Pihl T.D., Reeve J.N., Daniels L. J. Bacteriol. (1995) 177:7178–7185.[Abstract/Free Full Text]

Krogh A. Proceedings of the12th IAPR International Conference on Pattern Recognition (1994) 140–144.

Krogh A., Larsson B., von Heijne G., Sonnhammer E.L. J. Mol. Biol. (2001) 305:567–580.[CrossRef][Web of Science][Medline]

Lechner J., Sumper M. J. Biol. Chem. (1987) 262:9724–9729.[Abstract/Free Full Text]

Lee P.A., Tullman-Ercek D., Georgiou G. Annu. Rev. Microbiol. (2006) 60:373–395.[CrossRef][Web of Science][Medline]

Leveque E., Haye B., Belarbi A. FEMS Microbiol. Lett. (2000) 186:67–71.[Web of Science][Medline]

Lim J.K., Lee H.S., Kim Y.J., Bae S.S., Jeon J.H., Kang S.G., Lee J.H. J. Microbiol. Biotechnol. (2007) 17:1242–1248.[Web of Science][Medline]

Limauro D., Cannio R., Fiorentino G., Rossi M., Bartolucci S. Extremophiles (2001) 5:213–219.[CrossRef][Medline]

Lin X., Tang J. J. Biol. Chem. (1990) 265:1490–1495.[Abstract/Free Full Text]

Madan Babu M., Sankaran K. Bioinformatics (2002) 18:641–643.[Abstract/Free Full Text]

Madan Babu M., Priya M.L., Selvan A.T., Madera M., Gough J., Aravind L., Sankaran K. J. Bacteriol. (2006) 188:2761–2773.[Abstract/Free Full Text]

Mander G.J., Duin E.C., Linder D., Stetter K.O., Hedderich R. Eur. J. Biochem. (2002) 269:1895–1904.[Web of Science][Medline]

Mattar S., Scharf B., Kent S.B., Rodewald K., Oesterhelt D., Engelhard M. J. Biol. Chem. (1994) 269:14939–14945.[Abstract/Free Full Text]

Melen K., Krogh A., von Heijne G. J. Mol. Biol. (2003) 327:735–744.[CrossRef][Web of Science][Medline]

Menne K.M., Hermjakob H., Apweiler R. Bioinformatics (2000) 16:741–742.[Abstract/Free Full Text]

Morikawa M., Izawa Y., Rashid N., Hoaki T., Imanaka T. Appl. Environ. Microbiol. (1994) 60:4559–4566.[Abstract/Free Full Text]

Nielsen H., Krogh A. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1998) 6:122–130.[Medline]

Nielsen H., Engelbrecht J., Brunak S., von Heijne G. Protein Eng. (1997) 10:1–6.[Abstract/Free Full Text]

Nielsen H., Brunak S., von Heijne G. Protein Eng. (1999) 12:3–9.[Abstract/Free Full Text]

Palmieri G., Casbarra A., Fiume I., Catara G., Capasso A., Marino G., Onesti S., Rossi M. Extremophiles (2006) 10:393–402.[CrossRef][Medline]

Perez-Pomares F., Bautista V., Ferrer J., Pire C., Marhuenda-Egea F.C., Bonete M.J. Extremophiles (2003) 7:299–306.[CrossRef][Medline]

Pohlschroder M., Gimenez M.I., Jarrell K.F. Curr. Opin. Microbiol. (2005) 8:713–719.[Web of Science][Medline]

Rapoport T.A., Matlack K.E., Plath K., Misselwitz B., Staeck O. Biol. Chem. (1999) 380:1143–1150.[CrossRef][Web of Science][Medline]

Rose R.W., Bruser T., Kissinger J.C., Pohlschroder M. Mol. Microbiol. (2002) 45:943–950.[CrossRef][Web of Science][Medline]

Ruiz D.M., De Castro R.E. J. Ind. Microbiol. Biotechnol. (2007) 34:111–115.[CrossRef][Web of Science][Medline]

Sako Y., Croocker P.C., Ishida Y. FEBS Lett. (1997) 415:329–334.[CrossRef][Web of Science][Medline]

Sankaran K., Wu H.C. J. Biol. Chem. (1994) 269:19701–19706.[Abstract/Free Full Text]

Sankaran K., Wu H.C. Methods Enzymol. (1995) 248:169–180.[Web of Science][Medline]

Sankaran K., Gupta S.D., Wu H.C. Methods Enzymol. (1995) 250:683–697.[Web of Science][Medline]

Saunders N.F., Ng C., Raftery M., Guilhaus M., Goodchild A., Cavicchioli R. J. Proteome Res. (2006) 5:2457–2464.[CrossRef][Web of Science][Medline]

Schneider T.D., Stephens R.M. Nucleic Acids Res. (1990) 18:6097–6100.[Abstract/Free Full Text]

Serour E., Antranikian G. Antonie Van Leeuwenhoek (2002) 81:73–83.[CrossRef][Web of Science][Medline]

Setubal J.C., Reis M., Matsunaga J., Haake D.A. Microbiology (2006) 152:113–121.[Abstract/Free Full Text]

She Q., et al. Proc. Natl Acad. Sci. USA (2001) 98:7835–7840.[Abstract/Free Full Text]

Shi W., Tang X.F., Huang Y., Gan F., Tang B., Shen P. Extremophiles (2006) 10:599–606.[CrossRef][Medline]

Sumper M., Berg E., Mengele R., Strobel I. J. Bacteriol. (1990) 172:7111–7118.[Abstract/Free Full Text]

Sun C., Li Y., Mei S., Lu Q., Zhou L., Xiang H. Mol. Microbiol. (2005) 57:537–549.[CrossRef][Web of Science][Medline]

Sutcliffe I.C., Harrington D.J. Microbiology (2002) 148:2065–2077.[Abstract/Free Full Text]

Tanaka T., Fujiwara S., Nishikori S., Fukui T., Takagi M., Imanaka T. Appl. Environ. Microbiol. (1999) 65:5338–5344.[Abstract/Free Full Text]

Teter S.A., Klionsky D.J. Trends Cell Biol. (1999) 9:428–431.[CrossRef][Web of Science][Medline]

Thomas J.R., Bolhuis A. FEMS Microbiol. Lett. (2006) 256:44–49.[CrossRef][Web of Science][Medline]

Tuteja R. Arch Biochem. Biophys. (2005) 441:107–111.[CrossRef][Web of Science][Medline]

Valente F.M., Pereira P.M., Venceslau S.S., Regalla M., Coelho A.V., Pereira I.A. FEBS Lett. (2007) 581:3341–3344.[CrossRef][Web of Science][Medline]

van Roosmalen M.L., Geukens N., Jongbloed J.D., Tjalsma H., Dubois J.Y., Bron S., van Dijl J.M., Anne J. Biochim. Biophys. Acta (2004) 1694:279–297.[Medline]

von Heijne G. Nucleic Acids Res. (1986) 14:4683–4690.[Abstract/Free Full Text]

von Heijne G. Protein Eng. (1989) 2:531–534.[Abstract/Free Full Text]

von Heijne G. J. Membr. Biol. (1990) 115:195–201.[CrossRef][Web of Science][Medline]

von Heijne G., Steppuhn J., Herrmann R.G. Eur. J. Biochem. (1989) 180:535–545.[Web of Science][Medline]

Voorhorst W.G., Eggen R.I., Geerling A.C., Platteeuw C., Siezen R.J., Vos W.M. J. Biol. Chem. (1996) 271:20426–20431.[Abstract/Free Full Text]

Voorhorst W.G., Warner A., de Vos W.M., Siezen R.J. Protein Eng. (1997) 10:905–914.[Abstract/Free Full Text]

Wakai H., Nakamura S., Kawasaki H., Takada K., Mizutani S., Aono R., Horikoshi K. Extremophiles (1997) 1:29–35.[CrossRef][Medline]

Wang L., Zhou Q., Chen H., Chu Z., Lu J., Zhang Y., Yang S. J. Ind. Microbiol. Biotechnol. (2007) 34:187–192.[CrossRef][Web of Science][Medline]

Woodson J.D., Reynolds A.A., Escalante-Semerena J.C. J. Bacteriol. (2005) 187:5901–5909.[Abstract/Free Full Text]

Wu C.H., et al. Nucleic Acids Res. (2006) 34:D187–D191.[Abstract/Free Full Text]

Received May 17, 2008; revised September 30, 2008; accepted October 9, 2008.




This article has been cited by other articles:

S. Y. M. Ng, D. J. VanDyke, B. Chaban, J. Wu, Y. Nosaka, S.-I. Aizawa, and K. F. Jarrell
Different Minimal Signal Peptide Lengths Recognized by the Archaeal Prepilin-Like Peptidases FlaK and PibD
J. Bacteriol., November 1, 2009; 191(21): 6732-6740.

