Protein Engineering, Vol. 15, No. 5, 347-352,
May 2002
© 2002 Oxford University Press
Hidden Markov Models-based system (HMMSPECTR) for detecting structural homologies on the basis of sequential information
1 Department of Chemistry and Biochemistry 0654 and 2 San Diego Supercomputer Center 0505, 3 Department of Pharmacology, University of California, San Diego, La Jolla, CA 92093, USA
| Abstract |
|---|
HMMSPECTR is a tool for finding putative structural homologs for proteins with known primary sequences. HMMSPECTR contains four major components: a data warehouse with the hidden Markov model (HMM) and alignment libraries; a search program that compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of 10–15 `best' proteins from the chosen HMMs. The data warehouse contains four libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HMMER program. The third library contains parts (`partial HMM') of initial alignments. The fourth library contains trained HMMs. We tested our program against all of the protein targets proposed in the CASP4 competition. The data warehouse included libraries of structural alignments and HMMs constructed on the basis of proteins publicly available in the Protein Data Bank before the CASP4 meeting. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases. The improvement is most notable for the targets with complexity 4 (difficult fold recognition cases).
Keywords: Hidden Markov models/HMM/protein structure prediction/secondary structure/structural alignments/tertiary structure
| Introduction |
|---|
Hidden Markov models (HMMs) have become a regular tool for many tasks in the field of bioinformatics. Profile HMM methods are increasingly used in the area of protein structure prediction. It is known that more than 80% of new protein structures with relatively small sequence similarity to solved structures nevertheless adopt an already known protein fold (Orengo et al., 1994).
This paper describes a program that we have developed, HMMSPECTR, that finds putative structural homologs for proteins with known primary sequences. The foundation of HMMSPECTR is the hypothesis that the structural information in protein sequences can be extracted from structural alignments.
| Methods |
|---|
HMMSPECTR contains four major components: a data warehouse with the HMM and alignment libraries; a search program that compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of 10–15 `best' proteins from the chosen HMMs (Figure 1).
Data warehouse construction
The goal of the data warehouse construction was to cover the majority of known protein structures. The data warehouse of fold superfamilies was constructed using the SCOP fold classification (Murzin et al., 1995). We created and trained a set of HMMs using the program HMMER (Eddy, 1998). From each fold of SCOP we selected a typical representative: a title protein. The CE program (Shindyalov and Bourne, 1998) was used to create structural alignments of proteins that have tertiary structures close to the title protein. We considered proteins structurally close if the Z-score reported by CE was >4. Multiple pairwise alignments constructed using CE contained 50–800 proteins. The number of proteins in each alignment depends on the Z-score chosen to limit similarity of these proteins to the title protein. Constructed alignments had to include all members of the selected SCOP superfamily. In many cases we chose relatively loose Z-score cut-offs (<4) to obtain multiple alignments that included some proteins with structures sufficiently close to the title protein but not included in this SCOP superfamily or family (Figure 2). This feature was introduced to build statistically rich HMMs. Use of a narrow set of proteins structurally very close to each other in the initial HMM restricts a major advantage of the HMM approach, i.e. estimation of probabilities of transitions between neighboring amino acids. Our goal thus became finding representatives of a specific fold or set of folds instead of finding representatives of a specific family of proteins (Figure 2). We created the HMM corresponding to each set of structural alignments.
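The membership rule described above can be sketched as a simple threshold filter. This is only an illustration, not the HMMSPECTR code: the `CEHit` record, the function name and the protein identifiers in the usage note are hypothetical, and the CE Z-scores are assumed to be already available as numbers.

```python
from typing import NamedTuple


class CEHit(NamedTuple):
    """One pairwise CE comparison against the title protein (hypothetical record)."""
    protein_id: str
    z_score: float


def select_alignment_members(hits, z_cutoff=4.0):
    """Keep hits whose CE Z-score exceeds the cut-off, best first.

    A cut-off of 4.0 mirrors the 'structurally close' criterion in the text;
    passing a lower value mimics the looser cut-offs used to enrich an alignment.
    """
    kept = [hit for hit in hits if hit.z_score > z_cutoff]
    return sorted(kept, key=lambda hit: hit.z_score, reverse=True)
```

For example, with hypothetical hits scoring 6.2, 3.9 and 4.5, the default cut-off keeps the first and third, while `z_cutoff=3.5` keeps all three.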
We created structural alignments having as a core each superfamily for the main classes of folds: all alpha proteins (α), all beta proteins (β), alpha and beta proteins (α/β), alpha and beta proteins (α + β), multi-domain proteins (α and β), coiled coil proteins and `small proteins'. We also created alignments for specific families, including EF hand-like (α), PHGase F-like (β), supersandwich (β), NAD(P)-binding Rossmann-fold domains (α/β), thioredoxin fold (α/β), pyruvate–ferredoxin oxidoreductase (PFOR) domain III (α/β), IL8-like (α + β) and zincin-like (α + β). For the globin-like fold (α) we created alignments for all protein domains in two families, globins and phycocyanins. This level of detail was needed to cover all SCOP proteins of specific subdivisions by alignments. The number of structural alignments created was 1500.
We created three libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HMMER package and the third library contained parts (`partial HMM') of initial alignments. The first library included variants of HMM preparation with different gap-filter values from 0.1 to 0.9. The second library contained trained HMMs. The cyclic HMM training was done by using the initial HMM to create the next multiple alignment, which in turn was used to prepare the HMM for the next step (Tsigelny et al., 2000). This procedure converged in 3–5 cycles. The search procedure then selected HMMs for which the score for specific target sequences grew during the training. In many cases, training increased the score with which these HMMs made specific predictions. The third library consisted of `partial HMMs', based on the observation of significant discontinuities in both the CE Z-score and the HMM scores for members of a family. `Partial HMMs' were obtained by splitting the family at the discontinuities.
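The cyclic training described above alternates between building an HMM and realigning with it. The following is a minimal sketch under stated assumptions: `build_hmm` and `realign` are hypothetical callables standing in for the actual HMM construction and alignment steps, convergence is taken to mean that the alignment no longer changes, and `score_grew` reads "grew during the training" as "the final score exceeds the initial one".

```python
def train_cyclically(alignment, build_hmm, realign, max_cycles=5):
    """Alternate HMM building and realignment until the alignment stops changing.

    build_hmm(alignment) -> hmm and realign(hmm) -> alignment are supplied by
    the caller (hypothetical stand-ins). Returns every HMM produced, in order.
    """
    hmm = build_hmm(alignment)
    history = [hmm]
    for _ in range(max_cycles):
        new_alignment = realign(hmm)
        if new_alignment == alignment:  # assumed convergence criterion
            break
        alignment = new_alignment
        hmm = build_hmm(alignment)
        history.append(hmm)
    return history


def score_grew(scores):
    """True if a target's score at the end of training exceeds its initial score."""
    return len(scores) > 1 and scores[-1] > scores[0]
```

Scoring a target sequence against each HMM in `history` and passing the resulting list to `score_grew` gives one way to decide whether a trained HMM should be preferred for that target.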
Search procedure
Each HMM from the data warehouse (including trained and untrained HMMs) is tested for concordance with the probe sequence. If the system is not able to pick one with a reasonably high score even using trained HMMs, it shifts to the search of partial HMMs. Eventually it stops when the highest score is found.
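The fallback logic of this search can be sketched as follows. The function name, the `min_score` threshold (the text says only `reasonably high') and the library representation are assumptions made for illustration; `score` is any callable that returns the score of an HMM against the probe sequence.

```python
def search_libraries(score, main_library, partial_library, min_score):
    """Return (best_hmm, source), falling back to partial HMMs when needed.

    main_library holds the untrained and trained HMMs and partial_library the
    partial HMMs. min_score is a hypothetical stand-in for the 'reasonably
    high score' criterion.
    """
    best_main = max(main_library, key=score, default=None)
    if best_main is not None and score(best_main) >= min_score:
        return best_main, "main"
    # No convincing hit: widen the search to the partial HMMs and take the
    # highest-scoring model overall.
    pool = [(hmm, "main") for hmm in main_library]
    pool += [(hmm, "partial") for hmm in partial_library]
    return max(pool, key=lambda item: score(item[0]), default=(None, "none"))
```

With a dictionary of precomputed scores, `score` can simply be the dictionary's `get` method.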
To select the best final solution we compare the secondary structure of the best 10 candidates, extracted from the DSSP library (Kabsch and Sander, 1983), with the predicted secondary structure of the target protein. For secondary structure prediction we used a new method based on pattern recognition techniques (in preparation).
Table I illustrates the effectiveness of our HMM training procedures on CASP4 protein targets T0109, T0100 and T0087. The training procedures significantly increase the scores and, even more important, the length of the predicted protein structures. Note that training does not improve the scores and lengths in all cases; in a number of cases we see no improvement. This usually means that the initial HMM was prepared properly and does not need further correction.
| Results and discussion |
|---|
We tested our program against all of the protein targets proposed in the CASP4 competition (CASP 4, 2000).
The following changes were made to the program after the CASP4 meeting:
- Fundamental changes were made to the `partial HMM' library construction. This library is used only when the other prediction scores are weak. In the construction of `partial HMMs' we used sorting by `HMM-score' (alignment score between the consensus sequence of an HMM and each of the proteins in the alignment) instead of our previous initial sorting by CE Z-score (every protein versus the superfamily representative `title protein'). The family is partitioned at sharp changes of HMM-scores. The range of CE Z-scores (3–7) is much less reliable for finding such changes than that of HMM-scores, which have much broader boundaries, say from -200 to +200.
- The number of g-filters used was increased from the set of 0.4, 0.5, 0.6 to 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9. The results derived using all these filters are now stored in the data warehouse.
- We improved our scoring function by introducing a dependence on the length of the predicted protein sequence.
- Secondary structure prediction was introduced into the program. Currently the final score of the predicted protein structure is defined by both the HMM-score and the secondary structure correspondence score. Reliability of the secondary structure prediction is also incorporated in the score function.
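The partitioning of a family at sharp changes of HMM-score, mentioned in the first item above, can be illustrated with a short sketch. It assumes that a `sharp change' means a drop between consecutive sorted scores of at least `min_gap`; that threshold, the function name and the member labels are hypothetical, since the text does not state how a sharp change is detected.

```python
def split_at_discontinuities(members, min_gap):
    """Split (name, hmm_score) pairs into groups at large score drops.

    Members are sorted by HMM-score, highest first. A new group starts
    whenever the drop from the previous member is at least min_gap.
    """
    ordered = sorted(members, key=lambda member: member[1], reverse=True)
    groups = []
    for name, score in ordered:
        if groups and groups[-1][-1][1] - score < min_gap:
            groups[-1].append((name, score))
        else:
            groups.append([(name, score)])
    return groups
```

Each resulting group would then be a candidate for its own `partial HMM'. Because HMM-scores span a wide range (roughly -200 to +200 in the text), large gaps are easier to separate from noise than in the narrow CE Z-score range.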
The results of our tests on the CASP4 targets are shown in Tables II and III. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases.
Table III
The details of our current protein structure prediction strategy, which uses the HMM score and the Secondary Structure Prediction score, are given using as an example CASP4 target T89 (PDB code 1E4F, cell division protein FtsA from Thermotoga maritima).
Primary HMM search brought the following best results from three libraries:
In the pre-CASP4 period we would just use the best prediction, 1QHA:B(917:80462). In the new version of the program, the g-library with an increased number of filters selected 1HLU:A as the best prediction. Nevertheless, the best prediction by HMM-score would still remain 1QHA:B. Only taking into consideration the Secondary Structure Prediction Score makes it possible to predict the right structure, 1HLU:A, in fully automated mode:
The final prediction is made using the minimum sum of sorting scores of both HMM and SS predictions.
The final prediction of the tertiary structure for target T89 is protein 1HLU_A.
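One way to read `minimum sum of sorting scores' is as a rank-sum rule: each candidate gets its position in the list sorted by HMM score and its position in the list sorted by secondary structure score, and the candidate with the smallest sum wins. The sketch below follows that reading, which is an assumption; the function names and candidate labels are hypothetical.

```python
def rank_positions(scores):
    """Map each candidate to its 1-based position when sorted by score, best first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def pick_by_rank_sum(hmm_scores, ss_scores):
    """Return the candidate with the smallest sum of HMM rank and SS rank.

    Only candidates present in both score tables are considered; ties are
    broken alphabetically to keep the result deterministic.
    """
    hmm_rank = rank_positions(hmm_scores)
    ss_rank = rank_positions(ss_scores)
    common = sorted(hmm_rank.keys() & ss_rank.keys())
    return min(common, key=lambda name: hmm_rank[name] + ss_rank[name])
```

Under this rule a candidate that is second by HMM score but first by secondary structure score can overtake the top HMM hit, which matches the behaviour described for target T89.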
H2M = HMMscore × Lr/Lt for positive HMM scores; H2M = HMMscore × Lt/Lr for negative HMM scores, where Lt = length of target protein sequence. In the case when Lr = Lt, H2M = HMMscore. The secondary structure score is calculated by adding 1 for the first identical letter in the secondary structures of the target and the predicted protein, adding 0.1 for each subsequent uninterrupted identical letter and subtracting 0.1 for each gap. The sum is multiplied by the coefficient of reliability of prediction in each case, Kpr, which has values from 0.1 to 0.9. The resulting score is also multiplied by a coefficient that takes into account the length of the region of correspondence.
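The two scores defined above can be written out as a short sketch. Several details are assumptions rather than statements from the text: Lr is taken to be the length of the matched region of the predicted protein, a gap is taken to be a `-` character in either aligned secondary structure string, a mismatch is assumed to interrupt a run without penalty, and the length coefficient (whose form is not given) is left as a caller-supplied parameter.

```python
def h2m(hmm_score, l_r, l_t):
    """Length-adjusted HMM score: scale by Lr/Lt if non-negative, by Lt/Lr if negative."""
    if hmm_score >= 0:
        return hmm_score * l_r / l_t
    return hmm_score * l_t / l_r


def ss_score(target_ss, predicted_ss, k_pr, length_coeff=1.0, gap="-"):
    """Secondary structure correspondence score for two aligned strings.

    +1.0 for the first identical letter of each run, +0.1 for each further
    uninterrupted identical letter, -0.1 for each gap position. The sum is
    multiplied by the reliability coefficient k_pr (0.1 to 0.9 in the text)
    and by length_coeff (form unspecified in the text, so supplied by caller).
    """
    if len(target_ss) != len(predicted_ss):
        raise ValueError("aligned secondary structure strings must have equal length")
    total = 0.0
    in_run = False
    for t, p in zip(target_ss, predicted_ss):
        if t == gap or p == gap:
            total -= 0.1
            in_run = False
        elif t == p:
            total += 0.1 if in_run else 1.0
            in_run = True
        else:
            in_run = False  # assumed: a mismatch ends the run without penalty
    return total * k_pr * length_coeff
```

As a check on the definition, `h2m` returns the unmodified HMM score whenever the two lengths are equal, as stated in the text.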
HMMSPECTR is successful because it explicitly allows for two of the basic problems of predicting new structures from libraries of known structures. The basic assumptions in this process are that (1) the structures are properly classified and (2) they are properly aligned. When this is the case, the normally prepared HMMs in the data warehouse correctly classify target sequences to the appropriate folds. However, the classification of structures may not be sufficiently detailed to reflect the true sequence to structure code for some folds. In this case, the `partial HMMs' subdivide the classifications based on discontinuities of similarity scores in the originally classified data. This weakens the statistical power of the HMMs, so this method is used only when the other methods have returned ambiguous results. When the structures are correctly classified but there are problems in the structural alignments, the trained HMMs will give better results. The trained HMMs are used to allow some revision of the initial structural alignment, but they are of course statistically biased. The diagnostic feature of the trained HMMs is that if the target sequence follows a similar progression of scores through the HMMs generated during the training process then it may well have similar structural behavior to the title protein despite alignment ambiguities obscuring the signal from the HMMs.
Further exploration of the value of our program was done on complex proteins for which structures are not yet available. Figure 3 shows results obtained using HMMSPECTR for structure prediction of the cystic fibrosis transmembrane conductance regulator (CFTR). The program predicted proteins consistent with the known structural domains of CFTR. Two of the important functional domains of CFTR are NBD-1 and NBD-2 (the first and second nucleotide-binding domains). HMMSPECTR predicted a correspondence between the NBD-1 region of CFTR and the tertiary structure of 2AY5 (aromatic amino acid aminotransferase). Following this unpublished prediction, the structure of part of an ABC transporter protein (the ATP-binding subunit of histidine permease) was solved in the laboratory of Sung-ho Kim at UC Berkeley (Hung et al., 1998). This molecule has significant homology to NBD-1 of CFTR. The structure of the ABC transporter was not present in the PDB and was not used in our preparation of initial HMMs. Nevertheless, when we received it directly from Dr Kim, we constructed on its basis a homology model of NBD-1 of CFTR and then superimposed it on the tertiary structure of 2AY5.
Figure 4
| Notes |
|---|
4 To whom correspondence should be addressed. E-mail: itsigeln{at}ucsd.edu
| References |
|---|
Baldi,P., Chauvin,Y., Hunkapiller,T. and McClure,M.A. (1994) Proc. Natl Acad. Sci. USA, 91, 1059–1063.
CASP 4 (2000) Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction, Asilomar, CA.
Eddy,S. (1998) Bioinformatics, 14, 755–763.
Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Biochem. Biophys. Res. Commun., 231, 760–766.
Hung,L.W., Wang,I.X., Nikaido,K., Liu,P.Q., Ames,G.F. and Kim,S.H. (1998) Nature, 396, 703–707.
Kabsch,W. and Sander,C. (1983) Biopolymers, 12, 2577–2637.
Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846–856.
Laurents,D.V., Subbiah,S. and Levitt,M. (1994) Protein Sci., 11, 1938–1944.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Orengo,C., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.
Tsigelny,I., Shindyalov,I.N., Bourne,P.E., Sudhoff,T.C. and Taylor,P. (2000) Protein Sci., 9, 180–185.
Received July 27, 2001; revised January 2, 2002; accepted February 8, 2002.