Protein Engineering, Vol. 14, No. 4, 227-231,
April 2001
© 2001 Oxford University Press
An approach to improving multiple alignments of protein sequences using predicted secondary structure
1 Discovery Chemistry, SmithKline Beecham Pharmaceuticals, New Frontiers Science Park, Third Avenue, Harlow, Essex CM19 5AW and 3 Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, P.O. Box 123, London WC2A 3PX, UK
| Abstract |
|---|
The object of this work was to improve multiple sequence alignments using public-domain software and methods as far as possible. A method is described in which the secondary structure of proteins is predicted and this information, coupled with a simplified description of the amino acids, is used to produce multiple sequence alignments. This method improved the accuracy of the resulting alignments by between 5 and 14% when compared with full sequence profile alignments (as scored against structural alignments). These improved alignments were then used to predict the secondary structure of the sequences they contain. The resultant predictions were more accurate than those produced from less optimal alignments: an improvement of 6% for a three-state (helix, sheet and coil) prediction was observed when the best alignment from the method presented here was used in place of the alignment obtained using sequence only. The method makes use of public-domain software and all the associated files required to repeat the work are available from the primary author.
Keywords: alignment/predicted/sequence/structure
| Introduction |
|---|
One of the most important techniques in bioinformatics and homology modelling is the alignment of multiple protein sequences. Conserved residues or patterns allow the scientist to infer the structure and/or function of a protein or family of proteins. The importance of the alignment for modelling structures by homology has been exemplified by the results from CASP2 (Marchler-Bauer and Bryant, 1997).
Methods of aligning protein sequences [such as ClustalW (Thompson et al., 1994), the HMMER package (Eddy, http://hmmer.wustl.edu) for Hidden Markov Models (Eddy, 1996) and Psi-Blast (Altschul et al., 1997)] tend to rely upon the amino acid types themselves and do not include other information which may be available. This approach works well when the proteins in question are closely related but breaks down as the sequence similarity decreases. Where the sequence similarities approach the 'twilight' range of below 30% (Rost, 1999) the resulting alignments are generally poor, hence additional information might be expected to aid the alignment process.
One choice of additional information would be secondary structure, as much work has concentrated on designing and improving prediction algorithms (Chou and Fasman, 1974; Garnier et al., 1978; Zvelebil et al., 1987; Biou et al., 1988; Luthy et al., 1991; Rost and Sander, 1993; Mehta et al., 1995; King and Sternberg, 1996; Lemer et al., 1996; Frishman and Argos, 1997). The problem with this choice is that current secondary structure prediction methods, such as PSIPRED (Jones, http://globin.bio.warwick.ac.uk/psipred/) used by Jones at CASP3, appear to peak at around 77% accuracy for a three-state prediction. Three-state accuracy measures how accurately helix, sheet and coil are predicted, expressed as a percentage of the known secondary structure. If the sequence and predicted secondary structure information could be combined, it may be possible to overcome the flaws of one source by augmenting it with information from the other.
It was considered important that the computational tools employed in this work were readily available in the public domain and that the implementation should be within the grasp of scientists in the area.
| Materials and methods |
|---|
Overview
This work required protein superfamilies composed of at least two families, with each family containing more than one member. A superfamily is defined as a set of proteins which are related by homology. A family is defined as a set of proteins which are related by sequence identity to form a distinct group within the protein superfamily. The SCOP database (Murzin et al., 1995) was used to select the superfamilies so that all results could be tested against experimentally determined data. Test sets were chosen from across all four fold classes (as defined by the SCOP database): A, B, A + B, A/B (for the list of those families chosen, see Table I). Each family within a superfamily was aligned using ClustalW and each of these alignments was converted to a form which incorporated predicted secondary structure information. These new alignments were then aligned against one another using a custom matrix and the alignments produced were converted back into the correct amino acid alphabet. The alignments were scored against a structural alignment and also used as input to DSC so that further secondary structure predictions could be made.
Alignment and prediction programs
Secondary structure predictions were carried out using the program DSC (King and Sternberg, 1996), multiple sequence alignments using the program ClustalW (Thompson et al., 1994) and the Hidden Markov Model work using the HMMER2 suite of programs (Eddy, 1998; http://hmmer.wustl.edu).
Alignment benchmarks
Structural alignments of the proteins under examination were generated by the STAMP program (Russell and Barton, 1992). When comparing any of the multiple sequence alignments generated with the structural alignments, only those regions of structural equivalence as identified by STAMP were examined. Many measures of alignment similarity were investigated; the most useful was deemed to be the sum of all correct pairs in the query alignment divided by the maximum possible number of correct pairs (as identified from the STAMP structural alignment).
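The pair-based score described above can be illustrated with a minimal sketch. It assumes that a 'correct pair' means two residues from different sequences placed in the same column in both the query and the reference alignment; the function names and the dictionary-based alignment representation are hypothetical and are not taken from STAMP or ClustalW.

```python
from itertools import combinations


def aligned_pairs(alignment, columns=None):
    """Return the set of residue pairs that share a column.

    `alignment` maps a sequence name to its gapped string ('-' marks a gap).
    Residues are identified by (name, index in the ungapped sequence).
    `columns`, if given, restricts the pairs to those column indices
    (e.g. the structurally equivalent regions of the reference).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    counters = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, counters[name]))
                counters[name] += 1
        if columns is None or col in columns:
            pairs.update(combinations(present, 2))
    return pairs


def pair_score(query, reference, reference_columns=None):
    """Fraction of reference pairs that are reproduced in the query."""
    reference_pairs = aligned_pairs(reference, reference_columns)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(query) & reference_pairs) / len(reference_pairs)


# Toy example (invented sequences, not data from this work).
reference = {"a": "ACD-E", "b": "ACDFE"}
query = {"a": "ACDE-", "b": "ACDFE"}
print(pair_score(query, reference))
```

A score of zero then corresponds to none of the reference pairs being reproduced and a score of one to all of them being reproduced.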
For the purposes of this work, all areas of the secondary structure predictions were examined. The accuracy can be thought of in two ways. If the type of regular secondary structure is ignored and only its position considered, one can measure how well the prediction algorithms detect where there is coil and where there is not (coil being irregular and non-periodic), i.e. two-state accuracy. In addition to this positional score, one may also consider whether the type of periodic/regular secondary structure (i.e. helix or sheet) at these positions of non-coil is predicted correctly (i.e. three-state accuracy).
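The distinction between the two accuracy measures can be made concrete with a short sketch. It assumes per-residue state strings using H (helix), E (sheet) and C (coil), and uses the hypothetical symbol R for any regular (non-coil) state; the function names are illustrative only.

```python
def percent_agreement(predicted, observed):
    """Percentage of positions at which two state strings agree."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def to_two_state(states):
    """Collapse helix (H) and sheet (E) into a single regular state R."""
    return "".join("C" if s == "C" else "R" for s in states)


def three_state_accuracy(predicted, observed):
    """Agreement on helix, sheet and coil."""
    return percent_agreement(predicted, observed)


def two_state_accuracy(predicted, observed):
    """Agreement on coil versus non-coil, ignoring the type of non-coil."""
    return percent_agreement(to_two_state(predicted), to_two_state(observed))


# Toy example (invented strings, not data from this work).
predicted = "HHHHCCEEEC"
observed = "HHHCCCHHEC"
print(three_state_accuracy(predicted, observed))
print(two_state_accuracy(predicted, observed))
```

In this toy case the two-state score is higher than the three-state score because two positions are correctly placed as non-coil but assigned the wrong type.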
Simplification schemes
Simplification schemes have been proposed in the past as a means of grouping together amino acids possessing similar properties. Three simplification schemes were chosen, two from the literature and one a personally devised (PD) scheme.
Scheme 1: Taylor scheme (Taylor, 1986)
AGS, CP, DE, ILV, KMNQRT, FHWY
Scheme 2: Smith scheme (Smith and Smith, 1990)
DE, KRH, NQ, ST, ILV, FWY, C, M, AG, P
Scheme 3: PD scheme
AMLVI (lipophilic), GP (initiating/terminating), HWFY (aromatic), KDRE (charged), QNST (polar), C (disulphide bridge forming)
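The three schemes can be written down as simple lookup tables. The sketch below uses the groupings exactly as listed above; the variable and function names are hypothetical, and the check that every scheme covers the 20 standard amino acids exactly once is an added safeguard rather than part of the published method.

```python
TAYLOR = ["AGS", "CP", "DE", "ILV", "KMNQRT", "FHWY"]
SMITH = ["DE", "KRH", "NQ", "ST", "ILV", "FWY", "C", "M", "AG", "P"]
PD = ["AMLVI", "GP", "HWFY", "KDRE", "QNST", "C"]

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def group_index(scheme):
    """Map each residue letter to the index of its group in `scheme`."""
    lookup = {}
    for index, group in enumerate(scheme):
        for residue in group:
            if residue in lookup:
                raise ValueError(f"{residue} appears in more than one group")
            lookup[residue] = index
    missing = AMINO_ACIDS - set(lookup)
    if missing:
        raise ValueError(f"scheme does not cover: {sorted(missing)}")
    return lookup


def simplify(sequence, scheme):
    """Replace every residue by the index of its simplification group."""
    lookup = group_index(scheme)
    return [lookup[residue] for residue in sequence]


print(simplify("ACDE", TAYLOR))
```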
Comparison with other approaches
To gauge how well the method presented here works against one of the current 'state-of-the-art' methods, the selected superfamilies of proteins were also aligned using Hidden Markov Models. In some preliminary work, the different ways of implementing the HMMER2 package were examined and the method that performed best was chosen for use in all subsequent work. The HMMER2 package is implemented by first training a model on the largest of the family alignments (produced by ClustalW) and then aligning the members of the other, smaller subfamily to this first alignment.
Algorithm and matrices
The Taylor and PD amino acid simplification schemes consist of six groups of amino acids whilst the Smith scheme consists of 10 groups. To remain within the 20x20 matrix which ClustalW uses, any matrices used with the Smith scheme may only consider two structural states for each amino acid grouping. The structural states chosen were regular/periodic (helix or sheet) and irregular/non-periodic (coil). With two states for each of the 10 groups, this gives 20 symbols, which completely fills a 20x20 matrix. Twenty matrices were designed heuristically to consider different weightings of residue type and secondary structure conservation.
Matrices for the two schemes which contain only six groups can employ either a two-state or a three-state description. A three-state scheme (where the structural types are coil, helix and sheet) produces an 18x18 matrix whilst a two-state scheme produces a 12x12 matrix.
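The matrix sizes quoted above follow from the number of (group, state) combinations. The sketch below illustrates one possible encoding; the paper does not state which letters were used for the recoded alphabet, so the assignment of the 20 amino acid letters in alphabetical order is purely an assumption, as are the function names and the symbol R for the regular state.

```python
AA_LETTERS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 letters a ClustalW matrix can index
TAYLOR = ["AGS", "CP", "DE", "ILV", "KMNQRT", "FHWY"]


def build_code(n_groups, states):
    """Assign one letter to every (group index, structural state) pair.

    The letter order is an arbitrary choice made for this sketch.
    """
    needed = n_groups * len(states)
    if needed > len(AA_LETTERS):
        raise ValueError(f"{needed} symbols do not fit in a 20x20 matrix")
    code = {}
    for group in range(n_groups):
        for offset, state in enumerate(states):
            code[(group, state)] = AA_LETTERS[group * len(states) + offset]
    return code


def encode(sequence, structure, scheme, states):
    """Recode a sequence using its group membership and predicted states."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    code = build_code(len(scheme), states)
    group_of = {res: i for i, group in enumerate(scheme) for res in group}
    return "".join(code[(group_of[r], s)] for r, s in zip(sequence, structure))


print(len(build_code(10, "RC")))   # Smith, two-state
print(len(build_code(6, "HEC")))   # Taylor or PD, three-state
print(len(build_code(6, "RC")))    # Taylor or PD, two-state
print(encode("AKV", "HHC", TAYLOR, "HEC"))
```

A three-state description of the 10 Smith groups would need 30 symbols, which is why `build_code(10, "HEC")` raises an error in this sketch.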
The matrices developed are depicted in Figure 1, which represents matrices for secondary structure matching and Taylor simplification group matching and how they lead to the final matrix for this work. Each amino acid in a sequence can have two or three states depending upon which choice of secondary structure representation we have chosen. By varying the scores for the two-state or three-state matrix we can favour one match of secondary structure over another. The three states are labelled H (helix), E (sheet) and C (coil). Similarly, for the groups of amino acids described by the Taylor paper we can favour one group over another by varying the values in the matrix. Both this matrix and the secondary structure matrix can be made to favour exact matches by making the leading diagonal values higher than off-diagonal values. By combining these two matrices we arrive at a complete matrix for this work as shown in the figure: the matrices are simply combined such that the score for any group match is affected by a score related to the type of secondary structure present. The figure shows the matrix for a Taylor three-state approach. Each Taylor group can be in one of three states and so by weighting the elements in this matrix we can favour secondary structure matches, amino acid group matches or both. Again, the leading diagonal controls the scores for an exact match of residues. By varying the combinations of these high scoring elements very diverse matrices can be constructed which favour different matches during alignment.
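The way a group matrix and a secondary structure matrix might be merged can be sketched as follows. The text does not specify the combination rule, so the additive form with a weighting factor used here is an assumption, and the scores (4, 0, 2 and -1) are invented for illustration; none of them is taken from the 20 matrices used in this work.

```python
def diagonal_matrix(size, match, mismatch):
    """Square matrix with `match` on the leading diagonal, `mismatch` elsewhere."""
    return [[match if i == j else mismatch for j in range(size)]
            for i in range(size)]


def combined_matrix(group_scores, state_scores, state_weight=1):
    """Combine a group matrix and a state matrix into one larger matrix.

    Row/column index = group_index * number_of_states + state_index.
    Assumed rule: score = group score + state_weight * state score.
    """
    n_groups = len(group_scores)
    n_states = len(state_scores)
    size = n_groups * n_states
    matrix = [[0] * size for _ in range(size)]
    for g1 in range(n_groups):
        for s1 in range(n_states):
            for g2 in range(n_groups):
                for s2 in range(n_states):
                    matrix[g1 * n_states + s1][g2 * n_states + s2] = (
                        group_scores[g1][g2]
                        + state_weight * state_scores[s1][s2]
                    )
    return matrix


# Six Taylor groups and three states (H, E, C) give an 18x18 matrix.
groups = diagonal_matrix(6, 4, 0)
states = diagonal_matrix(3, 2, -1)
matrix = combined_matrix(groups, states)
print(len(matrix), len(matrix[0]))
print(matrix[0][0], matrix[0][1], matrix[0][3], matrix[0][4])
```

Raising `state_weight` favours secondary structure matches over group matches, whereas raising the diagonal of `groups` does the opposite, which mirrors the kind of weighting described above.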
Overview of the process
- Identify a suitable protein superfamily. At least two families must be present within the superfamily, and each family must have more than one member to be considered.
- Calculate identities between members of a family to identify those proteins that are related to the others in the family by more than 30% (see Introduction for an explanation of this step). Use ClustalW to calculate all identities.
- Align sequences of family members having >30% identity. Use ClustalW in default mode with the Gonnet series of matrices (Gonnet et al., 1992) to produce a profile alignment for each family. The actual matrix used depends upon how similar the sequences to be aligned at this step are and is automatically determined by the program.
- Align remaining sequences of families having <30% identity to the family profile alignments in an iterative manner. The sequence with the next highest identity is added until all have been aligned. This step is also performed using ClustalW, the sequences being added to the alignment produced by the previous step.
- Predict secondary structure for each protein using only its amino acid sequence and not an alignment. These predictions are produced using the program DSC.
- Convert the family profile alignments to a form that includes predicted secondary structure information. This is accomplished using predicted secondary structure and one of the amino acid simplification schemes considered in this work (e.g. Taylor, Smith).
- Align the two family profile alignments using the matrices designed for the simplification scheme chosen to produce a superfamily alignment.
- Convert the superfamily alignment back to full sequence.
- Compare the full sequence superfamily alignment with the structural alignment.
- Align sequences using HMMER2 and compare with the structural alignment using the criteria described in this paper.
- Align sequences using the program ClustalW and the Gonnet matrices, then compare with the structural alignment.
- Align the profile alignments using the program ClustalW and the Gonnet matrices, then compare with the structural alignment.
- Cross-validate the data produced. The methods examined here were applied to 13 of the 14 protein families under consideration and the results obtained were examined to identify the method which performs best.
- Predict the secondary structure using the full sequence family alignment and compare with the experimentally determined structure.
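The step of converting the superfamily alignment back to full sequence can be sketched as a simple re-threading of the original residues onto the gap pattern of the recoded alignment. This assumes that recoding is one symbol per residue and that gaps are written as '-'; the function name and the example strings are hypothetical.

```python
def restore_sequence(encoded_aligned, original):
    """Thread the original residues onto the gap pattern of a recoded row.

    `encoded_aligned` is one row of the recoded alignment ('-' marks a gap);
    `original` is the ungapped amino acid sequence for the same protein.
    """
    n_residues = sum(1 for symbol in encoded_aligned if symbol != "-")
    if n_residues != len(original):
        raise ValueError("alignment row and sequence differ in residue count")
    residues = iter(original)
    return "".join("-" if symbol == "-" else next(residues)
                   for symbol in encoded_aligned)


def restore_alignment(encoded_rows, originals):
    """Apply `restore_sequence` to every named row of an alignment."""
    return {name: restore_sequence(row, originals[name])
            for name, row in encoded_rows.items()}


# Toy example (invented strings, not data from this work).
print(restore_sequence("AP-N", "AKV"))
```

Because only the gap positions are carried over, the restored alignment does not depend on which letters were chosen for the recoded alphabet.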
| Results |
|---|
The results of the cross-validation work show that the two-state secondary structure predictions perform better than the three-state (for the PD and Taylor schemes where both two- and three-state are considered) and that the Smith two-state simplification scheme is the best overall by a small margin (cross-validation data not shown).
Table II shows the results obtained from all the alignment methods examined. The scores are expressed as a fraction of the confidently aligned sections of the structural alignment (as calculated by STAMP and subsequently verified manually). A score of zero would indicate that none of the structural alignment was reproduced and a score of one that the structural alignment was reproduced exactly. 'HMMER2' represents the scores obtained using the HMMER2 package and 'Sequence' the alignment using the primary sequence (the standard 20 amino acid set) information only. 'Profile' refers to the result of aligning the family alignments using primary sequence information only and the profile alignment routine of ClustalW. From the scores quoted one can see that the Smith two-state method is the most successful of those examined here.
In earlier work not reproduced here, the proteins were aligned using the simplified amino acid groupings without any structural predictions. In nearly all cases the alignments were at least as inaccurate as those produced using the standard 20 amino acid set. The experimentally determined secondary structure (as produced by DSSP) was also used in the earlier work in the same way as predicted structure was used here. It was found that very good alignments could be obtained but, interestingly though understandably, the matrices that scored well were not those that scored best when using predicted secondary structure.
Table III shows the results of the secondary structure predictions obtained using DSC. It shows the prediction accuracy expressed as a percentage when using the single sequences, the profile alignment obtained from ClustalW in its default mode and the alignment produced by the matrices developed in this work (averaged over all 14 families) as input to DSC. The scores are quoted for both two-state (coil and non-coil) and three-state (helix, sheet and coil) descriptions of secondary structure.
Table IV
| Discussion |
|---|
The work presented here shows that the inclusion of secondary structural information when aligning proteins leads to improved accuracy. The choice of simplification scheme, matrix and whether to consider a two-state or three-state secondary structure description has a profound effect upon the accuracy of the resulting alignment. We have found that the improvement in alignment accuracy using our method results in an improved secondary structure prediction for each protein. Whilst this might be expected, it was not obvious beforehand whether the prediction algorithm would perform better on a more accurate alignment. One technique which makes use of predicted secondary structure is fold recognition. The quality of the secondary structure prediction is a major factor in the success or failure of fold recognition techniques (Rost, 1995).
From Table III one can see that the position of non-coil structure is predicted more accurately than the type of non-coil structure when three-state prediction methods are used. Although the work here has used the program DSC to produce structure predictions, other methods and algorithms such as PHD (Rost and Sander, 1993), Predator (Frishman and Argos, 1997) and the Quadratic Logistic Server (Di Francesco et al., 1995) were briefly examined (results not shown here) and the same observations made. These results and the figures in Table II suggest that concentrating on producing an accurate two-state prediction may lead to better results than the more usual three-state. Whilst the secondary structure prediction scores quoted in Table III are not as good as those quoted for the more recent methods such as PSIPRED, it should be noted that the predictions used here initially were less accurate than is now achievable. As the method is not specific to DSC, more accurate secondary structure predictions may increase the alignment accuracy obtained by the method presented here still further.
Another method with the same aim of incorporating secondary structure information has been published by Heringa (1999). Heringa's work differs from that detailed here in that the full amino acid alphabet is retained and three amino acid exchange matrices are used for each of the three secondary structures considered (helix, sheet and coil). There is also some filtering of the predictions used within the method, with the predictions being obtained using the SSPRED technique. Gap penalties are also varied for each of the three secondary structure states predicted. Heringa's method was applied to the sequence sets of the flavodoxin and cupredoxin protein families only. This, coupled with the absence of scores representing the amount of correct alignment achieved by the method, unfortunately makes it impossible to compare the relative effectiveness of the two methods.
Homology modelling is another area where this work has implications. The initial alignment used to construct a homology model is crucial to the accuracy of the final model and any errors at this point are magnified by each subsequent step. The CASP competitions, of which CASP3 [see Sternberg et al. (1999), http://PredictionCenter.llnl.gov/casp3/ and Proteins, 1997, Suppl., 1230] is the most recent, highlight this. It has been shown that the single most important step in the comparative (homology) modelling section is the sequence alignment (Sternberg et al., 1999). If the secondary structure considerations are taken into account during the alignment stage, it is possible to build better models and to prevent regions of different secondary structural type being aligned with one another.
This work has looked at the scenario where none of the three-dimensional structures within a superfamily is known, yet has been able to increase the alignment accuracy by incorporating predicted secondary structure. With the explosion in the number of sequences as a result of the human genome work, many superfamilies will have no structural data whatsoever and the sheer number of sequences will make automation essential. This approach manages to automate the task of aligning protein superfamily members successfully and provides a tool to work with the large numbers of new proteins.
| Notes |
|---|
2 To whom correspondence should be addressed
| References |
|---|
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
Biou,V., Gibrat,J.F., Levin,J.M., Robson,B. and Garnier,J. (1988) Protein Eng., 2, 185–191.
Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 211–222.
Di Francesco,V., Munson,P.J. et al. (1995) In Proceedings of the 28th Hawaii International Conference on System Sciences. IEEE, Los Alamitos, CA, 5, pp. 285–291.
Eddy,S.R. (1996) Curr. Opin. Struct. Biol., 6, 361–365.
Eddy,S.R. (1998) http://hmmer.wustl.edu.
Frishman,D. and Argos,P. (1997) Proteins, 27, 329–335.
Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) J. Mol. Biol., 120, 97–120.
Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Science, 256, 1443–1445.
Heringa,J. (1999) Comput. Chem. (Oxford), 23, 341–364.
Jones,D.T. http://globin.bio.warwick.ac.uk/psipred/.
King,R.D. and Sternberg,M.J.E. (1996) Protein Sci., 5, 2298–2310.
King,R.D., Sternberg,M.J.E. et al. (1997) CABIOS, 13, 473–474.
Lemer,C., Rooman,M.J. and Wodak,S.J. (1996) Proteins, 23, 337–355.
Luthy,R., McLachlan,A.D. and Eisenberg,D. (1991) Proteins: Struct. Funct. Genet., 10, 229–239.
Marchler-Bauer,A. and Bryant,S.H. (1997) Trends Biochem. Sci., 22, 236–240.
Mehta,P.K., Heringa,J.P. and Argos,P. (1995) Protein Sci., 4, 2517–2525.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Rost,B. (1995) In Bohr,H. and Brunak,S. (eds), Protein Folds. A Distance-based Approach. CRC Press, Boca Raton, FL, pp. 132–151.
Rost,B. (1999) Protein Eng., 12, 85–94.
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.
Russell,R.B. and Barton,G.J. (1992) Proteins: Struct. Funct. Genet., 14, 309–323.
Smith,F.R. and Smith,T.F. (1990) Proc. Natl Acad. Sci. USA, 87, 118–122.
Sternberg,M.J.E., Bates,P.A., Kelley,L.A. and MacCallum,R.M. (1999) Curr. Opin. Struct. Biol., 9, 368–373.
Taylor,W.R. (1986) J. Theor. Biol., 119, 205–218.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
Zvelebil,M.J.J.M., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.
Received January 28, 2000; revised January 18, 2001; accepted February 15, 2001.