Protein Engineering vol. 16 no. 12 pp. 949-955, 2003
© 2003 Oxford University Press
Protein fold comparison by the alignment of topological strings
Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK 1Present address: Department of Biochemistry and Chemistry, University of Leicester, Leicester LE1 7RH, UK
2 To whom correspondence should be addressed. e-mail: wtaylor{at}nimr.mrc.ac.uk
| Abstract |
|---|
|
|
|---|
Using the definitions of protein folds encoded in a text string, a dynamic programming algorithm was devised to compare these and identify their largest common substructure and calculate the distance (in terms of the number of edit operations) that this lay from each structure. This provided a metric on which the folds were clustered into a phylogenetic tree. This construction differs from previous automatic structure clustering algorithms as it has explicit representation of the structures at ancestral branching nodes, even when these have no corresponding known structure. The resulting tree was compared with that compiled by an expert in the field and while there was broad agreement, differences were found that resulted from differing degrees of emphasis being placed on the types of operations that can be used to transform structures. Some concluding speculations on the relationship of such trees to the evolutionary history and folding of the proteins are advanced.
Keywords: classification/evolution/fold/protein/topology/trees
| Introduction |
|---|
|
|
|---|
With the rapidly growing number of known protein structures, it becomes increasingly important to have a consistent and systematic way to classify and analyse these structures. Such analysis provides insights into the variety of structures found in nature and helps in the understanding of how they relate to one another. As a consequence, light may be shed on the enigmatic relationship between structure and sequence, including its manifestation through function, folding and evolution. A better understanding of these can then, in turn, be applied to improve the prediction of structure from sequence data.
Hierarchical clustering
Protein structures are usually analysed at the level of the domain. However, the definition of a domain is not always straightforward. Small structural differences between otherwise similar proteins can have major consequences to the way in which a protein structure is broken into domains and even within a domain such differences can alter the way in which its fold (or topology) is perceived. Expert judgement can be used to some extent to overcome these problems, but experts do not always agree.
Current systems for the classification of protein structures use different methods. SCOP (Murzin et al., 1995
) is principally a manual approach while FSSP (Holm and Sander, 1997
) is an automated method. Between these lie CATH (Orengo et al., 1997
) and HOMSTRAD (Mizuguchi et al., 1998
) which combine automation with manual curation. The basic approach, however, is the same: proteins are divided into their component domains that are then compared pairwise and clustered into groups of similar structures. This approach works well when there is clear similarity between the structures; however, for distantly related (or unrelated) structures, it becomes difficult to define a rational hierarchy on the relationships.
This difficulty is further compounded by the problem of domain definition where differences in the definitions can result in inconsistent fragmentary similarities between structures. Indeed many of the differences between the current classification systems are largely due to different domain definitions (Hadley and Jones, 1995
).
Sub-graph matching
To tackle the problem of relating distant structural similarities, more abstract methods were developed based on the network of interactions between secondary structures. Such networks were amenable to the computational methods of graph analysis, allowing a structural similarity to be defined as a sub-graph isomorphism between two structures (Mitchell et al., 1989
; Artymiuk et al., 1990
; Koch et al., 1992
; Harrison et al., 2002
; Taylor, 2002c
). One advantage of the approach is that it implicitly solves the problem of defining a domain since each domain is the connected components in the graph.
This approach is encountered again in the TOPS method that was derived from comparing topological diagrams of protein folds (Flores et al., 1994
; Gilbert et al., 1999
); however, this method is restricted to protein folds that incorporate a ß-sheet as it does not embody a detailed description of helical interactions.
In these approaches, the size of the common sub-graph found in two proteins can be taken as a measure of their similarity and used as a basis either for clustering or representing relationships as a network of connected cores.
Evolutionary approach
From the pioneering work of Ptitsyn and colleagues (Ptitsyn and Finkelstein, 1980
; Finkelstein and Ptitsyn, 1987
), it has been shown that the three-dimensional arrangement of helices and strands in larger proteins can be obtained from the stepwise addition of secondary structure elements (SSEs) to basic structural motifs (Efimov, 1993
, 1997
). Whether this accretion of SSEs reflects either a possible folding pathway for the protein or an evolutionary history is debatable but irrespective of any of these rationalizations, it provides a valid approach for the classification of protein folds.
Using this approach, the protein folds become organized into a phylogenetic tree (which may include missing-links). Unlike clustering by similarity, the tree can be arbitrarily deep, so relating the most dissimilar folds. A disadvantage, however, is that the construction of the trees is a manual operation that embodies an implicit set of assumptions and rules that are only to varying degrees stated explicitly.
Matching ideal forms
In a double attack on the problems of protein fold classification with domain definition, a series of ideal folds (called forms) were developed and matched at the level of secondary structures to known structures (Taylor, 2000
, 2002a
). Each successful match simultaneously identifies a fold and defines the domain. The set of ideal folds can be organized in a table which is not unlike a protein equivalent to the periodic table of elements (Taylor, 2002b
).This analogy is based on the correspondence between layers of secondary structure with electronic orbitals. Just as the orbitals become filled with electrons, so the layers become filled with secondary structures. In this arrangement, a step in any direction in the table represents the addition or deletion of an SSE in one of the layers.
The organization of known structure based on their ideal forms embodies many of the principles discussed above for the alternative approaches: for example, the identification of the largest common form shared between two proteins corresponds to their largest isomorphic sub-graph in the graph-based methods, while steps within the table are equivalent to the addition of secondary structures onto a core structure. The growth of a large structure from a simple core can then be viewed as a pathway through the table of forms (ToF) and a distance metric between structures can be established as the shortest path (or edit distance) between the structures.
Outline of the work
In this work, the connections between the evolutionary approach to protein fold analysis and pathways through the ToF are explored. The aim of this investigation is to develop both a metric of structure similarity based on an edit distance and also to establish in a rigorous way, the set of rules that can be used to constrain steps through the ToF.
The work concentrates on the three-layer ß
ß structures which provide a rich collection of structures that have been extensively analysed by Efimov into a large evolutionary tree.
| Materials and methods |
|---|
|
|
|---|
Topology strings
The encoding of protein architectures as layers of secondary structure allows the fold of the chain to be described as a series of moves between the layers. Concentrating on the three-layer ß
ß architecture, each layer can be designated by the letters A, B and C (respectively) specifying a position coordinate in one dimension. Location in the layer was encoded as a numeric displacement relative to the first SSE to be found in the layer (specifying a second coordinate dimension). The third dimension encodes just the orientation of each SSE relative to the first SSE. This string encoding is similar to that devised by Flower (1998
) but is directly rooted in a coordinate frame making it more computationally tractable. A chain path can then be encoded using three descriptors for each SSE, as shown in Figure 1.
|
This encoding scheme might appear to depend on the orientation of the molecule; however, both the orientation and positions of the SSEs are determined relative to the first strand and, if in addition, the labelling of the layers is not predefined, then the scheme can be made orientation independent. This latter property was achieved by assigning the label A to the first
layer to be occupied. Although the scheme is independent of orientation for the whole molecule, i.e. identical structures will have the same string, it remains sensitive to the starting point so two identical substructures within larger molecules need not have the same string. This difficulty will be considered further below. Largest common fold
Dynamic programming solution. Given two topology strings, the problem of finding the largest common fold can be solved using a variation of the dynamic-programming algorithm commonly used to align sequences. In this initial consideration of the problem, we will assume that the assignment of layers and the orientation of the first strand are the same in both strings.
Given the topology strings corresponding to the two proteins shown in Figure 1:
+B0.A0.+B2.C0.+B1.C1.+B1.C2.+B2
and
+B0.A0.+B1.C0.+B2.C1.+B3.A1.+B1
a matrix of scores is calculated by accumulating a positive score (+3) for a match in the SSE position and orientation relative to the last SSE in the same layer and a negative score (1) for a mismatch. Taking the two example topology strings (Figure 2), the first strands match and score 3, as do the following helices (giving 6), the next pair of strands have the same orientation but differ in their displacement from the first strand, so they mismatch, reducing the score to 5. Following this algorithm to completion identifies the highest scoring pathway through the matrix which corresponds to the elements that match in the common core. (See Figure 2 for further details.)
|
Substructure matching. The basic algorithm described above is limited by the requirement that SSE matches are dependent on their orientation and displacement relative to the first instance of an SSE in the same layer and so cannot be used to find substring matches (corresponding to local alignments in sequence matching). However, following the use of dynamic programming in three-dimensional structure matching (double dynamic programming) (Taylor and Orengo, 1989
C characters combined with the negation of all position values while a flip in up/down orientation requires the negation of the orientation sign combined with either of the preceding transformations. As the orientation is set by the forced match of the initial elements in the substrings, only one alternative string needs to be tested and the better match of the two retained as the solution. Additional constraints. To produce biologically meaningful results, the algorithm was modified so that solutions that contained a gap in a ß-sheet were discarded. This implies that the algorithm will not skip-over core ß-strands within a substructure but these are still able to be discarded from the edge of a sheet as mismatches in the normal way.
The solution to a single comparison is not necessarily unique and it can be seen from the example in Figure 2 that there are two equally high scoring paths (one substitutes a final helix in the match for a strand). In this situation, a decision was made to take the solution with most ß-strands.
An option was provided to report the largest common substructure (LCS) that does not contain any internal mismatches (corresponding to a strict sub-diagonal in the score matrix). This constrains solutions to be derived only by deletions from the termini of the chains. By analogy with sequence alignment, the two variants are referred to as global and local, with the latter being the more restrictive. The different behaviour of this option is shown in Figure 3. It was considered that the option that allows general deletion of structure was more biologically realistic and the results presented below were all generated with this option.
|
Hierarchical clustering
Proteins can now be clustered according to the common substructures that they share and two algorithms were investigated for this. The first clusters according to the size of the substructure: i.e. the largest substructure is chosen as the next node in the tree. The second creates clusters according to the distance between the structures: i.e. in each cycle the substructure which requires the least number of additions to form the two larger structures is chosen as the next node in the tree. These alternatives will be referred to as the basic and distance based algorithms, respectively.
In both algorithms, the new node is added to the pool of structures, and the two larger structures are removed. The common substructure of two proteins is then represented at their joining node as an ancestral structure, unless this also corresponds with one of the structures (i.e. the smaller structure is contained within the larger), in which case the smaller structure itself occupies the ancestral node.
Data
The method was tested on a subset of the three-layer ß
proteins analysed by Efimov (1997
). These were: (1) thiolase; (2) phosphoglucomutase (PGM); (3) adenylate kinase (ADK); (4) fructose-2,6-biphosphatase (FBiP); (5) phosphofructokinase (small domain) (PFK); (6) porphobilinogen deaminase (domain 2) (PBGD); (7) L-arabinose binding protein (Q domain) (ABP); (8) rhodanase (RHD); (9) dethiobiotin synthase (DBS); (10) subtilisin (SBT); (11) isocitrate dehydrogenase (ICD); (12) pyruvate decarboxylase (
,
-domains) (PDC); (13) hypoxanthine-guanine phosphoribosyltransferase (HGPRT); (14) pyruvate oxidase (POx); (15) mitochondrial F1-ATPase (ATPase); (16) rec A protein (recA). These proteins will be referred to in the text by name, or more commonly by their number (in parentheses) or abbreviation (in brackets). The topology strings for these structures are given in Table I.
|
For these structures, we have adopted the secondary structure definitions of Efimov. For the general application of the method, the problem of secondary structure definition will need to be addressed. This aspect will revisited in the Discussion.
The selected structures cover the major well populated part of the tree of structures drawn by Evimov (1997
, table 5) and ignores the part of his tree in which larger structures are connected by long sparsely populated branches back to the root.
| Results and discussion |
|---|
|
|
|---|
Each of the clustering methods described in Materials and methods (Hierarchical clustering) generated a reasonable tree structure of the similarities between the proteins. Allowing for some branch rearrangement in the representation of the trees (which were generated automatically), it can be seen that the two trees are almost isomorphic (Figures 4 and 5). The main difference is the insertion of some extra intermediate steps using the LCS-based variation of the clustering algorithm. Specifically, an extra node is inserted between the root of the tree and the two branches carrying thiolase (1) and adenylate kinase (3) and the intermediate node represented by the ancestral structure number 18 in Figure 5 is replaced by two ancestral structures in Figure 4.
|
|
The latter tree (Figure 4) was compared with a simplified representation of the tree constructed by Efimov (1997
|
The largest rearrangement, however, occurs between the two branches carrying subtilisin (10) and PFK (5). Efimov links subtilisin (10) by a long branch of seven ancestral structures back to a point where there is a common node with our automatically generated tree. Similarly, DBS (9) is linked back through five ancestral structures to PFK (5). In contrast, these two structures are linked by a common node (number 17) in our construction (Figure 4). This association involves the loss of two helices and an edge strand from DBS to recreate the unusual left-handed ß
ß unit found at the N-termini edge of the subtilisin sheet. Although this transition can be accomplished in three steps it is not an obvious route to take as it involves inserting/deleting a helix that lies between two existing helices. While the automatically generated trees are in broad agreement with the Efimov tree, there are details, involving the desirability of making particular insertions/deletions in core positions that need to be further examined. In the current work we have not attempted to make any further modifications to the basic algorithm but have, rather, assessed its behaviour on a well characterized group of protein structures. If it is assumed that Efimov provides a gold-standard for the relationships between structures, it would now be possible to modify the algorithm (adding further constraints or relaxing existing ones) to optimize the behaviour of the algorithm on its ability to regenerate the trees of Efimov. Rather than pursue this route, a more fundamental investigation would involve an attempt to derive the underlying constraints from the data. Using the automated approach described here, it will be possible to calculate trees of structures rapidly for each formulation of the rules, so allowing these to be varied until the most parsimonious tree is obtained.
The primary application of this approach is to impose a minimal hierarchical description on the relationship of protein folds in a way that clearly states the assumptions and rules that have been applied. While the resulting classification is in itself of value, it is also interesting to speculate whether the relationships might have resulted from a corresponding series of evolutionary events. If this were so then the ancestral nodes that have no equivalent known structure might correspond to relic structures that are yet to be discovered. Alternatively, it might be postulated that the resulting order reflects the constraints of similar folding pathwayswith the accretion of secondary structure imitating the assembly of the protein structure as it folds (as originally postulated by Ptitsyn and colleagues). These two options may even be simultaneously true: in the way in which ontogeny recapitulates phylogeny in embryonic development, so the protein folding pathway might equally recapitulate its evolutionary history.
For the further application of the method to a large set of protein structures, the problem of secondary structure definition will need to be addressed. For this, a method that is insensitive to the fine details of hydrogen-bonding is required and we are investigating the use of a line-segment based approach for this (Taylor, 2001
). More importantly, for the method to become fully automatic, it will also be necessary to have a robust method to define protein topology. As mentioned above, the matching of idealized forms can be used for this (Taylor, 2002b
) but this introduces the added complexity that there is not always a unique match of a form to a structure. To overcome this problem, the current method has the potential to be extended to consider alternative topology strings derived from multiple forms. Thus, rather than force the choice of a best match for each protein, the matches that give rise to the best path can be selected.
In this work, we have presented the first automatic method to cluster protein structures by their fold using a hierarchical tree organization with the representation of ancestral nodes. We have shown that, in outline, the resulting trees correspond well with those compiled by an expert and that some differences arise both through differing emphasis on the permitted rules of evolution and some from inconsistent manual application of these rules. The evolutionary significance of the resulting tree and its ancestral structures is difficult to evaluate but as the tree has been compiled without bias and without reference to function, the subsequent analysis of a larger number of structures with reference to their function may provide some insights.
| Acknowledgements |
|---|
We thank Beatrice Gorinsky for her support and encouragement during this project. L.O.J. was supported by BBSRC and the work formed part of the requirements for fulfilment of the MSc course in Crystallography at Birkbeck College, London.
| References |
|---|
|
|
|---|
Artymiuk,P.J., Rice,D.W., Mitchell,E.M. and Willett,P. (1990) Protein Eng., 4, 3943.
Efimov,A.V. (1993) Prog. Biophys. Mol. Biol., 60, 201239.[CrossRef][Web of Science][Medline]
Efimov,A.V. (1997) Proteins, 28, 241260.[CrossRef][Web of Science][Medline]
Finkelstein,A.V. and Ptitsyn,O.B. (1987) Prog. Biophys. Mol. Biol., 50, 171190.[CrossRef][Web of Science][Medline]
Flores,T.P., Moss,D.S. and Thornton,J.M. (1994) Protein Eng., 7, 3137.
Flower,D.R. (1998) Protein Eng., 11, 723727.
Gilbert,D., Westhead,D., Nagano,N. and Thornton,J.M. (1999) Bioinformatics, 15, 317326.
Hadley,C. and Jones,D.T. (1995) Structure, 7, 10991112.[CrossRef]
Harrison,A., Pearl,F., Mott,R., Thornton,J. and Orengo,C. (2002) J. Mol. Biol., 323, 909926.[CrossRef][Web of Science][Medline]
Holm,L. and Sander,C. (1997) Nucleic Acids Res., 25, 231234.
Koch,I., Kaden,F. and Selbig,J. (1992) Proteins, 12, 314323.[CrossRef][Web of Science][Medline]
Mitchell,E.M., Artymiuk,P.J., Rice,D.W. and Willett,P. (1989) J. Mol. Biol., 212, 151166.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) Protein Sci., 7, 24692471[Web of Science][Medline]
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[CrossRef][Web of Science][Medline]
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 10931108.[Medline]
Ptitsyn,O.B. and Finkelstein,A.V. (1980) Q. Rev. Biophys., 13, 339386.[Web of Science][Medline]
Taylor,W.R. (2000) Biochem. Soc. Trans, 28, 264269.[Web of Science][Medline]
Taylor,W.R. (2001) J. Mol. Biol., 310, 11351150.[CrossRef][Web of Science][Medline]
Taylor,W.R. (2002a) In Mewes,B. and Weiss,H.S. (eds), Bioinformatics and Genome Analysis. Springer-Verlag, Berlin, Ernst Schering Research Foundation Workshop Vol. 38, pp. 133148.
Taylor,W.R. (2002b) Nature, 416, 657660.[CrossRef][Medline]
Taylor,W.R. (2002c) Mol. Cell. Proteomics, 1, 334339.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 122.[CrossRef][Web of Science][Medline]
Received June 19, 2003; revised September 13, 2003; accepted October 21, 2003
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






