Protein Engineering, Vol. 15, No. 12, 955–957, December 2002
© 2002 Oxford University Press
Closed loops: persistence of the protein chain returns
1 Department of Structural Biology, The Weizmann Institute of Science, P.O.B. 26, Rehovot 76100 and 3 Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel
| Abstract |
|---|
It has recently been discovered that globular proteins are universally built from standard loop-n-lock units of about 30 amino acid residues. A hypothesis has been put forward of a loop stage in protein evolution, when these units were autonomous; later they joined together to make longer chains. One would expect that the early individual loop-n-lock elements might still be detectable in modern protein sequences as remnants of the hypothetical 30-residue sequence prototypes. Among several strong sequence motifs extracted from the protein sequences of 23 complete bacterial proteomes, one 32-residue prototype is studied here in detail. Numerous sequence segments related to the prototype are identified in the crystal structures of proteins of the PDB_SELECT database. Analysis of the respective chain trajectories, for cases with different degrees of sequence conservation, confirms that the majority of the segments correspond to closed loops. In the evolutionary diversification of the prototypes the secondary structure yields first, while the sequence is still moderately conserved. The last feature to go is the chain return property. Apparently, opening of the loops would severely destabilize the protein fold, which explains their conservation.
Keywords: closed loops/protein chain return persistence/protein evolution
| Introduction |
|---|
A protein chain trajectory makes many returns to itself, thus forming closed loops (Berezovsky et al., 2000).
| Materials and methods |
|---|
The protein sequences of the following complete prokaryotic genomes were used for the calculations: Archaea, A.pernix, A.fulgidus, M.thermoautotrophicum and P.abyssi; and Eubacteria, A.aeolicus, B.burgdorferi, C.jejuni, C.pneumoniae, C.trachomatis, D.radiodurans, E.coli, H.influenzae, H.pylori, M.tuberculosis, M.pneumoniae, N.meningitidis, R.prowazekii, Synechocystis, T.maritima, T.pallidum, U.urealyticum, V.cholerae and X.fastidiosa. The sequences were obtained from the National Center for Biotechnology Information via the Entrez browser and were used without any filtering.
The search for the most frequent motifs of 30 residues consisted of the following steps. (i) Every 30-residue segment from an exhaustive collection of about one million different segments taken from the proteome of E.coli is matched to all protein sequences of the 23 complete bacterial proteomes, and the number of matching fragments is counted. In this first step the threshold is set to 11 matching residues, to ensure a sufficiently large number of well matching (>37% match) fragments. (ii) The collected matching fragments (of the order of several hundred for a successful sequence motif) are used to derive an initial matrix of the distribution of the 20 amino acid residues over the 30 positions. (iii) The initial matrix is used for the next round of collecting matching fragments. Here the comparison is made between the matrix and the tested sequence, rather than sequence-to-sequence as in the initial stage. The similarity of each 30-residue tested sequence from the bacterial proteomes is calculated as the average of the matching normalized matrix elements. If this average exceeds a certain minimal value, the threshold of similarity, the sequence fragment is included in the family used to calculate the next-round matrix. Inclusion of sequence fragments of lower similarity would cause instability of the matrix, that is, its divergence in the iteration process. The range of threshold values corresponding to stable solutions may vary (between 0.4 and 0.8), depending on the amino acid compositions of the initial 30-residue sequence and of the whole sequence ensemble. It is noteworthy that convergence of the iterated matrices to a stable solution is the sole criterion for considering the respective pattern to be present in the sequences, irrespective of the numbers of matching segments in natural and random (shuffled) sequences. This also means that the choice of the initial match threshold and of the subsequent matrix thresholds is empirical. (iv) From the total of about one million tested segments, those showing the highest final scores after several rounds (typically 5–10) of convergent iteration of the matrices are chosen. (v) The sequence dimension of the resulting matrix is adjusted by inspecting the frequencies of the amino acid residues before and beyond the initial 30-residue range. For example, the matrix for the sequence prototype analyzed in this work has a sequence dimension of 32 residues. In the histogram shown in Figure 1 it corresponds to the dark gray area with high frequency values at the edges. If the dimension is taken beyond the 32 residues, the respective frequency values drop to the background level. On the other hand, the choice of a shorter range results in high values beyond it, indicating that the choice is not optimal. The family of descendants of this prototype comprises 978 fragments. The similarity threshold is set to 0.4.
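Steps (i)–(iii) can be illustrated with a minimal Python sketch. It is not the authors' program: the function names, the normalization of each matrix column to its most frequent residue, the use of "greater than or equal" at both thresholds, and the stopping rule (the family no longer changes between rounds) are all assumptions made for illustration only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_matches(a, b):
    """Number of positions where two equal-length segments share a residue."""
    return sum(x == y for x, y in zip(a, b))


def windows(sequence, size):
    """All overlapping segments of the given size."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def build_matrix(fragments, size):
    """Residue frequencies per position. Each column is scaled so that its
    most frequent residue scores 1.0 (this normalization is an assumption)."""
    matrix = []
    for pos in range(size):
        counts = Counter(f[pos] for f in fragments)
        top = max(counts.values())
        matrix.append({aa: counts.get(aa, 0) / top for aa in AMINO_ACIDS})
    return matrix


def matrix_score(matrix, segment):
    """Average of the matrix elements selected by the segment's residues."""
    return sum(col.get(res, 0.0) for col, res in zip(matrix, segment)) / len(matrix)


def refine(seed, sequences, size=30, seed_threshold=11, similarity=0.4, max_rounds=10):
    """Seed a family by sequence-to-sequence matching (step i), then iterate
    matrix building (step ii) and matrix-to-sequence collection (step iii)."""
    candidates = [w for s in sequences for w in windows(s, size)]
    family = [w for w in candidates if count_matches(seed, w) >= seed_threshold]
    matrix = None
    for _ in range(max_rounds):
        if not family:
            return None, []
        matrix = build_matrix(family, size)
        new_family = [w for w in candidates if matrix_score(matrix, w) >= similarity]
        if new_family == family:  # assumed convergence test: stable family
            break
        family = new_family
    return matrix, family


# Toy example with 4-residue segments instead of 30-residue ones.
toy_matrix, toy_family = refine("ACDE", ["ACDEACDF", "GGGGACDE"],
                                size=4, seed_threshold=3, similarity=0.8)
print(toy_family)
```

With real proteomes the candidate list would hold millions of windows, so a practical implementation would need indexing or vectorized scoring; the sketch only shows the logic of the iteration.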
| Results and discussion |
|---|
Table I
Figure 1
Figure 2 displays the structures from the PDB_SELECT database of crystallized proteins (Hobohm and Sander, 1994) corresponding to the sequences matching the prototype. A total of 32 such sequence segments are located in the database, with a match of 9–26 residues. For respective random sequences the typical match is 2–5 residues. (If sequences of 30 residues of uniform amino acid composition are compared, the expected match is 30/20 = 1.5 residues; the higher match of 2–5 residues is due to non-uniformity of the composition.) Figure 2 includes the four cases with the highest matches observed [from 11 to 26 residues; Figure 2(A)–(D)], and also a representative set of structures with lower sequence match. Among the segments with a match of nine residues [Figure 2(H)–(L) and data not shown], 12 display a closed loop structure and 10 have a non-loop appearance [as in Figure 2(K) and (L)]. The four structures with higher sequence match (11–26 residues) are all of the turn–β type. With a match of 9–10 residues (28 cases in total) this structural motif still dominates, appearing nine times. The other cases are either loops of different structures [e.g. Figure 2(F), (H) and (I)] or non-loop sections (seven and 12 cases, respectively). Thus, of the 32 structures, the four with the highest sequence match correspond to standard turn–β elements; nine more have a lower sequence match but still retain the turn–β structure; seven structures of lower sequence match have lost the turn–β motif while retaining their loop property; finally, 12 elements of lower sequence match have lost both the typical structural motif and the chain return property.
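The random-match estimate and the tally above can be checked with a short calculation. The sketch below assumes that two random segments are drawn independently, position by position, from the same residue composition, so the expected number of identical positions is the segment length times the sum of squared residue frequencies. The function name and the category labels are illustrative only.

```python
def expected_random_match(length, composition):
    """Expected number of identical positions between two random segments
    whose residues are drawn independently from `composition`
    (a mapping of residue -> count or frequency)."""
    total = sum(composition.values())
    return length * sum((n / total) ** 2 for n in composition.values())


# Uniform composition over 20 residues reproduces 30/20 = 1.5.
uniform = {aa: 1 for aa in "ACDEFGHIKLMNPQRSTVWY"}
uniform_match = expected_random_match(30, uniform)

# Counts of the 32 segments as reported in the text.
tally = {
    "prototype motif, highest match": 4,
    "prototype motif, lower match": 9,
    "other loop, lower match": 7,
    "non-loop, lower match": 12,
}
total_segments = sum(tally.values())
closed_loops = total_segments - tally["non-loop, lower match"]
print(f"uniform expectation: {uniform_match:.2f}")
print(f"closed loops: {closed_loops} of {total_segments}")
```

Any departure from uniform composition raises the sum of squared frequencies, which is consistent with the larger random match quoted in the text.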
As the data above demonstrate, the prototype turn–β structure survives even after 72% of the presumed original prototype sequence has been lost. Moreover, the chain return property is still conserved in low-match structures where the prototype secondary structure is lost. Apparently, the return property is maintained owing to marginal sequence conservation, to the supporting influence of the remaining parts of the fold, or to both. Unfolding of the closed loop would cause severe changes in the path of the protein chain, whereas sequence variations and changes in the details of secondary structure would have only local influence and, in general, be of less importance for the overall protein fold.
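The two quantities contrasted in this paragraph, sequence loss and chain return, can be made concrete with a small sketch. The criterion used here for a closed loop (the Cα atoms at the two ends of the segment lie within a cutoff distance) and the 10 Å default are assumptions for illustration; the text does not state the criterion. The coordinates are made up, and `math.dist` requires Python 3.8 or later.

```python
import math


def fraction_lost(matching, prototype_length):
    """Fraction of prototype positions that no longer match."""
    return 1.0 - matching / prototype_length


def is_closed_loop(ca_coords, start, end, cutoff=10.0):
    """True if the C-alpha atoms at the two ends of the segment lie within
    `cutoff` angstroms of each other (hypothetical criterion and cutoff)."""
    return math.dist(ca_coords[start], ca_coords[end]) <= cutoff


# A match of 9 residues against the 32-residue prototype.
lost = fraction_lost(9, 32)
print(f"sequence lost: {lost:.0%}")

# Made-up coordinates: a hairpin-like path that returns to its start,
# and a fully extended path that does not.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]
print(is_closed_loop(hairpin, 0, 5), is_closed_loop(extended, 0, 5))
```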
This work introduces an important dimension into protein evolutionary studies. Highly diverged protein segments with barely recognizable sequence similarity and no structural resemblance may still be related, as long as the segments appear as closed loops in the folds. For possible generalization of the above observations, other sequence prototypes have to be analyzed in a similar way (work in progress). Many of the closed loops observed in the crystallized proteins resemble the structural types described in the early study by Levitt and Chothia (Levitt and Chothia, 1976). It remains to be seen what the complete spectrum of closed loop structures will be.
| Notes |
|---|
2 To whom correspondence should be addressed. E-mail: igor.berezovsky@weizmann.ac.il
| Acknowledgments |
|---|
We are grateful to Mrs A.Weinberg for editing of the text. I.N.B. is a Post-Doctoral Fellow of the Feinberg Graduate School, Weizmann Institute of Science. V.M.K. is supported by the Ministry of Absorption.
| References |
|---|
Berezovsky,I.N. and Trifonov,E.N. (2001a) J. Mol. Biol., 307, 1419–1426.
Berezovsky,I.N. and Trifonov,E.N. (2001b) Protein Eng., 14, 403–407.
Berezovsky,I.N., Grosberg,A.Y. and Trifonov,E.N. (2000) FEBS Lett., 466, 283–286.
Berezovsky,I.N., Kirzhner,A., Kirzhner,V.M. and Trifonov,E.N. (2001) Proteins, 45, 346–350.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.
Lamarine,M., Mornon,J.-P., Berezovsky,I.N. and Chomilier,J. (2001) Cell. Mol. Life Sci., 58, 492–498.
Levitt,M. and Chothia,C. (1976) Nature, 261, 552–558.
Trifonov,E.N. and Berezovsky,I.N. (2002) Mol. Biol., 36, 239–243.
Trifonov,E.N., Kirzhner,A., Kirzhner,V.M. and Berezovsky,I.N. (2001) J. Mol. Evol., 53, 394–401.
Received May 28, 2002; revised September 6, 2002; accepted October 1, 2002.