Protein Engineering, Vol. 15, No. 3, 193-203,
March 2002
© 2002 Oxford University Press
Protein sequence comparison based on the wavelet transform approach
BioElectronics Group, Department of Electrical and Computer Systems Engineering, PO Box 35, Monash University, VIC 3800, Australia
| Abstract |
|---|
The amino acid sequence of a protein determines its chemical properties, chain conformation, function and species specificity. Proteins with similar functions share, at some level, identical stretches of amino acid sequence, and the closer the phylogenetic relationship, the more similar the sequences. Finding the similarities between two or more protein sequences is therefore of great importance for protein sequence analysis, and the differences between sequences permit the construction of an evolutionary family tree. In this work, a comparison method was devised that is capable of analysing a protein sequence 'hierarchically', i.e. it can examine a protein sequence at different spatial resolutions. Based on a wavelet decomposition of protein sequences and a cross-correlation study, a sequence-scale similarity concept is proposed for generating a similarity vector, which makes it feasible to compare two sequences at different spatial resolutions (scales). This new similarity concept extends the conventional sequence similarity, which only takes into account local pairwise amino acid matches and ignores the information contained in coarser spatial resolutions.
Keywords: heme proteins/resonance recognition model/wavelet transform
| Introduction |
|---|
Protein comparison and alignment still represent one of the most important and widely used methods of protein sequence analysis (Bishop and Rawlings, 1996).
Often within a protein class, only a few amino acid residues can be designated as 'invariable'. Substituting such an amino acid would destroy both biological activity and function. A consequence of the differing amino acid sequences of functionally similar proteins is their immunological diversity, i.e. their species specificity. The similarities can help to identify individual amino acids crucial for biological function, protein-target interaction and structure maintenance. The structurally essential similarities can be deduced most effectively from amino acid exchange frequencies in proteins of different species. Not only the local similarity but also the global similarity between sequences needs to be found (Bishop and Rawlings, 1996). The similarities can be expressed as a template or a motif, the determinant of a specific structure and function.
Derivation of the three-dimensional structure from the amino acid sequence would be worthwhile, since it cannot be expected that all proteins will produce crystals suitable for X-ray analysis (Goffin et al., 1996). Thus, if a protein sequence of unknown function and unknown structure is compared with other known sequences, its functional and structural information may be revealed by their similarity pattern.
Previous approaches such as FASTA, BLAST and PROSRCH (Pearson and Lipman, 1988; Bishop and Rawlings, 1996) are mainly based on sequence comparison and alignment. In those approaches, similarity (a sequence similarity) simply means the number of identical amino acid pairs shared by the query sequence and the subject sequence.
However, two protein sequences with low sequence identity may show similarities in their physicochemical properties, tertiary structure, resonant recognition model (RRM) spectra and biological functions (Lesk, 1988; Cosic, 1994, 1997). This similarity concept can be enriched by incorporating the notion of similarity in other contexts.
The RRM multiple cross-spectral function can be regarded as a measure of the similarity among different protein sequences in the frequency domain when each protein sequence is treated as a numerical series (Cosic, 1994, 1997). The most prominent peak frequencies show the spectral similarity of the protein sequences. Furthermore, the similarity can be either a local similarity or a long-range similarity, i.e. the overall sequence similarity. With the traditional sequence comparison approaches, finding the local similarity is relatively easy but finding the global similarity is a difficult task (Bishop and Rawlings, 1987). The significance of the similarity is also hard to assess by those approaches. The spectral similarity determined by the RRM is a global similarity because every amino acid in the sequence contributes to the spectrum.
Another analytical approach is the wavelet transform (WT) representation. It is a signal processing method that is efficient for multi-resolution analysis and local feature extraction (Daubechies, 1988, 1992). If the WT is applied to a protein sequence, the similarity can be measured at different resolution scales based on a space-scale analysis. This sequence-scale similarity may reveal more information than other conventional methods.
| Materials and methods |
|---|
The sequence-scale similarity measurement introduced here is based on the discrete wavelet transform (DWT) (Daubechies, 1988
The RRM model
The RRM (Cosic and Nesic, 1988; Cosic et al., 1989, 1991; Cosic, 1990, 1994, 1995, 1996, 1997; Cosic and Hearn, 1991) is a physical and mathematical model which interprets the linear information in a protein sequence using signal analysis methods. It comprises two stages. The first involves the transformation of the amino acid sequence into a numerical sequence. Each amino acid is represented by the value of the electron-ion interaction potential (EIIP) (Veljkovic and Slavic, 1972; Pirogova and Cosic, 1999), which describes the average energy states of all valence electrons in particular amino acids (Table I). The EIIP values for each amino acid were calculated using the following general model pseudopotential (Veljkovic and Slavic, 1972; Pirogova and Cosic, 1999):
[Equation (1): image not recovered]
[Equation (2): image not recovered]
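For orientation, the general model pseudopotential is usually written in the cited RRM literature in the form sketched below. This is a reconstruction from that literature and not a copy of the equations as printed in this paper; the symbol names and the pairing with equation numbers (1) and (2) are assumptions.

```latex
% Assumed form of the EIIP pseudopotential (Veljkovic and Slavic style notation)
\langle k+q \,|\, w \,|\, k \rangle \;=\; \frac{0.25\, Z \,\sin(1.04\,\pi Z)}{2\pi},
\qquad
Z \;=\; \frac{1}{N}\sum_{i} Z_i
```

In this notation $q$ is the change of momentum of the delocalized electron interacting with the potential $w$, $Z_i$ is the number of valence electrons of the $i$-th component of the amino acid and $N$ is the total number of atoms in the amino acid.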
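The first RRM stage, mapping residues to EIIP values, can be sketched as follows. Table I is not reproduced in this copy, so the dictionary below uses the EIIP values (in Rydbergs) as they are commonly tabulated in the RRM literature; treat them, and the helper name `encode`, as assumptions for illustration only.

```python
# EIIP values in Rydbergs, as commonly tabulated in the RRM literature.
# Assumption: Table I of this paper lists the same values.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def encode(sequence):
    """Turn a one-letter amino acid string into a numerical series.

    Hypothetical helper; raises KeyError for non-standard residues.
    """
    return [EIIP[residue] for residue in sequence.upper()]
```

Calling `encode` on a one-letter sequence string returns one EIIP value per residue, which is the numerical series used by the later spectral and wavelet steps.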
In order to extract common spectral characteristics of sequences having the same or similar biological function, the following cross-spectral function was used:
[Equation (3): image not recovered]
numerical series, amplitude spectra and cross spectra, is represented in Figure 1
α- and β-hemoglobins.
To determine the common frequency components for a group of protein sequences, we calculated the absolute values of multiple cross-spectral function coefficients M, which are defined as follows:
[Equation (4): image not recovered]
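As a hedged reminder of how these two quantities are usually defined in the cited RRM literature, a sketch in LaTeX follows. It is not copied from this paper; the index $K$ for the number of sequences is a hypothetical symbol chosen here to avoid confusion with the coefficients $M$.

```latex
% Assumed form of the cross-spectral function of two series x(m), y(m)
S_n = X_n\, Y_n^{*}, \qquad n = 1, 2, \ldots, N/2

% Assumed form of the multiple cross-spectral function for K sequences
|M_n| = \prod_{k=1}^{K} \bigl| X^{(k)}_n \bigr|, \qquad n = 1, 2, \ldots, N/2
```

Here $X_n$ and $Y_n$ are the discrete Fourier transform coefficients of the two numerical series, $Y_n^{*}$ is the complex conjugate of $Y_n$, and $X^{(k)}_n$ are the coefficients of the $k$-th sequence.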
Peak frequencies in such a multiple cross-spectral function denote common frequency components for all sequences analysed. The signal-to-noise ratio (S/N) for each peak is defined as a measure of similarity between the sequences analysed. S/N is calculated as the ratio between the signal intensity at the particular peak frequency and the mean value over the whole spectrum. The extensive experience gained from previous research (Cosic, 1994, 1995, 1996, 1997) suggests that an S/N of at least 20 can be considered significant. The multiple cross-spectral function for a large group of sequences with the same biological function has been named the 'consensus spectrum'. The presence of a peak frequency with significant S/N in a consensus spectrum implies that all of the analysed sequences within the group have one frequency component in common. This frequency is related to the biological function provided that the following criteria are met:
- one peak only exists for a group of protein sequences sharing the same biological function;
- no significant peak exists for biologically unrelated protein sequences;
- peak frequencies are different for different biological functions.
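The consensus-spectrum and S/N calculation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a plain O(N²) DFT is used to stay within the standard library, and mean removal followed by zero-padding to the longest series is an assumption about preprocessing.

```python
import cmath


def amplitude_spectrum(series, length):
    """Amplitudes |X_n|, n = 1..length//2, of the mean-removed, zero-padded series."""
    mean = sum(series) / len(series)
    padded = [v - mean for v in series] + [0.0] * (length - len(series))
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * n * m / length)
                for m, v in enumerate(padded)))
        for n in range(1, length // 2 + 1)
    ]


def consensus_peak(series_list):
    """Return (peak frequency, S/N) of the multiple cross-spectral function."""
    length = max(len(s) for s in series_list)
    product = [1.0] * (length // 2)
    for series in series_list:
        # Multiply the amplitude spectra point by point.
        for i, amplitude in enumerate(amplitude_spectrum(series, length)):
            product[i] *= amplitude
    peak = max(range(len(product)), key=product.__getitem__)
    mean = sum(product) / len(product)
    # Frequency is expressed as n / length, i.e. in the range (0, 0.5].
    return (peak + 1) / length, product[peak] / mean
```

The returned S/N is the peak value divided by the mean of the whole multiple cross-spectrum, matching the definition in the text; the threshold of 20 would then be applied by the caller.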
In our previous studies (Table II), the above criteria were tested with over 1000 proteins from 28 functional groups (Cosic, 1994, 1995, 1996, 1997; Trad et al., 2000). Multiple cross-spectral functions of four different functional groups of proteins are represented in Figure 2. The following fundamental conclusion was drawn from our studies: each specific biological function of a protein or regulatory DNA sequence is characterized by a single frequency. Once the RRM characteristic frequency for a particular biological function has been determined, it is possible to identify the individual amino acids (so-called 'hot spots') [using the Fourier transform (FT)] or the domains [using the continuous wavelet transform (CWT) (Fang and Cosic, 1998, 1999; Trad et al., 2000, 2001)] that contribute most to the characteristic frequency and thus also to the protein's biological function.
The physical meaning of the characteristic frequency
The correlation presented previously between the amplitude spectrum of the numerical representation of genetic sequences and the corresponding biological function can lead to a completely new approach to protein dynamics. Each frequency in the RRM characterizes one biological function (Figure 2). Each biological process involves a number of interactions between proteins and their targets (another protein, a DNA regulatory segment or a small molecule). Each of these processes involves energy transfer between the interacting molecules. These interactions are highly selective and this selectivity is defined within the protein structure. The selectivity of these interactions is proposed to arise from resonant energy transfer between the interacting molecules (Cosic, 1994). Consequently, the characteristic resonant frequencies for a number of different interactions, i.e. biological functions, were theoretically calculated (Table II). These calculations were based on the following key finding: proteins with the same biological function have common periodicities in the distribution of the energies of delocalized electrons along the protein. With this in mind, and taking into account the conductive properties of the protein backbone, the theoretical model of biologically relevant protein resonances was established (Cosic, 1990, 1994, 1997).
The discrete wavelet transform
The wavelet transform (WT) is a relatively new signal processing tool that is efficient for multi-resolution analysis and local feature extraction of non-stationary signals (Daubechies, 1988, 1992). The wavelet transform can be viewed as an inner product operation that measures the similarity or cross-correlation between the signal and the wavelets.
The sequence-scale similarity measurement introduced here is based on the discrete wavelet transform (DWT) and a cross-correlation analysis. The sequences to be compared are first 'converted' into numerical series using the RRM (Cosic et al., 1989; Cosic, 1997). These numerical series are normalized to zero mean and unit standard deviation and zero-padded to an identical length. They are then decomposed by the DWT to M levels, giving details from level 1 to level M and an approximation at level M. Because a correlation function quantifies the degree of interdependence of one process upon another, or establishes the similarity between one set of data and another (Oppenheim and Schafer, 1997), the cross-correlation coefficients are calculated at each level to establish and quantify the similarity between the two compared protein sequences. There are a total of M + 1 correlation coefficients. The value of a correlation coefficient lies between -1 and +1; +1 means 100% correlation in the same sense and -1 means 100% correlation in the opposing sense (Oppenheim and Schafer, 1997). The cross-correlation coefficient is defined as
[Equation (5): image not recovered]
[Equation (6): image not recovered]
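A standard textbook form of the normalized cross-correlation, consistent with the range of -1 to +1 stated above, is sketched below. It is given for orientation only; how the original equations (5) and (6) split these two definitions is an assumption.

```latex
% Normalized cross-correlation coefficient at lag l (assumed form)
\rho_{xy}(l) = \frac{r_{xy}(l)}{\sqrt{r_{xx}(0)\, r_{yy}(0)}}

% Cross-correlation sequence of two real series x(n), y(n) (assumed form)
r_{xy}(l) = \sum_{n} x(n)\, y(n-l), \qquad l = 0, \pm 1, \pm 2, \ldots
```

Here $r_{xx}(0)$ and $r_{yy}(0)$ are the energies of the two series, so that $|\rho_{xy}(l)| \le 1$ by the Cauchy-Schwarz inequality.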
The maximum absolute value of the correlation coefficient at each decomposition level is regarded as the similarity score for these two proteins at that level. Therefore, a total of M + 1 maximum values are taken to form a sequence-scale similarity vector. The sequence-scale similarity vector depicts the similarity of two protein sequences at different scales or different frequency bands. More specifically, this vector describes the correlation from a multiresolution point of view.
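The whole procedure (normalize, zero-pad, decompose, correlate per band, keep the maximum absolute coefficient) can be sketched as below. This is an illustration under stated assumptions, not the method as implemented by the authors: an orthonormal Haar DWT is swapped in for the Bior3.3 wavelet so that the example needs only the standard library, the correlation is computed on the wavelet coefficients of each band, the padded length is rounded up to a multiple of 2^M, and all function names are hypothetical.

```python
import math


def normalize(series):
    """Scale a numerical series to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in series]


def haar_dwt(signal, levels):
    """Return the bands [A_M, D_M, ..., D_1] of an orthonormal Haar DWT."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return [approx] + details[::-1]


def max_abs_correlation(x, y):
    """Largest absolute normalized cross-correlation over all lags."""
    energy = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if energy == 0.0:
        return 0.0
    n = len(x)
    best = 0.0
    for lag in range(-(n - 1), n):
        total = sum(x[i] * y[i - lag] for i in range(n) if 0 <= i - lag < len(y))
        best = max(best, abs(total) / energy)
    return best


def similarity_vector(series_a, series_b, levels=4):
    """Return M + 1 scores ordered (A_M, D_M, ..., D_1)."""
    block = 2 ** levels
    length = max(len(series_a), len(series_b))
    length = ((length + block - 1) // block) * block  # round up for the Haar DWT
    bands = []
    for series in (series_a, series_b):
        z = normalize(series)
        bands.append(haar_dwt(z + [0.0] * (length - len(z)), levels))
    return [max_abs_correlation(a, b) for a, b in zip(bands[0], bands[1])]
```

With `levels=4` the result has five entries, ordered here as (A4, D4, D3, D2, D1); comparing a series with itself gives a score of 1 in every band. Numerical values obtained with the Haar stand-in will differ from those reported for Bior3.3.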
The underlying property of wavelets is that they are localized in both time and frequency (Strang and Nguyen, 1996). The product of the uncertainties in time and frequency is bounded by Heisenberg's uncertainty principle; no filter can have a width product smaller than 1/
. The Gaussian filters attain this theoretical limit.
In this work we used the Bior3.3 biorthogonal wavelets (Cohen et al., 1992) for the protein signal decomposition in all cases. The biorthogonal discrete wavelet transform uses two wavelets, one for decomposition and the other for reconstruction. Hence the analysis and synthesis tasks can be separated (Cohen et al., 1992). Biorthogonal wavelets are symmetrical and have linear phase.
| Results |
|---|
Figure 3
α- and two β-peptide chains. Each of the four subunits of the hemoglobin molecule takes up one oxygen molecule. The relative positions of the subunits alter according to their state of oxygenation. The subunits are capable of cooperation, and the uptake and release of oxygen cause an allosteric conformational change in each of the subunits. Although the sequences of the hemoglobin β-chain and α-chain are not completely identical, they have exactly the same biological function (Lehninger et al., 1993).
The discrete wavelet transform up to level 4 of a protein signal, the human hemoglobin α-chain, an oxygen-carrying heme protein, is shown in Figure 4.
Figure 5
α-chains (hahu and haho). For biomedical signals, two series are deemed strongly correlated if the absolute value of the correlation coefficient exceeds 0.7 and weakly correlated if it lies between 0.5 and 0.7 (Oyster et al., 1987).
Figure 6
α- and β-hemoglobins. The similarity vector is (0.97, 0.62, 0.44, 0.39, 0.23), revealing one strongly correlated frequency band, A4 (= 0.97), and one weakly correlated frequency band, D4 (= 0.62). Because these two polypeptides share the oxygen-carrying function, it is reasonable to consider that these two frequency bands are essential to this biological function. This result is also consistent with that from the RRM: according to the RRM, a resonant frequency at 0.0234 characterizes the common biological function of hemoglobins (Table II).
Similarity measurement of functionally related sequences
The similarity of closely related sequences is obvious. However, it is difficult to find the sequence similarity for proteins which are distantly related but have a similar biological function or tertiary structure. For example, sperm whale myoglobin and lupine leghemoglobin have only 15% identical residues, which is far below the twilight zone of sequence identity, although they both contain a heme group, have similar secondary and tertiary structures and bind oxygen (Doolittle, 1981). The cross-correlation analysis of lupine leghemoglobin and sperm whale myoglobin revealed the sequence-scale similarity vector (0.40, 0.53, 0.44, 0.36, 0.25), showing a weak correlation in D4 (= 0.53). It is reasonable to deduce that this correlation is related to their shared biological function, the oxygen-binding capability.
Another example is chymotrypsin and subtilisin. These two proteins have a very low sequence identity, only 12% even using an optimal alignment method. However, they share a common proteolytic function and a common catalytic mechanism, as an example of convergent evolution (Lesk, 1988). Because of the low sequence identity within each of these two pairs of proteins, it is unlikely that they can be linked together using sequence alignment methods. However, using the sequence-scale similarity as defined above, we can still probe their distant connections. The sequence-scale similarity analysis of chymotrypsin and subtilisin revealed the sequence-scale similarity vector (0.35, 0.60, 0.42, 0.25, 0.18). At D4 (= 0.60), there is also a weak correlation for these two distantly related proteins.
Myoglobin is an oxygen-carrying globular heme protein which, like hemoglobin, is involved in oxygen storage and transport, in this case in vertebrate muscle. The myoglobin molecule is built up of eight helices, which form a box-like structure with a hydrophobic pocket. The heme group responsible for oxygen binding (Fe2+-porphyrin) is held in this pocket only by weak bonding. Hemoglobin is composed of an association of smaller subunits (α- and β-chains), and myoglobin and hemoglobin are thought to be evolutionarily related (Lehninger et al., 1993). The sequence similarity of myoglobin and hemoglobin is very poor. However, Figure 7 indicates that hemoglobin and myoglobin are not dissimilar in the sense of the sequence-scale similarity. There are two weakly correlated frequency bands, A4 and D4, which have correlation coefficients of 0.63 and 0.60, respectively. Moreover, the hemoglobin α-chain and β-chain are also correlated in these two frequency bands (see Figure 6). These two proteins have a strong correlation (correlation coefficient 0.97) and a weak correlation (correlation coefficient 0.62). This gives more evidence that frequency bands A4 and/or D4 contain the information related to the oxygen-carrying function of those proteins (hemoglobin, sigmoid oxygen saturation curve; myoglobin, hyperbolic saturation curve).
Cytochrome c is another heme-containing protein. Cytochrome c transfers electrons from the QH2-cytochrome c reductase complex to the cytochrome c oxidase complex. Figure 8
α-chain and pig cytochrome c. There is no strong cross-correlation but a weak correlation at D3. Although cytochromes and hemoglobins have very low sequence similarity, they do both possess a heme prosthetic group. Whether or not the weak correlation in D3 is due to the common heme group still needs further exploration.
Similarity measurement of functionally unrelated sequences
The sequence-scale similarity vector shows a strong cross-correlation between two closely related proteins and a certain correlation for two functionally related proteins. One requirement in choosing an appropriate analysis tool for protein sequences is that it has a direct relationship with the underlying process. This requires that a self-contained similarity measurement scheme should show no correlation for functionally and/or structurally unrelated proteins.
Lysozyme is a widespread enzyme found especially in animal secretions, in egg white and in some microorganisms. It splits the glycosidic bond between certain residues in mucopolysaccharides and mucopeptides of bacterial cell walls. Lysozyme and hemoglobin do not share any biological function. This is also shown (Figure 9) by the cross-correlation study of their DWTs. In Figure 9, there is no peak that exceeds the weak correlation boundary of 0.5.
A further study was carried out to calculate the sequence-scale similarity vectors among eight arbitrarily chosen protein sequences. The comparison result is given in Table III.
| Discussion and conclusion |
|---|
Sequence-scale similarity studies of several pairs of protein examples (Figures 5-9
This finding indicates that the functional or structural similarity of two protein sequences could be revealed by the sequence-scale study. One important criterion for comparing different computational approaches is how well they perform in finding low degrees of similarity (Bishop and Rawlings, 1996). Hence the sequence-scale similarity can be a very promising tool for sequence comparison, with the important advantage of not requiring indels.
These comparative studies have provided new insights into the structure-function relationships of certain groups of proteins. The results in Table III generally match the biological relationships of each protein pair. Using BLAST for the protein pair hemoglobin α-chain (hahu: 142 amino acids) and itself revealed the following results: score = 286 bits; identities = 142/142 (100%); positives = 142/142 (100%). The sequence-scale similarity vector shows complete correlation in all five scales (5S). Using BLAST for the protein pair hemoglobin α-chain (hahu: 142 amino acids) and sperm whale myoglobin (mwhp: 153 amino acids) revealed the following results: score = 46.2 bits; identities = 37/147 (25%); positives = 59/147 (39%); gaps = 6/147 (4%). The sequence-scale similarity vector (0.63, 0.60, 0.48, 0.31, 0.30) shows weak correlations at A4 and D4, expressed as 2W3N. It is reasonable to deduce that this correlation is related to their shared biological function, the oxygen-binding capability. Only fgfbh (basic human growth factor) and legh (lupine leghemoglobin) show a clear correlation although no common biological properties have been reported for them. The reason for this exception is still not clear.
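The shorthand used above (5S, 2W3N) can be derived mechanically from a similarity vector using the 0.7 and 0.5 boundaries quoted in the Results. The sketch below is one possible reading: the function name is hypothetical, the letters S, W and N are taken to mean strong, weak and no correlation, and counting a value of exactly 0.5 as weak is an assumption.

```python
def correlation_code(vector, strong=0.7, weak=0.5):
    """Summarize a similarity vector as counts of strong/weak/no correlation."""
    counts = {"S": 0, "W": 0, "N": 0}
    for value in vector:
        magnitude = abs(value)
        if magnitude > strong:
            counts["S"] += 1
        elif magnitude >= weak:  # assumption: the 0.5 boundary counts as weak
            counts["W"] += 1
        else:
            counts["N"] += 1
    # Omit categories with a zero count, e.g. "5S" instead of "5S0W0N".
    return "".join(f"{counts[key]}{key}" for key in "SWN" if counts[key])
```

For the hemoglobin α-chain versus myoglobin vector (0.63, 0.60, 0.48, 0.31, 0.30) this yields `2W3N`, and a vector of five ones yields `5S`, in line with the labels used in the text.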
Thus a fundamental and empirical conclusion for sequence-scale similarity measurement is reached:
- For closely related proteins, e.g. homologous proteins, there is a strong sequence-scale cross-correlation.
- For proteins that are distantly related but with similar biological functions, there is a clear sequence-scale correlation. The correlation need not necessarily appear in each scale. The correlated scales are deemed to contain the information crucial to the common biological functions.
- For proteins that are distantly related and have no common biological functions, there is generally no sequence-scale correlation.
There are two additional advantages of the sequence-scale similarity measurement. First, the significance of the similarity is given directly by the correlation value rather than by an alignment score, as shown in the discussion above. The results derived from a sequence comparison scheme measure the quality of the alignment. Thus, with the sequence-scale similarity vectors, the similarity significance can be compared, assessed and interpreted easily. For the conventional comparison methods, the comparison score needs to be processed using various empirical and statistical methods before it can be evaluated (Bishop and Rawlings, 1987; Lesk, 1988). Second, with the introduction of a cross-correlation function, the deletions and insertions which are often used in conventional sequence comparison and alignment schemes are no longer needed. The drawbacks that arise from gap insertion and deletion therefore do not apply to this method, and proteins with different sequence lengths can be compared easily.
Having in mind that the majority of theoretically predicted biological properties of proteins in this paper are functionally important, we can conclude that this study confirms our earlier hypothesis that the WT method could be established as a novel approach to examine protein sequences at different spatial resolutions.
| Notes |
|---|
1 To whom correspondence should be addressed. Present address: School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne 3001, Australia. E-mail: irena.cosic{at}rmit.edu.au
| References |
|---|
Bishop,M. and Rawlings,C. (1987) Nucleic Acid and Protein Sequence Analysis: A Practical Approach. IRL Press, Oxford.
Bishop,M. and Rawlings,C. (1996) DNA and Protein Sequence Analysis: A Practical Approach. IRL Press, Oxford.
Cohen,A., Daubechies,I. and Feauveau,J.C. (1992) Commun. Pure Appl. Math., 45, 485-560.
Cosic,I. (1990) In Wise,D. (ed.), Bioinstrumentation and Biosensors. Marcel Dekker, New York, pp. 475-510.
Cosic,I. (1994) IEEE Trans. Biomed. Eng., 41, 1101-1114.
Cosic,I. (1995) Bio/Technology, 13, 236-238.
Cosic,I. (1996) Med. Biol. Eng. Comput., 34, 139-140.
Cosic,I. (1997) The Resonant Recognition Model of Macromolecular Activity. Birkhauser, Basel.
Cosic,I. and Hearn,M.T.W. (1991) J. Mol. Recognit., 4, 57-62.
Cosic,I. and Nesic,D. (1988) Eur. J. Biochem., 170, 247-252.
Cosic,I., Pavlovic,V. and Vojisavljevic,V. (1989) Biochimie, 71, 333-342.
Cosic,I., Hodder,A., Aguilar,M. and Hearn,M.T.W. (1991) Eur. J. Biochem., 198, 113-119.
Daubechies,I. (1988) Commun. Pure Appl. Math., 41, 909-996.
Daubechies,I. (1992) Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia.
Doolittle,R.F. (1981) Science, 214, 149-159.
Fang,Q. and Cosic,I. (1998) Aus. Phy. Eng. Sci. Med., 21, 179-185.
Fang,Q. and Cosic,I. (1999) In Proceedings of the Inaugural Conference of the Victorian Chapter of the IEEE EMBS. pp. 211-214.
Goffin,V., Martial,J.A. and Summers,N.L. (1996) Protein Eng., 8, 1215-1231.
Lehninger,A.L., Nelson,D.L. and Cox,M.M. (1993) In Principles of Biochemistry. Worth, New York.
Lesk,A.M. (1988) In Computational Molecular Biology. Oxford University Press, Oxford.
Oppenheim,A.V. and Schafer,R.W. (1997) In Discrete-time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Oyster,C.K., Hanten,W.O. and Liorence,L.A. (1987) In Introduction to Research: a Guide for the Health Science Professional. Lippincott, Oxford.
Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444.
Pirogova,E. and Cosic,I. (1999) In Proceedings of the Inaugural Conference of the Victorian Chapter of the IEEE EMBS. pp. 203-206.
Strang,G. and Nguyen,T. (1996) In Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley.
Trad,C.H., Fang,Q. and Cosic,I. (2000) Biophys. Chem., 84, 149-157.
Trad,C.H., Fang,Q. and Cosic,I. (2001) In Proceedings of the 2nd Conference of the Victorian Chapter of the IEEE EMBS. pp. 115-119.
Veljkovic,V. and Slavic,I. (1972) Phys. Rev. Lett., 29, 105-108.
Received October 24, 2001; revised December 18, 2001; accepted January 4, 2002.