Protein Engineering, Vol. 14, No. 8, 549-555,
August 2001
© 2001 Oxford University Press
Evaluation of a novel method for the identification of coevolving protein residues
1 Departments of Pure and Applied Chemistry 3 Statistics and Modelling Science, University of Strathclyde,295 Cathedral Street, Glasgow G1 1XL, UK 2 Present address: Cledwyn Building, Institute of Biological Sciences, University of Wales, Aberystwyth SY23 3DD, UK
| Abstract |
|---|
|
|
|---|
A novel method for the identification of correlated pairs in aligned homologous protein sequences is presented and evaluated against a model of simulated protein evolution incorporating covariation. Our method is shown to be capable of identifying all coevolutionary pairs of sites, with minimal interference by background correlations, in aligned sequence sets containing
60 sequences with a tree depth of at least 30 accepted point mutations. This result is expected even in the presence of a large degree of neutral and non-correlated evolution. It is postulated that, since naturally occurring protein families may be subject to stronger selection pressures and a lesser degree of neutral evolution, this method of covariation analysis may be generally more robust than the model would indicate.
Keywords: coevolution/covariation/proteins
Abbreviations: PAM, accepted point mutation BE, branching events PPB, accepted point mutations per branch nCk, the binomial coefficient, the number of ways of selecting k items from a set of n Poshnc, the probability of the observed split under the hypothesis of no correlation.
| Introduction |
|---|
|
|
|---|
Amino acid residue covariation, also called correlated change or coevolution, is a phenomenon observable in aligned homologous protein sequences. Pollock and Taylor (1997)
|
Despite these difficulties, covariation analysis has found use in the identification of intra- (Pollock et al., 1999
In this paper, we present a novel method for identifying covariation in aligned protein sequences and evaluate this method against a model of correlated evolution derived from that of Pollock and Taylor (Pollock and Taylor, 1997
). Our covariation analysis method is shown to be particularly successful for large sequence sets and evolutionary trees that have a high substitution rate, being capable of identifying all of the simulated correlated pairs, with a very low level of background noise.
| Materials and methods |
|---|
|
|
|---|
Covariation analysis
For each position in the set of aligned sequences, a one-dimensional array is constructed, f[1], f[2], ..., f[5] of frequencies of residue frequencies at each position. Residue frequency here means the number of times a particular amino acid is seen at the position concerned, in the set of aligned sequences. For example, if alanine, proline and glycine each occur in only two sequences at position X, then their residue frequencies at position X are 2. f[n] is the number of amino acids with residue frequency n at position X. If no other amino acids occur with the same frequency at position X, then for our example f[2] = 3 for position X.
On comparing pairs of sites in the table of aligned sequences (Figure 1
), if the side chain characters are divisible into exclusive blocks, then correlated change is suspected as having been an evolutionary force. Blocks are distinguished from each other by the observation that no site within such a block shares its side chain character with the same site in another block. For example, in Figure 1
, the correlated pair of positions 3 and 4 are divided into two blocks of two pairs each, one comprising the acidic side-chains aspartate and glutamate and the other of basic amino acids lysine and arginine. A graphical method for the identification of residue blocks is shown in Figure 2
.
|
Where the distribution of sequences within these blocks is such that the complete sequence set is split into two groups, one containing only one sequence and the other containing s 1 sequences (where s is the total number of sequences), then the result is not statistically significant. Where the split is into two blocks, one of size b and the other of size s b (i.e. b > 1), then the calculation proceeds as follows:
Case 1. Where b = 2, the value of
![]() |
Case 2. Where b = 3, the value of
![]() |
Case 3. Where b = 4, the value of
![]() |
Case 4. Where b = 5, the value of
![]() |
Where three or more blocks of sequences result from these divisions the calculations become more cumbersome, but the result is likely to be of interest. With three exceptions, the probability of the splitting pattern occurring by chance is assumed to be small. The exceptions are as follows for a total of s sequences:
Case 5. Where the split is {1, 1, s 2}, the value of f[1]C1 is calculated for each site and the product divided by sC2 gives Poshnc.
Case 6. Where the split is {1, 2, s 3}, the value of f[1]C1x(f[2]C2 + (f[1] 1)C2) is calculated for each site and Poshnc is given by the product of these divided by sC3.
Case 7. Where the splitting is of the pattern {1, 1, 1, s 3}, the value of f[1]C3 is calculated for each site and the product of these is divided by sC3 to give Poshnc.
A pair of positions is considered to be a correlated pair if it belongs to one of the seven cases described above and the calculated value of Poshnc is below a threshold value (in the present analysis this is set at 0.0005) or the splitting involves three blocks of a type other than those described above (cases 5, 6 and 7) or the splitting involves more than three blocks.
Simulation of protein evolution
A test set of protein sequences with known correlated sites was generated by simulated evolution using a model similar to that devised by Pollock and Taylor (1997)
. The use of artificially evolved sequences allowed us to control the extent of coevolution and the restrictions placed on sequence evolution avoid problems of determining covariation caused by the incorrect alignment of naturally occurring sequences.
Evolutionary selection was simulated to act at each position according to one of two modelseither a background model or a correlated evolution modeldepending on whether or not the position was chosen to evolve as part of a correlated pair. Initially an ancestral sequence of 50 residues was generated with an amino acid composition corresponding to the equilibrium residue frequencies in naturally occurring proteins (Jones et al., 1992
). Three different, mutually exclusive pairs of sites in this sequence were randomly selected to evolve in a correlated fashion throughout the simulation, while the remaining 44 positions were defined as background sites.
Evolution at the background sites was simulated by substituting amino acids at each site independently according to the Dayhoff et al. (1978)
PAM002 mutation data matrix (MDM).
Each correlated pair was evolved by designating one of the two sites as a driver site and substituting amino acids at this site independently according to a modified DAY002 MDM (Table I
). This modified matrix is biased to increase the probability of substitution by an amino acid of similar physicochemical character [according to the best Euclidean partition given by Stanfel (Stanfel, 1996
)] compared with the DAY002 MDM, but retains the same forbidden substitutions as the DAY002 matrix. This bias reflects the expected high level of conservation of amino acid and physical character at functionally selected sites relative to functionally neutral sites.
|
The other site in each correlated pair was designated as a dependent site. Allowed amino acid substitutions at this site are dependent on the amino acid side chain that is present at the associated driver position. The substitution matrix for dependent positions is also a modified DAY002 matrix (Table II
|
This model of correlated change is derived from the approach published by Pollock and Taylor (Pollock and Taylor, 1997
0.98 and 0.02. This frequency is only approximate, however, because false correlation between a designated correlated pair of sites may still be detected even if the dependent residue is not in its most favoured state with respect to the driver residue. Tree Structure
The sequence data set was created by duplicating each sequence after a simulated constant time period. Since branch length of an evolutionary tree cannot be distinguished from time period, this time is measured in PAMs and substitution rate is synonymous with branch length. Between duplications, each separate sequence was allowed to evolve randomly according to the background and correlated pair matrices for the appropriate number of PAMs.
Duplication of sequences was repeated a number of times, k. The final level of the tree structure thus contained 2k sequences. Sequences can be sampled in a number of ways from this final level to produce balanced trees and various degrees of unbalanced trees (Figure 3
). The branch length of the final level could also be altered, as in the Pollock and Taylor study (Pollock and Taylor, 1997
), to produce deep, shallow or even terminal splits. For even terminal splits, the branch length of the final level is the same as for all other levels, in shallow terminal splits it is shorter and in deep terminal splits it is longer. These simulate approximately a constant evolutionary rate, ancient divergence and recent divergence respectively. For this study, all trees were evenly branched and simulated a constant evolutionary rate, with the exceptions detailed below.
|
Application
Sequence generation was automated and applied using a Silicon Graphics Indy workstation and a PC compatible clone. The resulting program, RANDCORR, is available as executables and C source code, with instructions and manual, at http://users.aber.ac.uk/lep/programs.shtml
Correlation analysis
Correlation probabilities for pairs of sites in the artificially evolved sequences were determined using the program PRESTO (Peter Bladon, Interprobe Chemical Services), which employs the method described in this paper. Since this method only considers amino acid substitutions and ignores physicochemical vectors, it identifies only magnitude of, and disregards the direction of, change.
Covariation analysis as described in this paper was applied to 140 (five replicates of 28 combinations of branching events and substitution rate) evenly balanced trees of different depths and substitution rates containing eight, 16, 32 or 64 taxa at the final level. Two parameters were varied: the substitution rate (branch length) and number of branching events. These result in three potentially influential parameters: substitution rate, number of branching events and tree depth (substitution ratexnumber of branching events).
Owing to limitations of the program PRESTO, a maximum of 60 sequences could be simultaneously analysed by this method. In the case where there were 64 sequences in the final level of the tree (six branching events), four sequences were chosen randomly and deleted. In these cases, the evolutionary tree was not quite evenly branched, nor did it have even terminal splits and the overall evolutionary rate deviated slightly from constancy.
The performance of the analytical method was assessed in three ways. First, a measure of noise was calculated, as the number of pairs identified as correlated by the program which were not any of the originally designated correlated pairs. Second, the signal was measured, as the number of pairs identified as correlated that were originally designated as correlated pairs. Third, these two measures were combined in a signal/noise ratio, expressing the proportion of identified pairs that were truly correlated as a percentage.
| Results |
|---|
|
|
|---|
Background correlated change (noise): Figure 4a
At low substitution rates of around 2 PAM per branch, the number of background correlated pairs is fairly low. Up to five branching events, the general trend is that the number of background correlations increases with substitution rate. However, for six branching events, the number of background correlations is low and independent of substitution rate (Figure 4a
). A similar pattern is seen for constant tree depth, where the number of background correlations is dependent on tree depth, but drops significantly for the trees with the greatest number of branching events (Figure 4b
).
|
The number of falsely identified correlated pairs is significantly lower for the largest number of branching events at all substitution rates greater than 12 PAM per branch (Figure 4c
On the whole, sequence sets with lower substitution rates produce fewer false correlations, but the number of false correlations is independent of substitution rate at six branching events. Fewer false correlations are also seen for shallower trees, but again this effect is not seen with the largest sequence set. These results suggest that sample size is a very important factor in determining the absolute level of noise for balanced trees with a practically constant rate of substitution. This indicates that sets of sequences with only a few members are likely to be misleading for covariation analysis.
True correlations (signal): Figure 5a
c
Since only three correlated pairs were simulated in each set of sequences, the number of identified true correlations is bounded by the values 0 and 3. The rate of substitution appears to be important for the number of true pairs that are identified. At a rate of 2 PAM per branch, the maximum average number of pairs identified was 0.5. An increase of rate to 6 PAM or more per branch improved this such that, on average, more than two of three truly correlated pairs are identified after five branching events (Figure 5a
).
|
The number of branching events also influences the success of the covariation method, as sequence sets with only three branching events consistently identified less than half of the designated correlated pairs and were always poorer than sets with more branching events and the same substitution rate. By a substitution rate of 12 PAM per branch all but the lowest number of branching events identified, on average, more than two of the three correlated pairs (Figure 5b
Figure 5c
indicates that the efficacy of the covariation detection method is independent of overall tree depth, especially at five or more branching events, although the shortest trees at 30 PAM consistently underperformed the others. Again, the number of truly correlated pairs that are identified generally increases with number of branching events.
From these results it appears that in order to identify the majority of true correlations in a sequence set the rate of substitution must be greater than 2 PAM per branch and the data set must contain at least 16 sequences. Hence analysis of medium-sized to large data sets with the method described here would be likely to identify all truly correlated pairs of sites.
Proportion of identified correlations that are true (signal/noise): Figure 6a
c
The proportion of all correlations that are true is independent of overall tree depth and appears to rise exponentially with increasing number of branching events (Figure 6a
).
|
For all rates of substitution the proportion of true correlations is fairly similar, but increases dramatically for the greatest number of branching events. This is most pronounced for the fastest rates of substitution (Figure 6b
The relationship between number of branching events and the proportion of correlated pairs that are true is plain in Figure 6c
. From these results it is clear that the greater the number of branching events, the greater is the proportion of identified correlated pairs that are truly correlated. This again indicates that the best results for this method of covariation analysis will be obtained with large datasets.
| Discussion |
|---|
|
|
|---|
Identification of true correlation
In order to validate and assess the effectiveness of this method of covariation analysis, it was important to determine whether the method could identify true correlation in a system of controlled, simulated evolution. The results indicate that our method does successfully identify the majority of true correlations and minimize the extent of spuriously identified false correlations in the system described above, provided that certain criteria are met.
The most critical factor for the identification of truly correlated pairs appears to be the number of branching events, which corresponds approximately to the number of sequences in the final level of the tree constructed for the simulated sequences and is equivalent to the sample size of a natural data set. The minimum size of data set for which the method can identify more than 60% of true correlated pairs appears to be around 16 sequences (four branching events). The proportion of true pairs that are identified rises to nearly 100% for the 16-sequence set at high rates of substitution. For larger data sets of 32 and 60 sequences (five and six branching events), more than 60% of true correlated pairs are identified at relatively low rates of substitution (6 PAM per branch).
For any given rate of substitution, increasing the sample size improves the proportion of true correlated pairs that are identified (Figure 6c
). For four branching events (16 sequences) a substitution rate of 6 PAM per branch only identifies fewer than half of the true correlated pairs, whereas a substitution rate of 20 PAM per branch identifies nearly all the true correlated pairs. This suggests that this method, when applied to a sequence set whose ancestral sequence is fairly distant (more than
50 PAM) with a fairly large sample size (more than 16 sequences) is likely to identify the majority of true correlated pairs. The larger the sample size, the less distant the ancestral sequence needs to be to ensure that the majority of true correlations are identified.
One major assumption of both our analysis and our evolutionary model, which was also made in the Pollock and Taylor model (Pollock and Taylor, 1997
), is that pairs of sites which show correlated evolution do not change their relationship, in terms of either sequential location or extent of correlation over time. As was pointed out by Pollock et al. (Pollock et al., 1999
), this may be more likely to be true in trees that do not show extensive pairwise divergence and there may be problems with obtaining enough sequences for valid analysis in such a closely related tree. However, without an appropriately tested method of analysis for this type of covariation, the extent and effect of this limitation cannot be quantified. Moreover, the preliminary results from our method applied to aligned sequence sets of the size and substitution rates expected to be most amenable to analysis show no lack of potentially correlated pairs (L.Pritchard, M.J.Dufton and K.Reynolds, unpublished work).
Background correlation
One important factor that it was required to determine was the extent to which the detection of falsely correlated background pairs affects the identification of true correlated pairs. Any method that identifies true correlation will be severely limited if, in addition, it identifies a large number of randomly correlated pairs as there is currently no way to determine empirically whether correlation is random or causal in a natural data set (Pollock and Taylor, 1997
).
Our first measure of the extent of background correlation was the number of pairs that were identified by the analysis, but were not truly correlated. From Figure 6c
it is obvious that the most important factor in the number of detected background correlations is the number of branching events, i.e. the sample size. Where the number of sequences is large (60 sequences), the number of identified background pairs is low (around eight) and independent of the rate of substitution. For smaller sequence sets, the number of background correlations that are identified increases with increasing rate of substitution.
This effect can be rationalized in that, for small sets of sequences, the probability of random substitutions at different sites in different lineages is fairly high and these can look like correlations (Figure 7
). As the number of sequences increases through duplication, the probability of two substitutions at a single site, breaking the pattern of correlation at these sites, increases.
|
Low substitution rate
At the lowest substitution rate, the analysis was seen to identify consistently low numbers of both true and false correlated pairs in this analysis (Figures 4a and 5a![]()
). This suggests that the substitution rate of 2 PAM per branch is so low that covariation is either undetectable by this form of covariation analysis or simply indistinguishable from other patterns of substitution under these conditions.
Naturally occurring sequences: selection versus random drift
In naturally occurring sequences, unlike the simulation, the probability of any substitution being accepted is site dependent. Positions in a set of aligned, wild sequences will generally be either more or less conserved than any mutation data matrix would predict. It is not clear exactly what impact this will have when this form of covariation analysis is applied to natural sequences. It is possible that, in a case where all substitutions are controlled by selection processes, only those positions that are associated with modifications of the fitness of the protein will exhibit substitution. In effect, all variable positions would then be correlated through a combined contribution to functional expression and there would be no random substitution and so no background correlations. However, if a neutral evolutionary mechanism dominated, neutral substitutions would be fixed by random drift and there would be a significant number of random substitutions and so a large number of random correlations (Figure 7
).
In effect, the model described above simulates the case where the only selective pressure is on the correlated sites and the remaining sites evolve randomly, i.e. all background substitutions are neutral. Thus the model simulates a small amount of selection in a sea of random drift. In the real world, this is not necessarily the case and the model presented here is possibly over-cautious in relation to natural sequences. These are probably subject to a greater level of selection pressure and correspondingly less open to random drift, especially if the Hopfield network model of protein evolution described in an earlier paper (Pritchard and Dufton, 2000
) is accurate. If this is so, then the model of evolution used for this analysis predicts too high a level of background correlation and the analytical method ought to perform well under conditions (branching events, substitution rate, etc.) that are less stringent than those indicated above.
Conclusion
The method for detecting correlated evolution described in this paper is likely to be capable of identifying all correlated pairs of sites, with minimal interference by background correlations, in aligned sequence sets containing around 60 sequences with a tree depth of at least 30 PAM. This result is expected even in the presence of a large degree of neutral and non-correlated evolution. It is postulated that, since naturally occurring protein families may be subject to stronger selection pressures and a lesser degree of neutral evolution, covariation analysis may be generally more robust than this model indicates.
| Notes |
|---|
4 To whom correspondence should be addressed. E-mail: mark.dufton{at}strath.ac.uk
| References |
|---|
|
|
|---|
Chelvanayagam,G., Eggenschwiler,A., Knecht,L., Gonnet,G.H. and Benner, S.A. (1997) Protein Eng., 10, 307316.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345352.
Gobel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins: Struct,. Funct. Genet., 18, 309317.[Web of Science][Medline]
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Comput. Appl. Biol. Sci., 8, 275282.
Neher,E. (1994) Proc. Natl Acad. Sci. USA, 91, 98102.
Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 271, 511523.[Web of Science][Medline]
Pollock,D.D. and Taylor,W.R. (1997) Protein Eng., 10, 647657.
Pollock,D.D., Taylor,W.R. and Goldman,N. (1999) J. Mol. Biol., 287, 187198.[Web of Science][Medline]
Pritchard,L. and Dufton,M.J. (2000) J. Theor. Biol., 202, 7786.[Web of Science][Medline]
Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349358.
Stanfel,L.E. (1996) J. Theor. Biol., 183, 195205.[Web of Science][Medline]
Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341348.
Received June 30, 2000; revised May 21, 2001; accepted May 31, 2001.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
S. G. Williams and S. C. Lovell The Effect of Sequence Evolution on Protein Structural Divergence Mol. Biol. Evol., May 1, 2009; 26(5): 1055 - 1065. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. A. A. Travers, D. C. Tully, G. P. McCormack, and M. A. Fares A Study of the Coevolutionary Patterns Operating within the env Gene of the HIV-1 Group M Subtypes Mol. Biol. Evol., December 1, 2007; 24(12): 2787 - 2801. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. A. Fares and S. A. A. Travers A Novel Method for Detecting Intramolecular Coevolution: Adding a Further Dimension to Selective Constraints Analyses Genetics, May 1, 2006; 173(1): 9 - 23. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. J. Buck and W. R. Atchley Networks of Coevolving Sites in Structural and Functional Domains of Serpin Proteins Mol. Biol. Evol., July 1, 2005; 22(7): 1627 - 1634. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


















