
Protein Engineering, Vol. 14, No. 8, 549-555, August 2001
© 2001 Oxford University Press

Evaluation of a novel method for the identification of coevolving protein residues

Leighton Pritchard1,2, Peter Bladon1, Jane M. O. Mitchell3 and Mark J. Dufton1,4

1 Departments of Pure and Applied Chemistry and 3 Statistics and Modelling Science, University of Strathclyde, 295 Cathedral Street, Glasgow G1 1XL, UK. 2 Present address: Cledwyn Building, Institute of Biological Sciences, University of Wales, Aberystwyth SY23 3DD, UK


    Abstract
 
A novel method for the identification of correlated pairs in aligned homologous protein sequences is presented and evaluated against a model of simulated protein evolution incorporating covariation. Our method is shown to be capable of identifying all coevolutionary pairs of sites, with minimal interference by background correlations, in aligned sequence sets containing ~60 sequences with a tree depth of at least 30 accepted point mutations. This result is expected even in the presence of a large degree of neutral and non-correlated evolution. It is postulated that, since naturally occurring protein families may be subject to stronger selection pressures and a lesser degree of neutral evolution, this method of covariation analysis may be generally more robust than the model would indicate.

Keywords: coevolution/covariation/proteins

Abbreviations: PAM, accepted point mutation • BE, branching events • PPB, accepted point mutations per branch • nCk, the binomial coefficient, the number of ways of selecting k items from a set of n • Poshnc, the probability of the observed split under the hypothesis of no correlation.


    Introduction
 
Amino acid residue covariation, also called ‘correlated change’ or ‘coevolution’, is a phenomenon observable in aligned homologous protein sequences. Pollock and Taylor (1997) defined covariation thus: ‘When the probabilities of [accepted] substitution at [one] site change depending on the residue at [the other] site, the two sites are correlated’. Covariation is usually manifest in the observation that amino acid ‘X’ is found at one site in the primary sequence when the residue at some other site is ‘Y’ and that the residue at the first site is ‘A’ when the residue at the second site is ‘B’ and so on (Figure 1). The detection of statistically significant correlation between residue pairs has been the subject of several analyses and various methods for identifying covariant pairs have been developed (e.g. Gobel et al., 1994; Neher, 1994; Shindyalov et al., 1994; Taylor and Hatrick, 1994; Chelvanayagam et al., 1997; Pollock and Taylor, 1997; Pollock et al., 1999). The results of these analyses were inconclusive and led to conflicting opinions about the nature and frequency of intramolecular covariation in naturally occurring protein sequences (Pazos et al., 1997). It is still not clear what drives covariation or what the phenomenon may reflect about the structure–activity relationships present in a protein family.



 
Fig. 1. Two positions in a set of aligned, homologous protein sequences are covariant if the allowed choice of side chain at one position appears linked with the choice at another position (see text). This can be identified by eye for a pair of sites in a set of aligned sequences when, for example, one site has state A while the second site has state B, but has state X when the second site has state Y. The letters in this figure, having their conventional one-letter amino acid code meanings, represent aligned pentapeptide fragments of four protein sequences. There is an invariant residue at position 2, but positions 1, 3, 4 and 5 vary across the sequences. Positions 3 and 4 are most obviously correlated, since the residue at 3 is D when 4 is E and K when 4 is R. Positions 1 and 5, however, show no clear correlation with any other position.

 
Despite these difficulties, covariation analysis has found use in the identification of intra- (Pollock et al., 1999) and intermolecular (Pazos et al., 1997) interactions amongst proteins. The presence of coevolution as a phenomenon and the ability to detect it have also been used to derive a model of protein evolution which is analogous to the process by which a trained Hopfield neural network recalls a stored ‘memory’ (Pritchard and Dufton, 2000).

In this paper, we present a novel method for identifying covariation in aligned protein sequences and evaluate this method against a model of correlated evolution derived from that of Pollock and Taylor (Pollock and Taylor, 1997). Our covariation analysis method is shown to be particularly successful for large sequence sets and evolutionary trees that have a high substitution rate, being capable of identifying all of the simulated correlated pairs, with a very low level of background noise.


    Materials and methods
 
Covariation analysis

For each position in the set of aligned sequences, a one-dimensional array is constructed, f[1], f[2], ..., f[5] of frequencies of residue frequencies at each position. Residue frequency here means the number of times a particular amino acid is seen at the position concerned, in the set of aligned sequences. For example, if alanine, proline and glycine each occur in only two sequences at position X, then their residue frequencies at position X are 2. f[n] is the number of amino acids with residue frequency n at position X. If no other amino acids occur with the same frequency at position X, then for our example f[2] = 3 for position X.
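As an illustration of the array described above, the following minimal Python sketch counts the frequencies of residue frequencies for one alignment column. The function name `frequency_of_frequencies`, the `max_n` parameter and the example column are assumptions made for illustration only; they are not taken from the program PRESTO.

```python
from collections import Counter


def frequency_of_frequencies(column, max_n=5):
    """Return {n: f[n]} for one alignment column.

    f[n] is the number of distinct amino acids that occur exactly n times
    in the column. max_n=5 mirrors the f[1]..f[5] array in the text.
    """
    # Residue frequency: residue -> number of sequences carrying it here.
    residue_counts = Counter(column)
    # Frequency of frequencies: n -> number of residues seen exactly n times.
    freq_counts = Counter(residue_counts.values())
    return {n: freq_counts.get(n, 0) for n in range(1, max_n + 1)}


# Hypothetical column: A, P and G each occur twice, L occurs four times.
column = "AAPPGGLLLL"
f = frequency_of_frequencies(column)
print(f[2])  # 3, as in the worked example in the text
```

Residues occurring more than `max_n` times are simply not recorded here, which matches the five-element array but is otherwise a simplification.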

On comparing pairs of sites in the table of aligned sequences (Figure 1), if the side chain characters are divisible into exclusive blocks, then correlated change is suspected as having been an evolutionary force. Blocks are distinguished from each other by the observation that no site within such a block shares its side chain character with the same site in another block. For example, in Figure 1, the correlated pair of positions 3 and 4 are divided into two blocks of two pairs each, one comprising the acidic side-chains aspartate and glutamate and the other of basic amino acids lysine and arginine. A graphical method for the identification of residue ‘blocks’ is shown in Figure 2.



 
Fig. 2. Two matrices are shown, relating to the sequences in Figure 1. These matrices provide a graphical method of identifying ‘blocks’ of co-occurring residues (see text). The co-occurrence of residues at positions 3 and 4 (matrix a) and at positions 3 and 5 (matrix b) is indicated by crosses. Matrix a shows that, since the two crosses cannot be linked by ‘rook's moves’ (as in the game of chess), the residue pairs DE and KR constitute two ‘blocks’. In matrix b, all the crosses may be linked by ‘rook's moves’, indicated by lines in the figure and so there is only one residue ‘block’.
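The ‘rook's moves’ criterion can be read as finding connected components: two crosses belong to the same block if they share a row or a column of the co-occurrence matrix. The Python sketch below implements that reading with a small union-find structure. It is an interpretation of the graphical method, not the PRESTO implementation, and the function name and the example residues at position 5 are hypothetical.

```python
def split_into_blocks(site_a, site_b):
    """Group sequence indices into blocks for a pair of alignment columns.

    Sequences fall in the same block when their residue pairs can be
    linked by 'rook's moves', i.e. by sharing a residue at either site.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Each sequence links its residue at site a to its residue at site b.
    for a, b in zip(site_a, site_b):
        union(("a", a), ("b", b))

    blocks = {}
    for index, a in enumerate(site_a):
        blocks.setdefault(find(("a", a)), []).append(index)
    return sorted(blocks.values())


# Positions 3 and 4 of Figure 1: D pairs with E, K pairs with R.
print(split_into_blocks("DDKK", "EERR"))  # [[0, 1], [2, 3]]
# A hypothetical uncorrelated pair in which every cross can be linked.
print(split_into_blocks("DDKK", "STTS"))  # [[0, 1, 2, 3]]
```

The block sizes, for example `[2, 2]` for the first pair, give the split that the probability calculations below operate on.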

 
Where the distribution of sequences within these blocks is such that the complete sequence set is split into two groups, one containing only one sequence and the other containing s – 1 sequences (where s is the total number of sequences), then the result is not statistically significant. Where the split is into two blocks, one of size b and the other of size s – b (where b > 1), then the calculation proceeds as follows:

Case 1. Where b = 2, the value of

is calculated for both sites. The probability that there is no correlation, Poshnc, is then calculated as the product of these values divided by sC2.

Case 2. Where b = 3, the value of

is calculated for both sites. Poshnc is then given by the product of these values divided by sC3.

Case 3. Where b = 4, the value of

is calculated for both sites. Poshnc is then given by the product of these values divided by sC4.

Case 4. Where b = 5, the value of

is calculated for both sites. Poshnc is then given by the product of these values divided by sC5.

Where three or more blocks of sequences result from these divisions the calculations become more cumbersome, but the result is likely to be of interest. With three exceptions, the probability of the splitting pattern occurring by chance is assumed to be small. The exceptions are as follows for a total of s sequences:

Case 5. Where the split is {1, 1, s – 2}, the value of f[1]C1 is calculated for each site and the product divided by sC2 gives Poshnc.

Case 6. Where the split is {1, 2, s – 3}, the value of f[1]C1 × (f[2]C2 + (f[1] – 1)C2) is calculated for each site and Poshnc is given by the product of these divided by sC3.

Case 7. Where the splitting is of the pattern {1, 1, 1, s – 3}, the value of f[1]C3 is calculated for each site and the product of these is divided by sC3 to give Poshnc.

A pair of positions is considered to be a correlated pair if it belongs to one of the seven cases described above and the calculated value of Poshnc is below a threshold value (in the present analysis this is set at 0.0005) or the splitting involves three blocks of a type other than those described above (cases 5, 6 and 7) or the splitting involves more than three blocks.
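The decision rule above can be sketched in Python as follows. This is a hedged illustration rather than the PRESTO code: the function names are assumptions, only the Poshnc expressions for cases 5 and 7 are transcribed as they are printed in the text, and cases 1 to 4 and case 6 are deliberately not implemented here. Two-block splits whose smaller block exceeds five sequences are not covered by the text and are labelled "unspecified".

```python
from math import comb

THRESHOLD = 0.0005  # threshold value quoted in the text


def classify_split(sizes):
    """Label a split (list of block sizes) according to the cases in the text."""
    sizes = sorted(sizes)
    if len(sizes) == 1:
        return "single block"
    if len(sizes) == 2:
        b = sizes[0]  # size of the smaller block
        if b == 1:
            return "not significant"
        return f"case {b - 1}" if b <= 5 else "unspecified"
    small = sizes[:-1]  # every block except the largest
    if small == [1, 1]:
        return "case 5"
    if small == [1, 2]:
        return "case 6"
    if small == [1, 1, 1]:
        return "case 7"
    return "assumed significant"


def poshnc_case5(f1_a, f1_b, s):
    """Split {1, 1, s - 2}: product of f[1]C1 at each site, divided by sC2."""
    return comb(f1_a, 1) * comb(f1_b, 1) / comb(s, 2)


def poshnc_case7(f1_a, f1_b, s):
    """Split {1, 1, 1, s - 3}: product of f[1]C3 at each site, divided by sC3."""
    return comb(f1_a, 3) * comb(f1_b, 3) / comb(s, 3)


def is_correlated(sizes, poshnc=None):
    """Apply the rule: a listed case needs Poshnc below the threshold."""
    label = classify_split(sizes)
    if label == "assumed significant":
        return True
    if label.startswith("case"):
        return poshnc is not None and poshnc < THRESHOLD
    return False


p = poshnc_case7(3, 3, 60)  # three singleton residues at each site, s = 60
print(classify_split([1, 1, 1, 57]), p < THRESHOLD)
```

With 60 sequences, sC3 is 34 220, so a case 7 split with f[1] = 3 at both sites gives a Poshnc of roughly 3 × 10^-5, which is below the threshold.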

Simulation of protein evolution

A test set of protein sequences with known correlated sites was generated by simulated evolution using a model similar to that devised by Pollock and Taylor (1997). The use of artificially evolved sequences allowed us to control the extent of coevolution and the restrictions placed on sequence evolution avoid problems of determining covariation caused by the incorrect alignment of naturally occurring sequences.

Evolutionary selection was simulated to act at each position according to one of two models (either a background model or a correlated evolution model), depending on whether or not the position was chosen to evolve as part of a correlated pair. Initially an ‘ancestral’ sequence of 50 residues was generated with an amino acid composition corresponding to the equilibrium residue frequencies in naturally occurring proteins (Jones et al., 1992). Three different, mutually exclusive pairs of sites in this sequence were randomly selected to evolve in a correlated fashion throughout the simulation, while the remaining 44 positions were defined as background sites.

Evolution at the background sites was simulated by substituting amino acids at each site independently according to the Dayhoff et al. (1978) PAM002 mutation data matrix (MDM).

Each correlated pair was evolved by designating one of the two sites as a ‘driver’ site and substituting amino acids at this site independently according to a modified DAY002 MDM (Table I). This modified matrix is biased to increase the probability of substitution by an amino acid of similar physicochemical character [according to the best Euclidean partition given by Stanfel (Stanfel, 1996)] compared with the DAY002 MDM, but retains the same forbidden substitutions as the DAY002 matrix. This bias reflects the expected high level of conservation of amino acid and physical character at functionally selected sites relative to functionally ‘neutral’ sites.


 
Table I. The modified DAY002 MDM cumulative probability matrix for driver sites

 
The other site in each correlated pair was designated as a ‘dependent’ site. Allowed amino acid substitutions at this site are dependent on the amino acid side chain that is present at the associated driver position. The substitution matrix for dependent positions is also a modified DAY002 matrix (Table II) in which each amino acid at the driver site favours a single amino acid at the dependent site (complementary pairs were chosen randomly and are not based on any concept of physicochemical compatibility), but also allows substitutions that do not conserve the correlated pair. As with the substitution matrix for the driver position, substitutions by amino acids of similar chemical character to that at the dependent site are favoured and the matrix retains the same forbidden substitutions as the DAY002 matrix. Both the driver and dependent site MDMs are set at a rate of substitution of approximately 2 PAM, the same as the background matrix. Hence background sites and correlated sites each attempt substitutions at the same rate.


 
Table II. The modified DAY002 MDM cumulative probability matrix for dependent sites
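A cumulative probability matrix such as those in Tables I and II can be sampled by drawing a uniform random number and locating it in the row for the current residue. The Python sketch below shows this for a toy three-letter alphabet. The alphabet, the numerical values and the function name are invented for illustration and are not the entries of the modified DAY002 matrices; a dependent-site matrix would additionally be indexed by the residue at the driver site.

```python
import bisect
import random

# Toy alphabet and cumulative rows (invented values, NOT from Tables I or II).
ALPHABET = "ACD"
CUMULATIVE = {
    "A": [0.98, 0.99, 1.00],  # A stays A with probability 0.98
    "C": [0.01, 0.99, 1.00],  # C stays C with probability 0.98
    "D": [0.01, 0.02, 1.00],  # D stays D with probability 0.98
}


def substitute(residue, rng):
    """Draw the next residue at a site from its cumulative probability row."""
    row = CUMULATIVE[residue]
    u = rng.random()  # uniform in [0, 1)
    index = bisect.bisect_right(row, u)  # first column whose cumulative value exceeds u
    return ALPHABET[min(index, len(ALPHABET) - 1)]


rng = random.Random(2001)  # fixed seed so that runs are repeatable
sequence = "ACDA"
evolved = "".join(substitute(residue, rng) for residue in sequence)
print(evolved)
```

Applying `substitute` once to every site corresponds to one attempted round of substitution at the rate encoded in the matrix.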

 
This model of correlated change is derived from the approach published by Pollock and Taylor (Pollock and Taylor, 1997). The most significant difference between the two models is that the Pollock and Taylor model involves a simple two-state model for producing correlated states, whereas this model is 20-state at each site. In the Pollock and Taylor model, the rate of exchange between states at the driver site is nominally variable, as is the rate of exchange between dependent states, although the dependent site equilibrium frequencies were in fact set to 1 and 0 to create completely correlated sites. The 20-state method used in this paper can be generalized to a similar two-state model, but with corresponding equilibrium frequencies of ~0.98 and 0.02. This frequency is only approximate, however, because false correlation between a designated ‘correlated pair’ of sites may still be detected even if the dependent residue is not in its ‘most favoured’ state with respect to the driver residue.

Tree structure

The sequence data set was created by duplicating each sequence after a simulated constant time period. Since branch length of an evolutionary tree cannot be distinguished from time period, this time is measured in PAMs and substitution rate is synonymous with branch length. Between duplications, each separate sequence was allowed to ‘evolve’ randomly according to the background and correlated pair matrices for the appropriate number of PAMs.

Duplication of sequences was repeated a number of times, k. The final level of the tree structure thus contained 2^k sequences. Sequences can be sampled in a number of ways from this final level to produce balanced trees and various degrees of unbalanced trees (Figure 3). The branch length of the final level could also be altered, as in the Pollock and Taylor study (Pollock and Taylor, 1997), to produce deep, shallow or even terminal splits. For even terminal splits, the branch length of the final level is the same as for all other levels, in shallow terminal splits it is shorter and in deep terminal splits it is longer. These simulate approximately a constant evolutionary rate, ancient divergence and recent divergence respectively. For this study, all trees were evenly branched and simulated a constant evolutionary rate, with the exceptions detailed below.
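The duplication scheme can be sketched in a few lines of Python. This is not the RANDCORR source: the function names are hypothetical, the toy `toy_evolve` function stands in for the matrix-driven substitution process, and it is assumed here that each duplication is followed by one branch of evolution for each copy, so that the tree depth equals the number of branching events multiplied by the branch length.

```python
import random


def grow_tree(ancestor, k, evolve):
    """Return the 2**k sequences at the final level of an evenly branched tree.

    At each of the k branching events every sequence is duplicated and
    each copy then evolves independently along one branch.
    """
    level = [ancestor]
    for _ in range(k):
        level = [evolve(seq) for seq in level for _copy in range(2)]
    return level


rng = random.Random(0)  # fixed seed so that runs are repeatable


def toy_evolve(seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Placeholder for one branch of evolution: redraw one random position."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(alphabet) + seq[pos + 1:]


leaves = grow_tree("ACDEFGHIKL", 4, toy_evolve)
print(len(leaves))  # 16, i.e. 2**4 sequences after four branching events
```

Shallow or deep terminal splits could be modelled by passing a different `evolve` function for the final level, and unbalanced trees by sampling a subset of `leaves`.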



 
Fig. 3. Schematic of the tree produced by the sequence generation methods described by Pollock and Taylor (Pollock and Taylor, 1997) and in this paper, with four duplication events. For an evenly branched tree, all 16 sequences from the final level are considered. In the maximally imbalanced tree, only five sequences (e.g. those numbered in the diagram) are included. The branches in the lowest level are longer than the rest for deep terminal splits, shorter for shallow terminal splits and the same length for even terminal splits. Note that the tree indicated here has a constant rate of substitution for each branch.

 
Application

Sequence generation was automated and applied using a Silicon Graphics Indy workstation and a PC compatible clone. The resulting program, RANDCORR, is available as executables and C source code, with instructions and manual, at http://users.aber.ac.uk/lep/programs.shtml

Correlation analysis

Correlation probabilities for pairs of sites in the artificially evolved sequences were determined using the program PRESTO (Peter Bladon, Interprobe Chemical Services), which employs the method described in this paper. Since this method only considers amino acid substitutions and ignores physicochemical vectors, it identifies only magnitude of, and disregards the direction of, change.

Covariation analysis as described in this paper was applied to 140 (five replicates of 28 combinations of branching events and substitution rate) evenly balanced trees of different depths and substitution rates containing eight, 16, 32 or 64 taxa at the final level. Two parameters were varied: the substitution rate (branch length) and number of branching events. These result in three potentially influential parameters: substitution rate, number of branching events and tree depth (substitution rate × number of branching events).

Owing to limitations of the program PRESTO, a maximum of 60 sequences could be simultaneously analysed by this method. In the case where there were 64 sequences in the final level of the tree (six branching events), four sequences were chosen randomly and deleted. In these cases, the evolutionary tree was not quite evenly branched, nor did it have even terminal splits and the overall evolutionary rate deviated slightly from constancy.

The performance of the analytical method was assessed in three ways. First, a measure of ‘noise’ was calculated, as the number of pairs identified as ‘correlated’ by the program which were not any of the originally designated ‘correlated pairs’. Second, the ‘signal’ was measured, as the number of pairs identified as correlated that were originally designated as ‘correlated pairs’. Third, these two measures were combined in a ‘signal/noise’ ratio, expressing the proportion of identified pairs that were truly correlated as a percentage.
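The three measures can be expressed compactly with set operations. The following Python sketch is an illustration only; the function name and the site numbers in the example are hypothetical, and pairs are treated as unordered so that (3, 17) and (17, 3) denote the same pair.

```python
def assess(identified, designated):
    """Return (signal, noise, percentage of identified pairs that are true)."""
    identified = {frozenset(pair) for pair in identified}
    designated = {frozenset(pair) for pair in designated}
    signal = len(identified & designated)  # true correlated pairs recovered
    noise = len(identified - designated)   # background (false) correlations
    total = signal + noise
    # The proportion is undefined when no pairs at all were identified.
    proportion_true = 100.0 * signal / total if total else None
    return signal, noise, proportion_true


# Hypothetical site numbers for the three designated correlated pairs.
designated = [(3, 17), (8, 40), (22, 31)]
identified = [(17, 3), (8, 40), (1, 2), (5, 9)]
print(assess(identified, designated))  # (2, 2, 50.0)
```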


    Results
 
Background correlated change (noise): Figure 4a–c

At low substitution rates of around 2 PAM per branch, the number of background correlated pairs is fairly low. Up to five branching events, the general trend is that the number of background correlations increases with substitution rate. However, for six branching events, the number of background correlations is low and independent of substitution rate (Figure 4a). A similar pattern is seen for constant tree depth, where the number of background correlations is dependent on tree depth, but drops significantly for the trees with the greatest number of branching events (Figure 4b).





 
Fig. 4. The number of observed background correlations, i.e. those pairs of sites that were identified as correlated which were not originally designated as such, plotted against (a) substitution rate in PAMs per branch, for a constant number of branching events (BE), (b) the number of branching events for a constant tree depth (PAM) and (c) the number of branching events for a constant rate of substitution (PAMs per branch, PPB). In all cases, indicated errors are ±1 standard deviation.

 
The number of falsely identified correlated pairs is significantly lower for the largest number of branching events at all substitution rates greater than 12 PAM per branch (Figure 4c). Generally, the number of background pairs decreases with increasing number of branching events. There is an exceptional case for three branching events, where it is likely that the relatively small sample size prevents the generation of a very large number of background correlations.

On the whole, sequence sets with lower substitution rates produce fewer false correlations, but the number of false correlations is independent of substitution rate at six branching events. Fewer false correlations are also seen for shallower trees, but again this effect is not seen with the largest sequence set. These results suggest that sample size is a very important factor in determining the absolute level of noise for balanced trees with a practically constant rate of substitution. This indicates that sets of sequences with only a few members are likely to be misleading for covariation analysis.

True correlations (signal): Figure 5a–c

Since only three correlated pairs were simulated in each set of sequences, the number of identified true correlations is bounded by the values 0 and 3. The rate of substitution appears to be important for the number of true pairs that are identified. At a rate of 2 PAM per branch, the maximum average number of pairs identified was 0.5. An increase of rate to 6 PAM or more per branch improved this such that, on average, more than two of three truly correlated pairs are identified after five branching events (Figure 5a).





 
Fig. 5. Plots of the number of true correlations, i.e. pairs identified as correlated which were originally designated correlated pairs, against (a) number of branching events for a constant substitution rate (PPB: PAMs per branch), (b) substitution rate (PAMs per branch) for constant number of branching events (BE) and (c) number of branching events for a constant tree depth (PAM).

 
The number of branching events also influences the success of the covariation method: analysis of sequence sets with only three branching events consistently identified fewer than half of the designated correlated pairs and always performed worse than analysis of sets with more branching events and the same substitution rate. At a substitution rate of 12 PAM per branch, all but the lowest number of branching events identified, on average, more than two of the three correlated pairs (Figure 5b).

Figure 5c indicates that the efficacy of the covariation detection method is independent of overall tree depth, especially at five or more branching events, although the shortest trees at 30 PAM consistently underperformed the others. Again, the number of truly correlated pairs that are identified generally increases with number of branching events.

From these results it appears that in order to identify the majority of true correlations in a sequence set the rate of substitution must be greater than 2 PAM per branch and the data set must contain at least 16 sequences. Hence analysis of medium-sized to large data sets with the method described here would be likely to identify all truly correlated pairs of sites.

Proportion of identified correlations that are true (signal/noise): Figure 6a–c

The proportion of all correlations that are true is independent of overall tree depth and appears to rise exponentially with increasing number of branching events (Figure 6a).





 
Fig. 6. Plots of the proportion of true correlated pairs against (a) number of branching events for constant tree depth (PAM), (b) number of branching events for constant rate of substitution (PPB, PAMs per branch) and (c) rate of substitution (PAMs per branch) for constant number of branching events (BE).

 
For all rates of substitution the proportion of true correlations is fairly similar, but increases dramatically for the greatest number of branching events. This is most pronounced for the fastest rates of substitution (Figure 6b). This indicates that the most reliable results are likely to be found in large data sets with a fairly high degree of sequence divergence and high rate of substitution.

The relationship between number of branching events and the proportion of correlated pairs that are true is plain in Figure 6c. From these results it is clear that the greater the number of branching events, the greater is the proportion of identified correlated pairs that are truly correlated. This again indicates that the best results for this method of covariation analysis will be obtained with large data sets.


    Discussion
 
Identification of true correlation

In order to validate and assess the effectiveness of this method of covariation analysis, it was important to determine whether the method could identify true correlation in a system of controlled, simulated evolution. The results indicate that our method does successfully identify the majority of true correlations and minimize the extent of spuriously identified false correlations in the system described above, provided that certain criteria are met.

The most critical factor for the identification of truly correlated pairs appears to be the number of branching events, which corresponds approximately to the number of sequences in the final level of the tree constructed for the simulated sequences and is equivalent to the sample size of a ‘natural’ data set. The minimum size of data set for which the method can identify more than 60% of true correlated pairs appears to be around 16 sequences (four branching events). The proportion of true pairs that are identified rises to nearly 100% for the 16-sequence set at high rates of substitution. For larger data sets of 32 and 60 sequences (five and six branching events), more than 60% of true correlated pairs are identified at relatively low rates of substitution (6 PAM per branch).

For any given rate of substitution, increasing the sample size improves the proportion of true correlated pairs that are identified (Figure 6c). For four branching events (16 sequences), a substitution rate of 6 PAM per branch identifies fewer than half of the true correlated pairs, whereas a substitution rate of 20 PAM per branch identifies nearly all the true correlated pairs. This suggests that this method, when applied to a sequence set whose ancestral sequence is fairly distant (more than ~50 PAM) with a fairly large sample size (more than 16 sequences), is likely to identify the majority of true correlated pairs. The larger the sample size, the less distant the ancestral sequence needs to be to ensure that the majority of true correlations are identified.

One major assumption of both our analysis and our evolutionary model, which was also made in the Pollock and Taylor model (Pollock and Taylor, 1997), is that pairs of sites which show correlated evolution do not change their relationship, in terms of either sequential location or extent of correlation over time. As was pointed out by Pollock et al. (Pollock et al., 1999), this may be more likely to be true in trees that do not show extensive pairwise divergence and there may be problems with obtaining enough sequences for valid analysis in such a closely related tree. However, without an appropriately tested method of analysis for this type of covariation, the extent and effect of this limitation cannot be quantified. Moreover, the preliminary results from our method applied to aligned sequence sets of the size and substitution rates expected to be most amenable to analysis show no lack of potentially correlated pairs (L.Pritchard, M.J.Dufton and K.Reynolds, unpublished work).

Background correlation

One important factor to determine was the extent to which the detection of falsely correlated background pairs affects the identification of true correlated pairs. Any method that identifies true correlation will be severely limited if, in addition, it identifies a large number of randomly correlated pairs, as there is currently no way to determine empirically whether correlation is random or causal in a natural data set (Pollock and Taylor, 1997).

Our first measure of the extent of background correlation was the number of pairs that were identified by the analysis, but were not truly correlated. From Figure 6c it is obvious that the most important factor in the number of detected background correlations is the number of branching events, i.e. the sample size. Where the number of sequences is large (60 sequences), the number of identified background pairs is low (around eight) and independent of the rate of substitution. For smaller sequence sets, the number of background correlations that are identified increases with increasing rate of substitution.

This effect can be rationalized in that, for small sets of sequences, the probability of random substitutions at different sites in different lineages is fairly high and these can look like correlations (Figure 7). As the number of sequences increases through duplication, the probability of two substitutions at a single site, breaking the pattern of ‘correlation’ at these sites, increases.



 
Fig. 7. The ancestral sequence of 50 residues has a fragment ABCDE. Of these sites, only position 3, residue C, is involved in one of the three truly correlated pairs. Here a false correlated pair occurs when, in one lineage, the non-correlated position 2 (residue B) is substituted by X and in the other lineage non-correlated position 4 (residue D) is substituted by Y. The result is that, at the non-correlated positions 2 and 4, one lineage has residues XD and the other has BY, which appear to be correlated in the analysis results.

 
Low substitution rate

At the lowest substitution rate, the analysis consistently identified low numbers of both true and false correlated pairs (Figures 4a and 5a). This suggests that the substitution rate of 2 PAM per branch is so low that covariation is either undetectable by this form of covariation analysis or simply indistinguishable from other patterns of substitution under these conditions.

Naturally occurring sequences: selection versus random drift

In naturally occurring sequences, unlike the simulation, the probability of any substitution being accepted is site dependent. Positions in a set of aligned, wild sequences will generally be either more or less conserved than any mutation data matrix would predict. It is not clear exactly what impact this will have when this form of covariation analysis is applied to natural sequences. It is possible that, in a case where all substitutions are controlled by selection processes, only those positions that are associated with modifications of the fitness of the protein will exhibit substitution. In effect, all variable positions would then be correlated through a combined contribution to functional expression and there would be no ‘random’ substitution and so no background correlations. However, if a neutral evolutionary mechanism dominated, ‘neutral’ substitutions would be fixed by random drift and there would be a significant number of random substitutions and so a large number of random correlations (Figure 7).

In effect, the model described above simulates the case where the only selective pressure is on the correlated sites and the remaining sites evolve randomly, i.e. all background substitutions are neutral. Thus the model simulates a small amount of selection in a ‘sea’ of random drift. In the real world, this is not necessarily the case and the model presented here is possibly over-cautious in relation to natural sequences. These are probably subject to a greater level of selection pressure and correspondingly less open to random drift, especially if the Hopfield network model of protein evolution described in an earlier paper (Pritchard and Dufton, 2000) is accurate. If this is so, then the model of evolution used for this analysis predicts too high a level of background correlation and the analytical method ought to perform well under conditions (branching events, substitution rate, etc.) that are less stringent than those indicated above.

Conclusion

The method for detecting correlated evolution described in this paper is likely to be capable of identifying all correlated pairs of sites, with minimal interference by background correlations, in aligned sequence sets containing around 60 sequences with a tree depth of at least 30 PAM. This result is expected even in the presence of a large degree of neutral and non-correlated evolution. It is postulated that, since naturally occurring protein families may be subject to stronger selection pressures and a lesser degree of neutral evolution, covariation analysis may be generally more robust than this model indicates.


    Notes
 
4 To whom correspondence should be addressed. E-mail: mark.dufton@strath.ac.uk


    References
 
Chelvanayagam,G., Eggenschwiler,A., Knecht,L., Gonnet,G.H. and Benner,S.A. (1997) Protein Eng., 10, 307–316.

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345–352.

Gobel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins: Struct. Funct. Genet., 18, 309–317.

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Comput. Appl. Biol. Sci., 8, 275–282.

Neher,E. (1994) Proc. Natl Acad. Sci. USA, 91, 98–102.

Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 271, 511–523.

Pollock,D.D. and Taylor,W.R. (1997) Protein Eng., 10, 647–657.

Pollock,D.D., Taylor,W.R. and Goldman,N. (1999) J. Mol. Biol., 287, 187–198.

Pritchard,L. and Dufton,M.J. (2000) J. Theor. Biol., 202, 77–86.

Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349–358.

Stanfel,L.E. (1996) J. Theor. Biol., 183, 195–205.

Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341–348.

Received June 30, 2000; revised May 21, 2001; accepted May 31, 2001.




This article has been cited by other articles:

S. G. Williams and S. C. Lovell. The Effect of Sequence Evolution on Protein Structural Divergence. Mol. Biol. Evol., May 1, 2009; 26(5): 1055–1065.

S. A. A. Travers, D. C. Tully, G. P. McCormack, and M. A. Fares. A Study of the Coevolutionary Patterns Operating within the env Gene of the HIV-1 Group M Subtypes. Mol. Biol. Evol., December 1, 2007; 24(12): 2787–2801.

M. A. Fares and S. A. A. Travers. A Novel Method for Detecting Intramolecular Coevolution: Adding a Further Dimension to Selective Constraints Analyses. Genetics, May 1, 2006; 173(1): 9–23.

M. J. Buck and W. R. Atchley. Networks of Coevolving Sites in Structural and Functional Domains of Serpin Proteins. Mol. Biol. Evol., July 1, 2005; 22(7): 1627–1634.

