PEDS Advance Access first published online on January 11, 2007
This version published online on January 12, 2007

Protein Engineering Design and Selection, doi:10.1093/protein/gzl051
All versions of this article: 20/1/33 (most recent), gzl051v2, gzl051v1
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

Statistical validation of the root-mean-square-distance, a measure of protein structural proximity

Oliviero Carugo1,2,3

1 Department of General Chemistry, University of Pavia, Pavia, Italy 2 Department of Biomolecular Structural Chemistry, Max F. Perutz Laboratories, Vienna University, Campus Vienna Biocenter 5, A-1030 Vienna, Austria

3 To whom correspondence should be addressed. Email: oliviero.carugo{at}univie.ac.at


    Abstract
 
Despite its well-documented limitations, the root-mean-square-distance (rmsd) between pairs of equivalent atoms is routinely used to monitor the degree of similarity between two optimally superposed protein three-dimensional structures. A robust method for assessing the statistical significance of the difference between two rmsd values is presented here. It is based on the comparison of two protein structures through the correlation coefficient between equivalent inter-atomic distances and the subsequent application of the Fisher transformation, which allows one to estimate the probability of identity between two correlation coefficient values. The relationship between the rmsd and the Fisher correlation coefficient then allows one to estimate the statistical significance of the difference between two rmsd values. The procedure is exemplified with the analysis of the possible classifications of the immunoglobulin-like domains of filamin and is compared to related estimations of structural similarity. The ability to estimate the statistical significance of the difference between two rmsd values can be used to optimize protein structural classifications and comparisons, independent of the procedure used to derive the rmsds.

Keywords: filamin rod domain/protein structural similarity/protein structure classification/root-mean-square-distance


    Introduction
 
The similarity between two protein three-dimensional structures is usually measured through the root-mean-square-distance (rmsd) between pairs of equivalent Cα atoms, computed after optimal superposition of the two structures. This is done either when two conformations of the same protein (bound/unbound, monomeric/oligomeric, etc.) are compared or when the comparison involves two different proteins that have different amino acid sequences, though the equivalences between pairs of Cα atoms may be defined or discovered differently, depending on the degree of similarity between the sequences of the two proteins that are compared. The use of the rmsd might seem quite peculiar, since many other similarity measures have been proposed and used in various fields of structural and molecular biology (Carugo and Eisenhaber, 1997; Carugo and Pongor, 2002a, 2002b; Carugo, 2006). Such an abundance of alternatives to rmsd is largely due to the fact that the rmsd values are known to have several major drawbacks. One is obvious. Superpositions and the resulting rmsd values may not be the best way to compare two protein structures, especially if a small perturbation in just one part of a protein (e.g. a hinge between two domains) creates large rmsd values, suggesting that the two structures are very different when they are not. In other words, the local structural similarity may be much greater than the overall similarity. Such a problem may be solved, at least partially, by subdividing the structures into single domains.

Moreover, the rmsd does not behave as a metric, in the mathematical sense, unless the structures that are compared are very similar to each other (Maiorov and Crippen, 1995; Betancourt and Skolnick, 2001). For this reason, the rmsd_100 was proposed (Carugo and Pongor, 2001), which is the rmsd value that would be measured if the structures that are compared contained 100 residues. It was also observed that the rmsd values depend on the accuracy of the experimentally determined protein structures (Carugo, 2003). On average, smaller rmsd values are observed for protein structure pairs at better resolution, and the rmsd values tend to increase if the two proteins that are compared were refined at different resolutions.

These drawbacks make it difficult to use rmsd values when several protein three-dimensional structures are compared in order to classify them through multivariate statistical techniques or machine learning algorithms. Nevertheless, it is still very common for structural biologists to use and publish rmsd values.

An open question about rmsd is the statistical significance of rmsd differences. Suppose that two protein structures A and B (independent of their sequence similarity) are associated with an rmsd value equal to rAB, that two protein structures C and D (independent of their sequence similarity) are associated with an rmsd value equal to rCD, and that rAB < rCD. It is then possible to affirm that the similarity between A and B is greater than the similarity between C and D. It is nevertheless impossible to estimate the probability with which the absolute value of the difference rAB − rCD is different from 0. In other words, it is impossible, on the basis of the rmsd values alone, to estimate the statistical significance of the value of |rAB − rCD|, that is, the probability with which the hypothesis rAB = rCD can be rejected.

In the present article, we present a procedure that allows one to estimate the statistical significance of the differences between pairs of rmsd values. This is clearly of fundamental importance in all the circumstances in which protein three-dimensional structures are compared in order to extract any biological information, such as, for example, structure–function relationships or evolutionary pathways.


    Methods
 
Rmsd values

Rigid body superpositions were performed on the Cα atoms with the method of Kabsch (1976, 1978). Rmsd values were computed as

$$\mathrm{rmsd} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^{2}} \qquad (1)$$

where n is the number of pairs of equivalent Cα atoms and d_i is the Euclidean distance between the two Cα atoms of the ith pair. The rmsd values were standardized as rmsd_100 (Carugo and Pongor, 2001)

$$\mathrm{rmsd}_{100} = \frac{\mathrm{rmsd}}{1 + \ln\sqrt{n/100}} \qquad (2)$$

where n is the number of residues in the proteins that are compared. Rmsd_100 values are the rmsd that would be measured if the structures that are compared contained 100 residues.
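
The two quantities defined in equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the structures are taken to be already superposed (the Kabsch step is not reproduced), the function names are hypothetical, and the normalization rmsd / (1 + ln sqrt(n/100)) is the form attributed here to Carugo and Pongor (2001).

```python
import math


def pair_distances(coords_a, coords_b):
    """Euclidean distance d_i for each pair of equivalent atoms.

    Assumes the two coordinate lists are already optimally superposed
    and listed in the same (equivalent) order.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of atoms")
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]


def rmsd(distances):
    """Equation (1): square root of the mean squared distance."""
    n = len(distances)
    if n == 0:
        raise ValueError("at least one pair of equivalent atoms is required")
    return math.sqrt(sum(d * d for d in distances) / n)


def rmsd_100(rmsd_value, n_residues):
    """Equation (2), assumed form: rmsd / (1 + ln(sqrt(n / 100)))."""
    denominator = 1.0 + math.log(math.sqrt(n_residues / 100.0))
    if denominator <= 0.0:
        # The normalization breaks down for very short chains (n below ~14).
        raise ValueError("n_residues is too small for this normalization")
    return rmsd_value / denominator


# Toy coordinates (hypothetical, not taken from the article).
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
model_b = [(0.0, 3.0, 0.0), (3.8, 0.0, 4.0)]
example_rmsd = rmsd(pair_distances(model_a, model_b))
print(example_rmsd, rmsd_100(example_rmsd, 100))
```

For n = 100 the denominator equals 1, so rmsd_100 coincides with the rmsd; for longer chains the normalized value is smaller than the raw one.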

rF values

A protein structure containing N residues can be described with the n = N(N − 1)/2 unique distances between Cα atoms, and two three-dimensional models of the same protein can thus be described by two vectors X = (x1, x2, ... , xn) and Y = (y1, y2, ... , yn), each containing n elements, where the ith element of X is equivalent to the ith element of Y. The comparison between two protein three-dimensional structures can therefore be performed by comparing the X and Y vectors. This can be done in a wide variety of ways (Theodoridis and Koutroumbas, 2003), for example by computing the Pearson correlation coefficient rP, defined as

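$$r_P = \frac{\sum_{i=1}^{n}(x_i - \langle x\rangle)(y_i - \langle y\rangle)}{\sqrt{\sum_{i=1}^{n}(x_i - \langle x\rangle)^{2}\,\sum_{i=1}^{n}(y_i - \langle y\rangle)^{2}}} \qquad (3)$$

where <x> (or <y>) is defined as

$$\langle x\rangle = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (4)$$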
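
As an illustration of equations (3) and (4), the following sketch builds the vector of inter-atomic distances for each model and computes the Pearson correlation coefficient between the two vectors. The function names and the toy coordinates are hypothetical. The sketch enumerates pairs of distinct atoms only, which gives N(N − 1)/2 distances for N atoms; that choice is an assumption of this example.

```python
import math
from itertools import combinations


def distance_vector(coords):
    """Distances between all unordered pairs of distinct atoms.

    The pair order is fixed by itertools.combinations, so the ith element
    of one model's vector is equivalent to the ith element of another's.
    """
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def pearson(x, y):
    """Equation (3), with the means of equation (4)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Toy models with four atoms each (hypothetical coordinates).
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
model_b = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.8, 0.0), (0.2, 0.0, 3.1)]
x = distance_vector(model_a)
y = distance_vector(model_b)
print(len(x), pearson(x, y))
```

Because only internal distances enter the calculation, no superposition is needed before computing rP.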
Robust statistical methods allow one to estimate the probability with which a certain rP value is different from 0. In contrast, the comparison between two rP values cannot be performed in a statistically robust way, since the rP values are bounded between –1 and +1 and therefore the rP distribution can be symmetrical only around 0 (Dowdy et al., 2004). The comparison between two correlation coefficients is possible if the Pearson correlation coefficient is transformed into the Fisher one, defined as


$$r_F = \frac{1}{2}\ln\frac{1 + r_P}{1 - r_P} \qquad (5)$$

In this way, in fact, the comparison between two values rF and rF' is possible, independent of their values, by using a Z-test defined as

[Formula 051M6]    (6)

which is normally distributed and can lead to the probability that rF ≠ rF'.
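
The transformation and the test can be sketched as follows. Two points are assumptions of this example rather than statements taken from the article: the Z statistic is written in the standard textbook form for comparing two Fisher-transformed coefficients, Z = (rF − rF') / sqrt(1/(n − 3) + 1/(n' − 3)), and Z is converted to a probability as the two-sided mass of the standard normal distribution within |Z|. The function names are hypothetical.

```python
import math


def fisher(r_p):
    """Fisher transformation of a Pearson coefficient (equation (5))."""
    if not -1.0 < r_p < 1.0:
        raise ValueError("r_p must lie strictly between -1 and +1")
    return 0.5 * math.log((1.0 + r_p) / (1.0 - r_p))


def z_test(r_f1, n1, r_f2, n2):
    """Assumed standard Z statistic for two Fisher-transformed coefficients.

    n1 and n2 are the numbers of paired observations behind each coefficient.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each coefficient needs more than 3 observations")
    return (r_f1 - r_f2) / math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))


def probability_different(z):
    """Two-sided standard normal mass within |z| (an assumed convention)."""
    return math.erf(abs(z) / math.sqrt(2.0))


# Example: two correlation coefficients, each based on 103 distances.
z = z_test(fisher(0.95), 103, fisher(0.90), 103)
print(z, probability_different(z))
```

With this convention, |Z| of about 1.96 corresponds to a probability of 0.95 that the two coefficients differ.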

Modeling computations

Protein three-dimensional models were generated by homology modeling, a methodology that allows the prediction of the structure of a target protein on the basis of the similarity between its amino acid sequence and the sequence of a template protein, the three-dimensional structure of which is known.

We used five target sequences, taken from the CATH database of protein structural domains (Orengo et al., 1997) and classified into different Homologous Superfamily clusters. Two hundred template proteins were randomly selected to generate 200 models of each target, among the CATH domains clustered in the same Homologous Superfamily group of the target.

The MODELLER suite of programs was used in the default mode. Although this might not be recommended for producing reliable computational models, especially when the sequence similarity between target and template is very low, it does not affect the results presented in the present article, since here models are necessary only to compute rmsd_100 and rF values for a large number of protein pairs.

Since five targets were used and 5 × 200 = 1000 protein three-dimensional models were generated, 5 × 19 900 = 99 500 unique pairs of structures were compared.
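
The counts quoted above follow from the number of unordered pairs among the 200 models of each target, since models of different targets are not compared with each other. A short check, with variable names chosen only for this illustration:

```python
from math import comb

n_targets = 5
models_per_target = 200

# Unordered pairs among the models of one target: 200 * 199 / 2.
pairs_per_target = comb(models_per_target, 2)

total_models = n_targets * models_per_target
total_pairs = n_targets * pairs_per_target

print(total_models, pairs_per_target, total_pairs)  # 1000 19900 99500
```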


    Results
 
The statistical validation of the difference between two rmsd_100 values (equation (2)) was performed by exploiting the relationship between the rmsd_100 values and the Fisher correlation coefficient values (rF) (equation (5)), since the latter may be compared through standard and robust statistical techniques (see equation (6) in the Methods section for pertinent details). Rmsd_100 values were computed after optimal superposition of equivalent pairs of Cα atoms (external co-ordinates) and rF values were computed on all pairs of equivalent Cα–Cα distances (internal co-ordinates). The use of different types of co-ordinates is justified by the fact that both are invariant relative to the orientation and position of the pair of protein structures that are compared.

The relationship between rmsd_100 and rF values was determined by using a large set of pairs of protein three-dimensional structures. They were generated by homology modeling procedures with the computer program MODELLER (Marti-Renom et al., 2000) (see Methods section for pertinent details). Five protein structural domains were used as sequence targets and 200 computational models were generated for each of them by using 200 structural templates. The five targets were the domains 1a30A0, 1a3z00, 1al0G0, 1a5300 and 1a3h00 of the CATH database (Orengo et al., 1997), where they are classified in different fold types. Each structural template was randomly selected among the protein domains that are classified together with the target in CATH (at the classification level termed ‘Homologous Superfamily’). Since 200 models of each target protein were generated, 19 900 unique pairs of models were available for each target. Since five different targets were used, this resulted in 99 500 pairs. The resulting rmsd_100 values were highly variable, with only 25% of the observations being associated with rmsd_100 < 2 Å.

Figure 1 shows the dependence of the rF values on the rmsd_100 values.

Fig. 1. Dependence of the rF on the rmsd_100 values.

It can be optimally described by the relationship

[Formula 051M7]    (7)

(correlation coefficient = 0.995). It is known that two rF values can be compared with a Z-test, defined in equation (6) (see Methods section). Given the relationship between rF and rmsd_100, it is possible to re-write the Z-test as a function of the rmsd_100 values

[Formula 051M8]    (8)

Given that the probability P can be associated with a Z-test value (Dowdy et al., 2004), it is therefore possible to determine the relationship between the P values and the differences Δrmsd_100 defined as

$$\Delta\mathrm{rmsd}_{100} = |\mathrm{rmsd}_{100} - \mathrm{rmsd}_{100}'| \qquad (9)$$

Figure 2 shows the relationship between P and Δrmsd_100, which can be optimally fitted as

[Formula 051M10]    (10)

(correlation coefficient = 0.996). It can be seen that only if the two rmsd_100 values differ by at least 1.5 Å is there a probability equal to 99% that the difference is not due to chance. The probability increases to 99.9% if the difference between the two rmsd_100 values reaches 1.95 Å. Of course, the use of threshold P values over which the difference Δrmsd_100 is considered statistically significant is rather arbitrary, since different threshold values may be selected by different scientists. Nevertheless, the relationship above is a continuous function that allows the computation of any P value given its corresponding Δrmsd_100 value.


Fig. 2. Dependence of P on Δrmsd_100. The values of P were computed with equation (10). Δrmsd_100 is defined in equation (9).

 

    Discussion
 
As an example of validation of the rmsd values, the classification of the filamin structural domains is reported here.

Human filamin, which is expressed in three very similar isoforms (a, b and c), contributes to the organization of the actin-based cytoskeleton. It contains two actin-binding calponin homology domains at the N-terminus followed by a long, flexible rod region made of 24 immunoglobulin-like (Ig) domains. A similar rod containing only six domains is observed in Dictyostelium discoideum gelation factor. This rod region is responsible for several inter-molecular interactions with other cytoskeletal proteins and with various trans-membrane and cytoplasmic cell-signalling proteins (Gorlin et al., 1990; Stossel et al., 2001). The crystal structures of several of these domains were determined experimentally (Table I).


Table I. List of the domains examined in the present article

 
The classification of these Ig domains on the basis of their tertiary structures can be performed only if the proximity between each pair of domains is determined. This is possible, for example, by an all-against-all superposition. This was performed with combinatorial extension (CE) (Shindyalov and Bourne, 1998) and the resulting rmsd values were standardized to rmsd_100 (Carugo and Pongor, 2001) (see Table II).


Table II. Rmsd_100 values associated with each pair of the domains examined in the present article

 
A subsequent cluster analysis was performed with the ‘neighbor’ utility of the PHYLIP suite of programs (nearest neighbor criterion of similarity) and it resulted in the tree shown in Fig. 3. Since this is a typical hierarchical agglomerative cluster analysis, the decision of which is the best number of partitions is rather ambiguous (Theodoridis and Koutroumbas, 2003). Among the various possibilities, a reasonable partition (referred to as P_1) might be the following: domains 4 and 5 of gelation factor form a cluster (D1–D6), domains 6 of gelation factor form a second cluster (D12–D15) and a third cluster contains the domains 17, 21 and 24 of human filamin (D7–D11). Another reasonable partition (referred to as P_2) would move domain 24 of human filamin c (D7) into the cluster containing domains 4 and 5 of gelation factor (D1–D6). Both partitions contain three clusters, the only difference being the classification of domain D7, which is grouped with domains D8–D11 in partition P_1 and with domains D1–D6 in partition P_2.


Fig. 3. Classification of the immunoglobulin-like domains of filamin.

 
To decide which of these alternative partitions is better than the other, it is possible to compute the Δrmsd_100 values for two types of protein structures. On the one hand, it is possible to calculate the values of |rmsd_100(ij) – rmsd_100(ik)| in the cases in which the structures i, j and k are classified into the same cluster and, on the other, the values of Δrmsd_100 can be computed in the cases in which the protein domains are classified into different clusters. Both types of Δrmsd_100 values can be associated with a probability, according to equation (10), which indicates their statistical significance. For partition P_1, the probability P assumes the average value of 0.72(2) if the domains are clustered together and of 0.88(1) if they are segregated into different clusters. For partition P_2, the average value of the probability P is 0.67(2) if the domains are clustered into the same group and 0.88(1) if the proteins are grouped into different clusters. It can, therefore, be concluded that there is not a significant difference between partitions P_1 and P_2, since the inter-cluster probability is the same [0.88(1)] and the intra-cluster probability is very similar [0.72(2) for P_1 and 0.67(2) for P_2]. However, given that the statistical significance of the Δrmsd_100 values computed for the domains that are classified into the same cluster is slightly higher in the case of partition P_1 [0.72(2)], it seems reasonable to prefer the other partition (P_2), where the probability that similarly classified domains are different is slightly lower [0.67(2)].
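
The bookkeeping behind this comparison can be sketched as follows. Several aspects are assumptions of the sketch: a triple (i, j, k) is counted as intra-cluster only when all three domains share a cluster and as inter-cluster otherwise, the function and variable names are hypothetical, the toy rmsd_100 values are not those of Table II, and `p_of_delta` is a placeholder for equation (10), whose coefficients are not reproduced here.

```python
from itertools import combinations


def delta_rmsd_by_cluster(rmsd100, labels):
    """Split |rmsd_100(ij) - rmsd_100(ik)| into intra- and inter-cluster lists.

    rmsd100 maps frozenset({i, j}) to the rmsd_100 of domains i and j;
    labels maps each domain name to its cluster label.
    """
    intra, inter = [], []
    domains = sorted(labels)
    for i in domains:
        others = [d for d in domains if d != i]
        for j, k in combinations(others, 2):
            delta = abs(rmsd100[frozenset((i, j))] - rmsd100[frozenset((i, k))])
            # Assumption: intra-cluster means i, j and k all share a cluster.
            if labels[i] == labels[j] == labels[k]:
                intra.append(delta)
            else:
                inter.append(delta)
    return intra, inter


def mean_probability(deltas, p_of_delta):
    """Average of P over a list of differences; p_of_delta stands in for equation (10)."""
    return sum(p_of_delta(d) for d in deltas) / len(deltas)


# Toy data: three domains in cluster 1 and one domain in cluster 2.
toy_rmsd100 = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 1.5,
    frozenset(("B", "C")): 1.2,
    frozenset(("A", "D")): 4.0,
    frozenset(("B", "D")): 4.5,
    frozenset(("C", "D")): 5.0,
}
toy_labels = {"A": 1, "B": 1, "C": 1, "D": 2}
intra, inter = delta_rmsd_by_cluster(toy_rmsd100, toy_labels)
print(len(intra), len(inter))  # 3 9
```

Running the same function with the labels of each candidate partition and averaging with `mean_probability` would give the pair of intra- and inter-cluster averages that the comparison of P_1 and P_2 relies on.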

The example shown above is only a rather crude example of the use of the statistical validation of the rmsd values proposed in the present article. It must be remembered that in the case of a hierarchical agglomerative clustering other criteria for determining the optimal partition can be used (Theodoridis and Koutroumbas, 2003). The probability values P associated with each pair of rmsd_100 values can nevertheless be used in any other cluster analysis step and can thus provide a better strategy to compare and classify protein tertiary structures.

Furthermore, the P values computed with equation (10) were compared with similar measures of structural similarity provided by CE (Shindyalov and Bourne, 1998), a powerful and widely used algorithm for superposing pairs of protein three-dimensional structures. Among the results that CE offers, three are important here: the number of aligned residues (n_ali), which is the number of equivalenced atoms; the rmsd; and the Z-score, which is a measure of the probability that the rmsd values did not occur by chance. If the two compared protein structures are identical, n_ali is equal to the number of residues that they contain, rmsd is equal to zero, and Z reaches high values, close to 7. In contrast, n_ali decreases, rmsd increases and Z decreases if the two structures diverge. Z values lower than 3.5–3.8 suggest that there is no significant similarity between the two protein three-dimensional structures that are compared.

Here, 200 000 superpositions were made with CE, by using randomly selected entries of the CATH database (Orengo et al., 1997), and for each pair of superpositions the following quantities were recorded: Δrmsd (see equation (9); the standardization from rmsd to rmsd_100 was performed by using n_ali), P (see equation (10)), ΔZ = |Z1 − Z2|, and Z1 and Z2.

Obviously, a perfect correlation between the P and ΔZ values cannot exist, since small ΔZ values can also be found in cases in which both Z1 and Z2 are small. If Z1 arises from the comparison between proteins A and B and Z2 is associated with the comparison between proteins C and D, it is expected that rmsd_AB and rmsd_CD also have large values if Z1 and Z2 are small. However, the difference between rmsd_AB and rmsd_CD is not necessarily small. For example, it is possible that rmsd_AB ≈ rmsd_CD ≈ 10 Å (and P ≈ 0) but it is also possible that rmsd_AB ≈ 10 Å and rmsd_CD ≈ 15 Å (and P >> 0).

However, on the basis of 200 000 superpositions it appears that if both Z1 and Z2 are ≥7, which means that the two structures of each pair are nearly identical, P values are equal, on average, to 0.27, indicating that there is only a 27% probability that the two rmsd values associated with the two comparisons are different. In contrast, if both Z1 and Z2 are smaller than 3, indicating that the structures within each pair of proteins are very different, the average P value is 0.93, which means that the probability that the rmsd values of the two comparisons are different is, on average, much higher.

The advantage of using the P values of equation (10) is apparent. The {Delta}Z values that can be extracted from CE superpositions lack a sound statistical significance. In contrast, the P values can be applied, a posteriori, to any algorithm that provides structural alignments and the corresponding rmsd values.


    Conclusions
 
A statistically robust method to estimate the probability with which two rmsd values are different is described in the present communication.

It must be observed that the probability values P (equation (10)) are strictly geometrical, in the sense that they monitor only geometrical features like the positional vectors of the protein atoms. They are thus totally independent of the fact that proteins are linear polymers with strong constraints, since the distances between adjacent residues cannot assume arbitrary values. It is therefore not surprising that rmsd values of ~2 Å are associated with a statistically significant difference between two protein tertiary structures, though such values are often considered to be quite low and are therefore associated with some relationship between the protein structures that are compared.

It is nevertheless important to outline that the statistical appreciation of the rmsd variations is of fundamental importance in order to make a correct use of this very commonly used measure of structural similarity.


    Footnotes
 
Edited by Joel Sussman


    Acknowledgement
 
The Bioinformatics Integration Network II (GAN-AU, Austria) is gratefully acknowledged.


    References
 
Betancourt M.R. and Skolnick J. (2001) Biopolymers 59:305–309.

Carugo O. (2003) J. Appl. Crystallogr. 36:125–128.

Carugo O. (2006) Curr. Bioinform. 1:75–83.

Carugo O. and Eisenhaber F. (1997) J. Appl. Crystallogr. 30:547–549.

Carugo O. and Pongor S. (2001) Protein Sci. 10:1470–1473.

Carugo O. and Pongor S. (2002a) J. Mol. Biol. 315:887–898.

Carugo O. and Pongor S. (2002b) Curr. Protein Pept. Sci. 3:441–449.

Dowdy S., Wearden S., Chilko D. (2004) Statistics for Research (John Wiley & Sons, Hoboken).

Gorlin J.B., Yamin R., Egan S., Stewart M., Stossel T.P., Kwiatkowski D.J., Hartwig J.H. (1990) J. Cell Biol. 111:1089–1105.

Kabsch W. (1976) Acta Crystallogr. A32:922–923.

Kabsch W. (1978) Acta Crystallogr. A34:827–828.

Kiema T., Lad Y., Jiang P., Oxley C.L., Baldassarre M., Wegener K.L., Campbell I.D., Ylanne J., Calderwood D.A. (2006) Mol. Cell 21:337–341.

Maiorov V.N. and Crippen G.M. (1995) Proteins 22:273–283.

Marti-Renom M.A., Stuart A., Fiser A., Sanchez R., Melo F., Sali A. (2000) Annu. Rev. Biophys. Biomol. Struct. 29:291–325.

McCoy A.J., Fucini P., Noegel A.A., Stewart M. (1999) Nat. Struct. Biol. 6:836–841.

Nakamura F., Pudas R., Heikkinen O., Permi P., Kilpelainen I., Munday A.D., Hartwig J.H., Stossel T.P., Ylanne J. (2006) Blood Cells, Mol. Dis. 107:1925–1931.

Orengo C.A., Michie A.D., Jones S., Jones D.T., Swindells M.B., Thornton J.M. (1997) Structure 5:1093–1108.

Popowicz G.M., Mueller R., Noegel A.A., Schleicher M., Huber R., Holak T.A. (2004) J. Mol. Biol. 342:1637–1646.

Sharpe B.K., Liew C.K., Kwan A.H., Wilce J.A., Crossley M., Matthews J.M., Mackay J.P. (2005) Structure 13:257–266.

Shindyalov I.M. and Bourne P.E. (1998) Protein Eng. 11:739–747.

Stossel T.P., Condeelis J., Cooley L., Hartwig J.H., Noegel A., Schleicher M., Shapiro S.S. (2001) Nat. Rev. Mol. Cell Biol. 2:138–145.

Theodoridis S. and Koutroumbas K. (2003) Pattern Recognition (Academic Press, San Diego, USA).

Received July 19, 2006; revised October 12, 2006; accepted November 3, 2006.

