PEDS Advance Access first published online on January 11, 2007
This version published online on January 12, 2007
Protein Engineering Design and Selection, doi:10.1093/protein/gzl051
Statistical validation of the root-mean-square-distance, a measure of protein structural proximity
1 Department of General Chemistry, University of Pavia, Pavia, Italy 2 Department of Biomolecular Structural Chemistry, Max F. Perutz Laboratories, Vienna University, Campus Vienna Biocenter 5, A-1030 Vienna, Austria
3 To whom correspondence should be addressed. Email: oliviero.carugo{at}univie.ac.at
| Abstract |
|---|
Despite its well-documented limitations, the root-mean-square-distance (rmsd) between pairs of equivalent atoms is routinely used to monitor the degree of similarity between two optimally superposed protein three-dimensional structures. A robust method for assessing the statistical significance of the difference between two rmsd values is presented here. It is based on the comparison of two protein structures through the correlation coefficient between equivalent inter-atomic distances and the subsequent application of the Fisher transformation, which allows one to estimate the probability of identity between two correlation coefficient values. The relationship between the rmsd and the Fisher correlation coefficient then allows one to estimate the statistical significance of the difference between two rmsd values. The procedure is exemplified with the analysis of the possible classifications of the immunoglobulin-like domains of filamin and is compared to related estimations of structural similarity. The ability to estimate the probability of the difference between two rmsd values can be used to optimize protein structural classifications and comparisons, independent of the procedure used to derive the rmsds.
Keywords: filamin rod domain/protein structural similarity/protein structure classification/root-mean-square-distance
| Introduction |
|---|
The similarity between two protein three-dimensional structures is usually measured through the root-mean-square-distance (rmsd) between pairs of equivalent Cα atoms, computed after optimal superposition of the two structures. This is done either when two conformations of the same protein (bound/unbound, monomeric/oligomeric, etc.) are compared or when the comparison involves two different proteins that have different amino acid sequences, though the equivalences between pairs of Cα atoms may be defined or discovered differently, depending on the degree of similarity between the sequences of the two proteins that are compared. The use of the rmsd might seem quite peculiar, since many other similarity measures were proposed and used in various fields of structural and molecular biology (Carugo and Eisenhaber, 1997).
Moreover, the rmsd does not behave as a metric, in the mathematical sense, unless the structures that are compared are very similar to each other (Maiorov and Crippen, 1995; Betancourt and Skolnick, 2001). For this reason, the rmsd_100 was proposed (Carugo and Pongor, 2001), which is the rmsd value that would be measured if the structures that are compared contained 100 residues. It was also observed that the rmsd values depend on the accuracy of the experimentally determined protein structures (Carugo, 2003). On average, smaller rmsd values are observed for protein structure pairs at better resolution, and the rmsd values tend to increase if the two proteins that are compared were refined at different resolutions.
These drawbacks make it difficult to use rmsd values when several protein three-dimensional structures are compared in order to classify them through multivariate statistical techniques or machine learning algorithms. Nevertheless, it is still very common for structural biologists to use and publish rmsd values.
An open question about rmsd is the statistical significance of rmsd differences. Suppose that two protein structures A and B (independent of their sequence similarity) are associated with an rmsd value equal to rAB, that two protein structures C and D (independent of their sequence similarity) are associated with an rmsd value equal to rCD, and that rAB < rCD. It is then possible to affirm that the similarity between A and B is greater than the similarity between C and D. It is nevertheless impossible to estimate the probability with which the absolute value of the difference rAB − rCD is different from 0. In other words, it is impossible, on the basis of the rmsd values alone, to estimate the statistical significance of the value of |rAB − rCD|, that is, the probability that the difference is real rather than consistent with the null hypothesis that rAB = rCD.
In the present article, we present a procedure that allows one to estimate the statistical significance of the differences between pairs of rmsd values. This is clearly of fundamental importance in all the circumstances in which protein three-dimensional structures are compared in order to extract any biological information, such as, for example, structure-function relationships or evolutionary pathways.
| Methods |
|---|
Rmsd values
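Rigid body superpositions were performed on the Cα atoms with the method of Kabsch (1976, 1978). Rmsd values were computed as

$$\mathrm{rmsd} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^{2}} \qquad (1)$$

where n is the number of pairs of equivalent Cα atoms and d_i is the Euclidean distance between the two Cα atoms of the ith pair. The rmsd values were standardized as rmsd_100 (Carugo and Pongor, 2001)

$$\mathrm{rmsd}_{100} = \frac{\mathrm{rmsd}}{1 + \ln\sqrt{n/100}} \qquad (2)$$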
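As an illustration of equations (1) and (2), the following minimal Python sketch computes the rmsd and the rmsd_100 for two sets of Cα coordinates. It assumes that the two structures have already been optimally superposed (the Kabsch superposition itself is not implemented here), that the equivalences are given by list position, and that the normalization has the form rmsd / (1 + ln sqrt(n/100)) of Carugo and Pongor (2001). The function names and the toy coordinates are illustrative only.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between paired, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

def rmsd_100(rmsd_value, n_pairs):
    """Size-normalized rmsd, assumed form: rmsd / (1 + ln(sqrt(n / 100)))."""
    denominator = 1.0 + math.log(math.sqrt(n_pairs / 100.0))
    if denominator <= 0.0:
        # The denominator vanishes near n = 13.5, so tiny fragments are rejected.
        raise ValueError("normalization is not meaningful for so few atom pairs")
    return rmsd_value / denominator

# Toy example: three Ca atoms, second structure shifted by 1 A along x.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (4.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
print(rmsd(a, b))

# An rmsd of 2 A over 250 aligned residues, rescaled to 100 residues.
print(rmsd_100(2.0, 250))
```

With exactly 100 equivalent pairs the logarithm vanishes and rmsd_100 equals the rmsd; larger alignments are scaled down and smaller ones scaled up.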
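A protein structure containing N residues can be described with the n = N(N − 1)/2 unique distances between Cα atoms, and two three-dimensional models of the same protein can thus be described by two vectors X = (x1, x2, ... , xn) and Y = (y1, y2, ... , yn), each containing n elements, where the ith element of X is equivalent to the ith element of Y. The comparison between two protein three-dimensional structures can therefore be performed by comparing the X and Y vectors. This can be done in a wide variety of ways (Theodoridis and Koutroumbas, 2003), for example by computing the Pearson correlation coefficient rP, defined as

$$r_P = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\,\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} \qquad (3)$$

where x̄ (or ȳ) is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (4)$$

The Pearson correlation coefficient was then converted, through the Fisher transformation, into the Fisher correlation coefficient rF

$$r_F = \frac{1}{2}\ln\frac{1+r_P}{1-r_P} \qquad (5)$$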
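The correlation-based comparison of equations (3) to (5) can be sketched in Python as follows. The sketch assumes that each structure is given as a list of Cα coordinates in the same residue order, so that the ith inter-atomic distance of one model is equivalent to the ith distance of the other. The helper names and the toy coordinates are illustrative and not taken from the original work.

```python
import math
from itertools import combinations

def distance_vector(coords):
    """All unique Ca-Ca distances, in a fixed pair order.

    For N atoms this yields N * (N - 1) / 2 values.
    """
    return [math.dist(p, q) for p, q in combinations(coords, 2)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def fisher(r_p):
    """Fisher transformation, 0.5 * ln((1 + r) / (1 - r)); undefined for |r| = 1."""
    return math.atanh(r_p)

# Toy models of a four-residue chain; the second is slightly distorted.
model_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.0)]
model_2 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (4.0, 3.7, 0.0), (0.5, 3.9, 2.0)]
x, y = distance_vector(model_1), distance_vector(model_2)
r_p = pearson(x, y)
r_f = fisher(r_p)
print(len(x), round(r_p, 3), round(r_f, 3))
```

Because only internal distances enter the calculation, no superposition of the two models is needed before computing rP and rF.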
In this way, in fact, the comparison between two values rF and r′F is possible, independent of their values, by using a Z-test defined as

$$Z = \frac{r_F - r'_F}{\sqrt{\dfrac{1}{n-3}+\dfrac{1}{n'-3}}} \qquad (6)$$

where n and n′ are the numbers of observations used to compute rF and r′F.

Protein three-dimensional models were generated by homology modeling, a methodology that allows the prediction of the structure of a target protein on the basis of the similarity between its amino acid sequence and the sequence of a template protein, the three-dimensional structure of which is known.
We used five target sequences, taken from the CATH database of protein structural domains (Orengo et al., 1997) and classified into different Homologous Superfamily clusters. For each target, 200 template proteins were randomly selected among the CATH domains clustered in the same Homologous Superfamily group as the target, and were used to generate 200 models of that target.
The MODELLER suite of programs was used in the default mode. Although this might not be recommended for producing reliable computational models, especially when the sequence similarity between target and template is very low, it does not affect the results presented in the present article, since here models are necessary only to compute rmsd_100 and rF values for a large number of protein pairs.
Since five targets were used and 5 × 200 = 1000 protein three-dimensional models were generated, and since the 200 models of each target give 200 × 199/2 = 19 900 unique pairs, 5 × 19 900 = 99 500 unique pairs of structures were compared.
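The pair counts quoted above follow from elementary combinatorics, as this short Python check shows (the variable names are illustrative):

```python
from math import comb

n_targets = 5
models_per_target = 200

total_models = n_targets * models_per_target
pairs_per_target = comb(models_per_target, 2)   # unordered pairs of models of one target
total_pairs = n_targets * pairs_per_target      # pairs are formed only within a target

print(total_models, pairs_per_target, total_pairs)  # 1000 19900 99500
```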
| Results |
|---|
The statistical validation of the difference between two rmsd_100 values (equation (2)) was performed by exploiting the relationship between the rmsd_100 values and the Fisher correlation coefficient values (rF) (equation (5)), since the latter may be compared through standard and robust statistical techniques (see equation (6) in the Methods section for pertinent details). Rmsd_100 values were computed after optimal superposition of equivalent pairs of Cα atoms (external co-ordinates) and rF values were computed on all pairs of equivalent Cα–Cα distances (internal co-ordinates). The use of different types of co-ordinates is justified by the fact that both are invariant relative to the orientation and position of the pair of protein structures that are compared.
The relationship between rmsd_100 and rF values was determined by using a large set of pairs of protein three-dimensional structures. They were generated by homology modeling procedures with the computer program MODELLER (Marti-Renom et al., 2000) (see Methods section for pertinent details). Five protein structural domains were used as sequence targets and 200 computational models were generated for each of them by using 200 structural templates. The five targets were the domains 1a30A0, 1a3z00, 1al0G0, 1a5300 and 1a3h00 of the CATH database (Orengo et al., 1997), where they are classified in different fold types. Each structural template was randomly selected among the protein domains that are classified together with the target in CATH (at the classification level termed Homologous Superfamily). Since 200 models of each target protein were generated, 19 900 unique pairs of models were available for each target. Since five different targets were used, this resulted in 99 500 pairs. The resulting rmsd_100 values were highly variable, with only 25% of the observations being associated with rmsd_100 < 2 Å.
Figure 1 shows the dependence of the rF values on the rmsd_100 values. It can be optimally described by the relationship

[Equation (7): fitted relationship between rF and rmsd_100]

(correlation coefficient = 0.995). It is known that two rF values can be compared with a Z-test, defined in equation (6) (see Methods section). Given the relationship between rF and rmsd_100, it is possible to re-write the Z-test as a function of the rmsd_100 values

[Equation (8): Z-test expressed as a function of the rmsd_100 values]

with Δrmsd_100 defined as

$$\Delta\mathrm{rmsd}_{100} = \left|\mathrm{rmsd}_{100} - \mathrm{rmsd}'_{100}\right| \qquad (9)$$

The probability P associated with the Z-test depends on Δrmsd_100, a dependence which can be optimally fitted as

[Equation (10): fitted relationship between the probability P and Δrmsd_100]

The threshold beyond which a Δrmsd_100 is considered statistically significant is rather arbitrary, since different threshold values may be selected by different scientists. Nevertheless, the relationship above is a continuous function that allows the computation of any P value given its corresponding Δrmsd_100 value.
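For orientation, the Z-test of equation (6), on which the rmsd_100-based test is built, can be sketched in Python. The sketch assumes the textbook form of the test for two Fisher-transformed correlation coefficients, with standard error sqrt(1/(n − 3) + 1/(n′ − 3)), and converts Z into a conventional two-sided p value with the standard normal distribution. The fitted relationships of equations (7), (8) and (10) are not reproduced, and the function name and the numbers in the example are purely illustrative.

```python
import math
from statistics import NormalDist

def fisher_z_test(r_f1, n1, r_f2, n2):
    """Compare two Fisher-transformed correlation coefficients.

    Returns (Z, two-sided p value). n1 and n2 are the numbers of
    observations behind each coefficient and must exceed 3.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each coefficient needs more than 3 observations")
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (r_f1 - r_f2) / standard_error
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Illustrative values only: two Pearson coefficients, each from 103 distances.
z, p = fisher_z_test(math.atanh(0.95), 103, math.atanh(0.90), 103)
print(round(z, 2), round(p, 4))
```

A small two-sided p value indicates that the two coefficients are unlikely to be equal; how this quantity maps onto the probability P of equation (10) is not specified in this sketch.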
| Discussion |
|---|
As an example of validation of the rmsd values, the classification of the filamin structural domains is reported here.
Human filamin, which is expressed in three very similar isoforms (a, b and c), contributes to the organization of the actin-based cytoskeleton. It contains two actin-binding calponin homology domains at the N-terminus followed by a long, flexible rod region made of 24 immunoglobulin-like (Ig) domains. A similar rod containing only six domains is observed in the Dictyostelium discoideum gelation factor. This rod region is responsible for several inter-molecular interactions with other cytoskeletal proteins and with various trans-membrane and cytoplasmic cell-signalling proteins (Gorlin et al., 1990; Stossel et al., 2001). The crystal structures of several of these domains were determined experimentally (Table I).
The classification of these Ig domains on the basis of their tertiary structures can be performed only if the proximity between each pair of domains is determined. This is possible, for example, by an all-against-all superposition. This was performed with combinatorial extension (CE) (Shindyalov and Bourne, 1998).
A subsequent cluster analysis was performed with the neighbor utility of the PHYLIP suite of programs (nearest neighbor criterion of similarity) and it resulted in the tree shown in Fig. 3. Since this is a typical hierarchical agglomerative cluster analysis, the decision of which is the best number of partitions is rather ambiguous (Theodoridis and Koutroumbas, 2003).
To decide which of these alternative partitions is better than the other, it is possible to compute the Δrmsd_100 values for two types of protein structures. On the one hand, it is possible to calculate the values of |rmsd_100(i, j) − rmsd_100(i, k)| in the cases in which the structures i, j and k are classified into the same cluster and, on the other, the values of Δrmsd_100 can be computed in the cases in which the protein domains are classified into different clusters. Both types of Δrmsd_100 values can be associated with a probability, according to equation (10), which indicates their statistical significance. For partition P_1, the probability P assumes the average value of 0.72(2) if the domains are clustered together and of 0.88(1) if they are segregated into different clusters. For partition P_2, the average value of the probability P is 0.67(2) if the domains are clustered into the same group and it is 0.88(1) if the proteins are grouped into different clusters. It can, therefore, be concluded that there is no significant difference between partitions P_1 and P_2, since the average inter-cluster probability is the same [0.88(1)] and the average intra-cluster probability is very similar [0.72(2) for P_1 and 0.67(2) for P_2]. However, given that the statistical significance of the Δrmsd_100 values computed for the domains that are classified into the same cluster is slightly higher in the case of partition P_1 [0.72(2)], it seems reasonable to prefer the other partition (P_2), where the probability that similarly classified domains are different is slightly lower [0.67(2)].
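The comparison of partitions described above can be sketched in Python as follows. Everything specific in this sketch is an assumption made for illustration: `p_of_delta` is a hypothetical stand-in for equation (10), which is not reproduced here; a triple (i, j, k) is counted as intra-cluster when all three domains share a cluster label and as inter-cluster otherwise; and the toy rmsd_100 matrix and cluster labels are invented and unrelated to the filamin data.

```python
from itertools import combinations
from statistics import mean

def mean_probabilities(rmsd100, labels, p_of_delta):
    """Average P over intra-cluster and inter-cluster triples (i, j, k).

    rmsd100: symmetric matrix (list of lists) of rmsd_100 values.
    labels: cluster label of each domain.
    p_of_delta: callable mapping a delta-rmsd_100 value to a probability
                (hypothetical stand-in for the fitted relationship).
    """
    intra, inter = [], []
    n = len(labels)
    for i in range(n):
        others = [m for m in range(n) if m != i]
        for j, k in combinations(others, 2):
            delta = abs(rmsd100[i][j] - rmsd100[i][k])
            same = labels[i] == labels[j] == labels[k]
            (intra if same else inter).append(p_of_delta(delta))
    return (mean(intra) if intra else None,
            mean(inter) if inter else None)

# Toy data: four domains, two clusters; the probability model is a placeholder.
toy_rmsd100 = [[0.0, 1.0, 1.5, 4.0],
               [1.0, 0.0, 1.2, 4.5],
               [1.5, 1.2, 0.0, 5.0],
               [4.0, 4.5, 5.0, 0.0]]
toy_labels = ["a", "a", "a", "b"]
placeholder = lambda delta: min(1.0, delta / 4.0)   # NOT equation (10)
p_intra, p_inter = mean_probabilities(toy_rmsd100, toy_labels, placeholder)
print(p_intra, p_inter)
```

Under these assumptions, a partition is preferable when its average intra-cluster probability is low while its average inter-cluster probability remains high, which is the criterion applied to P_1 and P_2 in the text.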
The example shown above is only a rather crude example of the use of the statistical validation of the rmsd values proposed in the present article. It must be remembered that in the case of a hierarchical agglomerative clustering other criteria for determining the optimal partition can be used (Theodoridis and Koutroumbas, 2003). The probability values P associated with each pair of rmsd_100 values can nevertheless be used in any other cluster analysis step and can thus provide a better strategy to compare and classify protein tertiary structures.
Furthermore, the P values computed with equation (10) were compared with similar measures of structural similarity provided by CE (Shindyalov and Bourne, 1998), a powerful and widely used algorithm for superposing pairs of protein three-dimensional structures. Among the results that CE offers, three are important here: the number of aligned residues (n_ali), which is the number of equivalenced atoms; the rmsd; and the Z-score, which is a measure of the probability that the rmsd value did not occur by chance. If the two compared protein structures are identical, n_ali is equal to the number of residues that they contain, rmsd is equal to zero, and Z reaches high values, close to 7. In contrast, n_ali decreases, rmsd increases and Z decreases if the two structures diverge. Z values lower than 3.5–3.8 suggest that there is no significant similarity between the two protein three-dimensional structures that are compared.
Here, 200 000 superpositions were made with CE, by using randomly selected entries of the CATH database (Orengo et al., 1997), and for each pair of superpositions the following quantities were recorded: Δrmsd_100 (see equation (9); the standardization from rmsd to rmsd_100 was performed by using n_ali), P (see equation (10)), ΔZ = |Z1 − Z2|, and Z1 and Z2.
Obviously, a perfect correlation between the P and ΔZ values cannot exist, since small ΔZ values can also be found in cases in which both Z1 and Z2 are small. If Z1 arises from the comparison between proteins A and B and Z2 is associated with the comparison between proteins C and D, it is expected that rmsd_AB and rmsd_CD also have large values if Z1 and Z2 are small. However, the difference between rmsd_AB and rmsd_CD is not necessarily small. For example, it is possible that rmsd_AB ≈ rmsd_CD ≈ 10 Å (and P ≈ 0) but it is also possible that rmsd_AB ≈ 10 Å and rmsd_CD ≈ 15 Å (and P >> 0).
However, on the basis of 200 000 superpositions it appears that if both Z1 and Z2 are close to 7, which means that the two structures of each pair are nearly identical, P values are equal, on average, to 0.27, indicating that there is only a 27% probability that the two rmsd values associated with the two comparisons are different. In contrast, if both Z1 and Z2 are smaller than 3, indicating that the structures within each pair of proteins are very different, the average P value is 0.93, which means that the probability that the two rmsd values of the two comparisons are different is, on average, much higher.
The advantage of using the P values of equation (10) is apparent. The ΔZ values that can be extracted from CE superpositions lack a sound statistical significance. In contrast, the P values can be applied, a posteriori, to any algorithm that provides structural alignments and the corresponding rmsd values.
| Conclusions |
|---|
A statistically robust method to estimate the probability with which two rmsd values are different is described in the present communication.
It must be observed that the probability values P (equation (10)) are strictly geometrical, in the sense that they monitor only geometrical features like the positional vectors of the protein atoms. They are thus totally independent of the fact that proteins are linear polymers with strong constraints, since the distances between adjacent residues cannot assume arbitrary values. It is therefore not surprising that rmsd values of 2 Å are associated with a statistically significant difference between two protein tertiary structures, though such values are often considered to be quite low and are therefore associated with some relationship between the protein structures that are compared.
It is nevertheless important to underline that a statistical appreciation of rmsd variations is of fundamental importance for a correct use of this very commonly used measure of structural similarity.
| Footnotes |
|---|
Edited by Joel Sussman
| Acknowledgement |
|---|
The Bioinformatics Integration Network II (GAN-AU, Austria) is gratefully acknowledged.
| References |
|---|
Betancourt M.R. and Skolnick J. (2001) Biopolymers 59:305–309.
Carugo O. (2003) J. Appl. Crystallogr. 36:125–128.
Carugo O. (2006) Curr. Bioinform. 1:75–83.
Carugo O. and Eisenhaber F. (1997) J. Appl. Crystallogr. 30:547–549.
Carugo O. and Pongor S. (2001) Protein Sci. 10:1470–1473.
Carugo O. and Pongor S. (2002a) J. Mol. Biol. 315:887–898.
Carugo O. and Pongor S. (2002b) Curr. Protein Pept. Sci. 3:441–449.
Dowdy S., Wearden S., Chilko D. (2004) Statistics for Research (John Wiley & Sons, Hoboken).
Gorlin J.B., Yamin R., Egan S., Stewart M., Stossel T.P., Kwiatkowski D.J., Hartwig J.H. (1990) J. Cell Biol. 111:1089–1105.
Kabsch W. (1976) Acta Crystallogr. A32:922–923.
Kabsch W. (1978) Acta Crystallogr. A34:827–828.
Kiema T., Lad Y., Jiang P., Oxley C.L., Baldassarre M., Wegener K.L., Campbell I.D., Ylanne J., Calderwood D.A. (2006) Mol. Cell 21:337–341.
Maiorov V.N. and Crippen G.M. (1995) Proteins 22:273–283.
Marti-Renom M.A., Stuart A., Fiser A., Sanchez R., Melo F., Sali A. (2000) Annu. Rev. Biophys. Biomol. Struct. 29:291–325.
McCoy A.J., Fucini P., Noegel A.A., Stewart M. (1999) Nat. Struct. Biol. 6:836–841.
Nakamura F., Pudas R., Heikkinen O., Permi P., Kilpelainen I., Munday A.D., Hartwig J.H., Stossel T.P., Ylanne J. (2006) Blood Cells, Mol. Dis. 107:1925–1931.
Orengo C.A., Michie A.D., Jones S., Jones D.T., Swindells M.B., Thornton J.M. (1997) Structure 5:1093–1108.
Popowicz G.M., Mueller R., Noegel A.A., Schleicher M., Huber R., Holak T.A. (2004) J. Mol. Biol. 342:1637–1646.
Sharpe B.K., Liew C.K., Kwan A.H., Wilce J.A., Crossley M., Matthews J.M., Mackay J.P. (2005) Structure 13:257–266.
Shindyalov I.M. and Bourne P.E. (1998) Protein Eng. 11:739–747.
Stossel T.P., Condeelis J., Cooley L., Hartwig J.H., Noegel A., Schleicher M., Shapiro S.S. (2001) Nat. Rev. Mol. Cell Biol. 2:138–145.
Theodoridis S. and Koutroumbas K. (2003) Pattern Recognition (Academic Press, San Diego, USA).
Received July 19, 2006; revised October 12, 2006; accepted November 3, 2006.