PEDS Advance Access published online on April 14, 2008
Protein Engineering Design and Selection, doi:10.1093/protein/gzn011
Dynameomics: mass annotation of protein dynamics and unfolding in water by high-throughput atomistic molecular dynamics simulations
1Department of Bioengineering 2Biomolecular Structure and Design Program 3Biomedical and Health Informatics, University of Washington, Box 355013, Seattle, WA 98195-5013, USA
6 To whom correspondence should be addressed. E-mail: daggett{at}u.washington.edu
| Abstract |
|---|
|
|
|---|
The goal of Dynameomics is to perform atomistic molecular dynamics (MD) simulations of representative proteins from all known folds in explicit water in their native state and along their thermal unfolding pathways. Here we present 188-fold representatives and their native state simulations and analyses. These 188 targets represent 67% of all the structures in the Protein Data Bank. The behavior of several specific targets is highlighted to illustrate general properties in the full dataset and to demonstrate the role of MD in understanding protein function and stability. As an example of what can be learned from mining the Dynameomics database, we identified a protein fold with heightened localized dynamics. In one member of this fold family, the motion affects the exposure of its phosphorylation site and acts as an entropy sink to offset another portion of the protein that is relatively immobile in order to present a consistent interface for protein docking. In another member of this family, a polymorphism in the highly mobile region leads to a host of disease phenotypes. We have constructed a web site to provide access to a novel hybrid relational/multidimensional database (described in the succeeding two papers) to view and interrogate simulations of the top 30 targets: http://www.dynameomics.org. The Dynameomics database, currently the largest collection of protein simulations and protein structures in the world, should also be useful for determining the rules governing protein folding and kinetic stability, which should aid in deciphering genomic information and for protein engineering and design.
Keywords: database/Dynameomics/molecular dynamics/protein dynamics/protein folds
| Introduction |
|---|
|
|
|---|
The Protein Data Bank (PDB) (Berman et al., 2000
The processes by which proteins convert between states on the folding pathway or substates in a given conformational ensemble are difficult to probe experimentally. Consequently, realistic all-atom explicit solvent molecular dynamics (MD) simulation (Beck and Daggett, 2004
) is an attractive approach because of its ability to provide atomic detail of protein dynamics, function, folding and unfolding. MD has been used to study a wide variety of proteins, mini-proteins and peptides for purposes as diverse as protein folding, drug design, nuclear magnetic resonance (NMR) structure determination and protein structure prediction validation. In addition, there are a number of diseases associated with protein instability and unfolding, such as amyloid diseases, and MD has been instrumental in providing structural information in these cases (Daggett, 2006
).
Our simulations of cytochrome b5 provide an example of new functional information that only became apparent through the excursions from the average observed in MD simulations. From a native-state simulation we discovered a novel, dynamic and hydrophobic cleft on the surface of the protein (Storch and Daggett, 1995
). This cleft is unobservable in the crystal structure; however, in the MD simulation, it opens and transiently exposes hydrophobic residues and provides access to the heme group, which is required for function. We proposed that this cleft would provide an ideal position for docking of the proteins electron transfer partners, allowing for protected transfer of the electron through a channel lined with aromatic residues. Subsequently, we performed experiments that verified that the protein moves as predicted from MD (Storch et al., 1999a, b). Additional MD and experimental NMR studies of the cytochrome c—cytochrome b5 complex indicated that cytochrome c does indeed bind to the cleft identified from our MD simulations, which uniquely explains the experimental data (Hom et al., 2000
). Thus, MD simulations have the ability to reveal many interesting biologically relevant phenomena that cannot be seen in static structures.
Here, we combine MD for studying protein dynamics with a high-throughput approach to simulation and annotation of protein fold space. It is our intention to simulate at least one representative of each known protein fold in its native biological state and along its thermal unfolding pathway, a project which we have dubbed Dynameomics. Both experimental and theoretical studies comparing different members of a fold family suggest that for many folds simulating a single representative may be sufficient to describe the general behavior of the entire family (Clarke et al., 1999
; Gunasekaran et al., 2001
; Gianni et al., 2003
). Therefore, we are primarily restricting ourselves to one representative from each fold; however, we are expanding our simulations to include multiple representatives of more highly populated folds.
To date, we have performed over 3000 simulations of more than 400 proteins, resulting in 103 times more structures than the PDB. Here we focus on an initial set of 188 native-state simulations, which are the result of working our way down our previously described population-ranked target list of 1130 folds from most- to least-populated (Day et al., 2003
). In choosing targets for our fold list, we focused on proteins with well-resolved experimental structures, biomedical relevance and with experimental data available to validate the simulations. Examples of some of the medically relevant proteins in our dataset include: amyloid β-precursor protein involved in Alzheimers disease; glutathione S-transferase (GST), which is involved in resistance to chemotherapy; HIV-1 protease; MAP30, which is implicated in HIV and cancer, and serum amyloid P component, amyloidosis. Our fold list was also used recently by another group to choose targets for simulations comparing different force fields (Rueda et al., 2007
). For the 188 simulations presented here, we include several measures of validation of the trajectories against available experimental data and summary statistics derived from the trajectories. In addition, we present specific simulations in detail to demonstrate general phenomena observed across the dataset. Due to the size and complexity of the database constructed from the simulations and their analyses, we were forced to develop novel strategies for storing, managing and analyzing our multi-terabyte dataset in the accompanying papers (Kehl et al., 2008
; Simms et al., 2008
). The simulations and analyses of the top 30 folds by population are available at our public web site for researchers to browse, search and download at http://www.dynameomics.org.
| Methods |
|---|
|
|
|---|
The Dynameomics protocol can be broken down into five steps (Fig. 1). The first step, known as Target Selection is where protein structures are selected as fold representatives from the PDB. The second step involved preparing the protein structure from the PDB for all-atom protein simulations in a bath of explicit water molecules at 298 Kelvin (K), which is the third step. Subsequent to simulation, in the fourth step, a standardized set of 32 analyses are calculated for the trajectory. Finally, in step five, the simulations (i.e. coordinates) and analyses are loaded into a database for annotation, browsing and data-mining. A more complete discussion of the fifth step is available in the accompanying papers (Kehl et al., 2008
|
Step 1: target selection from folds
Protein folds were identified by a consensus approach of three popular fold/domain classification systems: (i) structural classification of proteins (SCOP) v1.65 (Murzin et al., 1995
); (ii) class, architecture, topology and homologous superfamily (CATH) v2.4 (Orengo et al., 1997
); (iii) Dali Domain Dictionary v3.1β (Dietmann et al., 2001
) as described previously (Day et al., 2003
). In that work, a fold was defined as a set of protein chains that shared two or more classification assignments from these three well-known systems. For each fold, individual targets were selected from the structures in the PDB. Selection criteria were such that single chains were preferred over multimers and multi-domain proteins, shorter chains were preferred over longer ones and high-resolution crystal structures were preferred over those of lower resolution and those derived from NMR. For folds where several candidate targets remained after applying these filters, the final target selection was made giving deference to those systems with available experimental data in the forms of NMR observables, folding and unfolding studies, drug design interest, many protein–protein interactions or biomedical relevance.
At this time, we have simulated and analyzed 188 targets. Their PDB codes, chain and segment specifications, fold name, source organism and CATH class codes are shown in Table I. These targets represent 181 folds from among the 450 most populated folds. The 181 subset was chosen from the larger collection of 450 because their targets were the most amenable to simulation. In particular, each targets main chain was contiguous in sequence and was long enough that we expected it to fold autonomously. The 181 folds represent 67% of known protein domains. The discrepancy between the number of targets (188) and folds (181) is a result of two special cases: (i) multiple targets for eight of the folds were simulated where a fold was of special interest (e.g. the Rossmann fold of rank 2, DNA/RNA-binding 3-helix bundles of rank 6, and SH3 barrels of rank 17), and (ii) 2-folds were simulated together in one target. The latter case only applied to the GST target, which contains two specific folds: the N- and C-terminal domains of GST folds (ranks 12 and 27, respectively).
|
We present a breakdown of the 188 targets by their experimental source (X-ray crystallography, NMR) and how much of the source PDB was simulated in Table II. For the 72% of targets whose structures were solved by X-ray crystallography, the mean resolution was 1.97 ± 0.38 Ångstroms (Å). Table II distinguishes between targets where the entire protein chain(s) in the PDB (61%) was used and those cases where an entire subunit, i.e. a single covalently bound chain (13%) or a partial subunit was used, i.e. the protein chain was cut (26%). The mean target length was 127 ± 77 residues with the smallest target having 29 residues and the largest 411 residues (per monomer).
|
Step 2: target structure preparation
Before MD simulations can be performed, the structure must be abstracted from the PDB, missing atoms added, energy minimized and then solvated. When the structure was determined by NMR, the first model in the PDB was used. When multiple conformations for specific residues were present in the PDB, the first conformation was chosen. Hydrogen was then added and disulfide bonds were included where indicated in the PDB file by SSBOND records. Then the coordinates of the structure were minimized with steepest descent (SD) minimization for 1000 steps or until the potential energy converged. The minimized structure was then solvated in a box of water extending at least 10 Å from the protein. No waters were permitted to be closer than 1.8 Å to any protein heavy atom. The box size was adjusted slightly such that its density matched the experimental value of: 0.997 g/ml for 298 K (Kell, 1967
). Then, the water molecules (only) were subjected to 1000 steps of SD minimization, followed by 1 ps of MD where the temperature of the water was raised to approximately 40 K. The water molecules were again then SD minimized for 500 steps and, finally, the protein was SD minimized for 500 steps.
As a quality control measure, the resulting molecular systems were individually inspected. Problems such as minimization or preparation failures (e.g. failure of the total energy to decrease during minimization, typically a result of omission of a disulfide bond), and large structural changes in the protein were identified and corrected. To ensure that all systems were correct, each prepped structure was examined by hand.
Step 3: molecular dynamics simulation of target
All-atom, explicit solvent MD simulations of the 188 targets were conducted. For each target, a simulation at 298 K was conducted for at least 21 nanoseconds (ns). Structures were saved from the simulation at 1 ps resolution (yielding a trajectory) for later analyses and full restart files with velocities and accelerations were saved once per nanosecond.
The simulations were performed in keeping with our core MD protocols and simulation methods that are published elsewhere (Beck and Daggett, 2004
; Beck et al., 2005
). The simulations used fully flexible representations of protein and solvent in the Levitt et al. force field with our standard parameters (Levitt et al., 1995
, 1997
). The flexible three center (F3C) water model was used as it has demonstrated success in reproducing a large number of experimental observables across a range of temperatures (Levitt et al., 1997
; Beck et al., 2003
). The microcanonical ensemble was used, where the number of atoms, unit cell volume and total energy were conserved (NVE), as well as linear momentum (NVEp). The simulation timestep was 2 femtoseconds (fs) such that each picosecond (ps) of simulation time required 500 steps (cycles of force field evaluation and numerical integration). In turn, a 21 ns simulation requires 1.05 x 107 steps. The unit cell was periodic with minimum image conventions employed (Allen and Tildesley, 1987
). For the non-bonded interactions, a force-shifted 10 Å cut-off range was used (Levitt et al., 1995
). Non-bonded interactions between atoms in charge groups separated by three bonds were included and scaled by 0.4. The non-bonded interaction pairlists were used for no more than 6 fs (i.e. they were updated every three steps).
The simulations were performed with in lucem Molecular Mechanics (ilmm) (Beck et al., 2000–2008). A variety of computational resources were employed for the simulations. Most of the simulations were performed on Seaborg: the Department of Energys National Energy Research Supercomputing Councils 6080 processor supercomputer. Each IBM AIX based supercomputer node (IBM Nighthawk) has 16 POWER3 processors operating at 375 Mhz. In addition, some of the more recent simulations were performed on quad-core Intel Woodcrest 5130 nodes (2.00 Ghz) running Windows Server 2003 R2.
Step 4: analysis of molecular dynamics trajectories
Each trajectory was subjected to an extensive array of analyses. Broadly, these analyses can be summarized as tools to characterize gross structural changes, assessment of secondary structural changes, calculation of the number of contacts between protein atoms and solvent accessible surface area (SASA). The set of analyses and how they were calculated is shown in Table III. Some analyses could be broken down by side chain or main chain, polar or non-polar, native (i.e. present in the crystal structure) or non-native (i.e. not present in the crystal structure) and various combinations of these could be created, e.g. native non-polar contacts between side-chains. Such combinations bring the total number of properties to 32.
|
The trajectories were compared to NMR data where the experimental data were available from the BioMagResBank (BMRB) in formats that could be directly parsed or readily converted by the DOCR tools (Doreleijers et al., 2003
Residue-based contact preferences were generated from the simulations and the starting structures. We define the contact preference as the negative logarithm of the ratio of the number of observed contacts and the number of expected contacts (from some reference state), multiplied by the Boltzmann constant (k) and temperature (T). A modified version of the method of Godzik et al. (1995)
in which buried and solvent-exposed residues were considered, was used to calculate the reference state. This method yields 210 pairwise (i.e. Ala to Ala, Ala to Arg) residue contact preferences per dataset. Differences between corresponding pairs in the starting structures and the simulations were examined to identify shifts in contact preferences.
In a number of instances, analyses were compared to data derived from static structures in the PDB. Taken as its whole, the PDB is highly redundant, i.e. there are many structures of SH3 domains. To avoid problems associated with the inherent oversampling of the PDB, the ASTRAL-40 database (v1.65, Brenner et al., 2000
; Chandonia et al., 2002
, 2004
) was used as a biased reduced set of static structures. This database is a subset of SCOP and is comprised of 5674 structures in the PDB whose sequences have less then 40% sequence identity.
Mining the Dynameomics database for helix motion
In addition to the above analyses, we used the Dynameomics database (Kehl et al., 2008
; Simms et al., 2008
) to search for pairs of helices in contact at the start of the simulations and then to calculate the angle between them over the course of the trajectories. We began by using a SQL query against the DSSP analysis stored in the database to identify the location of helices at least 7 residues long in the simulation starting structures. Next, also using a SQL query, we used the coordinates of heavy atoms in the previously identified helices to find pairs of helices in the starting structures that had 10 or more contacts between them (see the contact definition in Table III). Then, Mathematica (Wolfram Research, 2005) was connected to the Dynameomics database to read in the list of pairs of helices in contact. Using Mathematica, we computed the angle between the first principal components calculated from the coordinates of each helixs C
atoms over the course of the trajectory and wrote these data back to the database for subsequent statistical analysis. Pairs of helices with significant fluctuations in this angle were identified by finding those angles with large standard deviations.
| Results |
|---|
|
|
|---|
The combined simulation time for the native-state simulations of the 188 targets presented here was 6.5 µs. Due to how simulations were packaged to run on the National Energy Resource Supercomputing Center (NERSC) supercomputer, not all simulations were conducted for precisely the same amount of time. However, every simulation was at least 21 ns long with an average length of 29.4 ± 7.1 ns. The resulting coordinate data for the trajectories required 0.5 terabyte (TB) in a binary format of our own design, at approximately 50% compression. The analyses in Table III were applied to the trajectories resulting in an additional 0.33 TB of data.
Computational assessments of simulation quality
As a check on the quality of the simulations, we can monitor the drift in total energy (sum of potential and kinetic at each simulation timestep). This assessment is unique to the microcanonical (NVE) ensemble, as conservation of energy is inherent in the classical equations of motion. However, the numerical integration of these equations with the limited precision inherent to all computer algorithms results in some amount of energy drift. We evaluate this drift in terms of the percentage of total energy change per ns. This value is quite low for all simulations, and ranged from 0.13 to 0.20% per ns. The energy change per ns scaled linearly with the number of atoms in the system, as did the total energy of the system. As temperature is not a fixed quantity in this ensemble, it can fluctuate. Over the last 10 ns of our simulations, the mean temperature was 298 ± 4.6 K.
Assessment of simulation quality by comparison with experiment
MD simulations can be validated by comparison with NMR observables (Table IV) such as Nuclear Overhauser effect crosspeaks (NOEs), which report on hydrogen that is close in space. NOE sets were readily available in a parsable form from BMRB (Doreleijers et al., 2003
) for 27 of our 188 proteins. The mean percentage of NOEs satisfied across the 27 targets was 92 ± 4% (Table IV). In our experience, more extensive sampling improves the level of agreement. In at least two cases (engrailed homeodomain, rank 6 and ubiquitin, rank 9) where a crystal structure was used and NMR NOE data were available, we found that the crystal structure satisfied fewer NOEs than did the simulation. These cases highlight how MD can improve our understanding of protein behavior by ensemble sampling in solution rather than using a single static structure.
|
Proton chemical shifts from NMR were also found to be in good agreement with experiment for the 15 proteins for which data were available in a readily processed format from the BMRB. Such chemical shifts reflect the local environment around the atom in question. The correlation coefficient between the predicted proton chemical shifts from the simulations and experiment was 0.96 (Table IV).
Case study comparison with experiment: ubiquitin
To illustrate the comparison between simulation and different experimental results, we focus on ubiquitin. For ubiquitin [PDB:1ubq], the crystal structure satisfied 94.4% of 2727 NOEs. The 21 ns simulation satisfied 95.2% of the NOEs, showing a net gain in satisfaction of 19 long-range NOEs and a decrease in the mean violation distance from 0.84 to 0.65 Å relative to the crystal structure. While 70 long-range NOEs (i
i+5 or greater) were gained, 58 were lost by the sampling of the simulation. Of the 58 lost NOEs, 41 were violated by <0.5 Å, 11 were violated by <1 Å, 5 were violated by <2 Å and 1 was violated by >2 Å. The experimental proton chemical shifts from experiment were compared with the MD-predicted values, and the correlation coefficient between the two sets of values was 0.98.
Comparison with B-factors is also of interest because they reflect the extent of mobility of the protein in its crystalline state. B-factors contain contributions from internal motion (modeled as root mean square fluctuations about the mean) as well as lattice disorder and phasing. Thus, the comparison with mobility in solution can be problematic. Nevertheless, the B-factors (Vijay-Kumar et al., 1987
) and C
RMSF values (Figs 2A and B) both indicate that the tail residues are dynamic and the correlation between the two sets is acceptable (R = 0.73), particularly given the different environments. Ubiquitin is attached through its C-terminal tail to proteins to tag them for degradation. The C
RMSD calculated for all residues and for all residues excluding the tails (residues 2 to 71) were compared (Fig. 2C); the tails contribute an average of
0.4 Å C
RMSD throughout the simulation. The S2 values for NH bond motion from NMR relaxation experiments of ubiquitin (Schneider et al., 1992
) were also compared to S2MD values calculated from the simulations (Wong and Daggett, 1998
) (Fig. 2D), and the correlation is acceptable (R = 0.76).
|
Summary statistics from the Dynameomics simulations
Summary statistics based on the set of properties calculated for all of our trajectories (see Table III) are included in Tables V and VI. In order to aggregate these statistics in a consistent manner from a set of proteins that is designed to be varied in structure and chain length, we have used two normalization approaches. The first was to normalize the individual properties by the number of residues in the target of origin, before aggregation. For properties like radius of gyration, C
RMSD and fraction of residues in a given secondary structure type, this normalization is inherent to the underlying analysis. In other cases, the final units are per residue, e.g. SASA has units of Å2/residue. The second approach was to normalize by the mean value derived from each simulations equilibration period (0–1 ns). This essentially converts the quantities into fractional changes in the simulation relative to the equilibration values. Table V breaks down these statistics into protein structural classes based on the targets folds, i.e. all
, all β,
/β,
+β. Table VI breaks these statistics down into the experimental source for each target, i.e. X-ray crystallography or NMR. Interestingly, comparison of the values reflecting fractional changes in various protein properties over the simulations indicates that the structural properties such as solvent accessibility, secondary structure and radius of gyration change very little even though the C
RMSD may increase.
|
|
Ramachandran distributions
Ramachandran distributions for each residue type were calculated and as an example the data for Ile are shown in Fig. 3. The distributions from Dynameomics are calculated with the (
,
) angles for Ile residues at every saved trajectory time point, post-equilibration, resulting in at least 20 000 points for each Ile residue. In addition to the distribution for all Ile residues in the Dynameomics dataset, we depict distributions of selected Ile residues that started the simulations in helical, beta and other (i.e. turns, extended, other or unstructured) conformations. For the purposes of these secondary structure assignments, we required at least three consecutive residues of
-helix or β-sheet to be present. We found it useful to examine the Dynameomics data in the context of Ramachandran distributions from two other sources: those derived from the static structures of proteins in the ASTRAL-40 dataset (Brenner et al., 2000
; Chandonia et al., 2002
, 2004
) and for comparison to a sterically unrestrained Ile distribution: data from our simulations at 298 K of the end-capped (acetylated and amidated) pentapeptide of sequence Gly-Gly-Ile-Gly-Gly (GGIGG) (Beck et al., 2008
).
|
There are 1342 Ile residues in the 188 targets, of which 382 are in helices, 303 in beta sheets and 657 in non-
, non-β classifications in their starting structures. We chose Ile for depiction in Fig. 3 because it is fairly evenly distributed between helices and β-sheets. In contrast, Ala residues in secondary structure elements are predominantly in helical conformations in the starting structures: of 1793 residues in the target set, the distribution was 650 and 187 for helices and beta sheets, respectively. Asn residues had the most unbalanced distribution of all the residues: of the 1067 Asn residues, 183 were in helices while only 55 were in β-structures. Residue-based contact preferences
Pairwise residue contact preferences differed between the starting structures and the simulations (Fig. 4). In our simulations, disulfide bonds are modeled using covalently bound Cys, while free Cys residues were modeled with a complete thiol group (Cyh). For obvious reasons, the most favorable contact in both sets was between disulfide-bonded Cys, –1.22 kT in the starting structures and –1.24 kT in the simulations. The second most favorable group of interactions involved charged residues (Glu, Asp, Lys, Arg). Contacts between oppositely charged residues were favorable in both datasets, averaging –0.87 ± 0.05 kT in the starting structures and –1.00 ± 0.04 kT in the simulations. Interestingly, contacts between like-charged residues became more favorable as well, shifting –0.22 ± 0.09 kT from the starting structures to the simulations, due to some relief of the repulsion from the slight expansion of the protein in response to the solvent environment and kinetic energy. Overall, there was an average favorable shift of –0.16 ± 0.12 kT in all types of charged residue contacts. In contrast, the hydrophobic residues exhibited only a small average favorable shift of –0.04 ± 0.05 kT.
|
Relative motions of helices in the native state
By mining our Dynameomics database, we identified 390 pairs of helices in contact in the starting structures for the 188 targets. Of these, 17 helix pairs had angular distributions with large standard deviations (greater than 10°) over time. One example is the change in orientation of helices
3 and
4 of Chemotaxis protein Y (CheY, [PDB:3chy]).
CheY is a 128 residue protein with a three-layer
/β/
Rossmann fold of rank 2 (Fig. 1, Step 2). While the 36 ns native-state simulation of CheY maintained the overall fold of the protein, there was an increase of approximately 35° in the angle between
3 and
4 due to the movement of
3 (Fig. 5A and B).
3 changed orientation to move 5 Å away from Asp 57 (a critical phosphorylation site in CheY), extending the loop between β3 and
3 (Fig. 5A). The
2 and
3 helix pair also had heightened dynamics throughout the simulation with an increase in angle of approximately 20°. In contrast,
4 and
5 were stable throughout the simulation. This latter pair does not meet our 10 contacts criterion as described in the methods, so we measure structural stability in terms of distance between the ends of the helices; the standard deviations for these distances are ±1.15 and ±1.03 Å.
|
| Discussion |
|---|
|
|
|---|
Assessments of simulation quality reveal simulations to be stable
The extent of energy drift in our Dynameomics simulations is quite minimal and indicates that the simulation engine is working correctly and that the simulations are computationally correct. The drift is a result of numerical round-off error inherent to any numerical integration by limited precision computing, even using 64-bit precision. To the best of our knowledge, these are among the most stable, most rigorous, protein simulations available. A more complete discussion of the origin of energy drift can be found elsewhere (Beck and Daggett, 2004
).
Given ergodic sampling, each simulation should reproduce the available experimental observables such as those from NMR. However, in our experience, most proteins require many tens to hundreds of ns to sample a sufficiently broad set of conformations to satisfy all the NOEs (Beck and Daggett, 2007
). We attribute many of the unsatisfied NOEs from our simulations to this limitation. Another factor was that some target domains were abstracted from larger protein entities. It is common for these domains to change structure, sometimes radically, in the absence of their binding partners or the rest of their chain. In addition, as we illustrate with ubiquitin, the simulations are in good agreement with NOEs and chemical shifts from NMR, S2 order parameters from NMR relaxation experiments and crystallographic B-factors.
Summary statistics depict stable proteins across structural classes and experimental sources
In the selected summary statistics, presented in Tables V and VI, there were no significant differences in overall protein properties between any of the structural subsets. This was affirmed when the complete set of summary statistics from all of the analytical properties were examined (data not shown). Our interpretation of these data is that the general protein structural classes (i.e. all
, all β,
/β,
+β) are free from any systematic bias as a result of the potential function or the simulation engine (such as overestimation of helical propensity). Further, NMR structures were indistinguishable from X-ray crystal structures for the purposes of simulation.
Several properties calculated from the trajectories, such as the count of intramolecular hydrogen bonds and main-chain to main-chain contacts, showed a decrease relative to the values calculated from the minimized starting structure (data not shown). Similarly, other properties, such as SASA, RMSD and radius of gyration exhibited an increase relative to the values from the minimized starting structures (Tables V and VI). The change is an expected consequence of increased thermal energy in the system and move to an explicit solvent environment.
An extreme example of this kind of change in compactness can be found in the excised DNA-binding domain HU [PDB:1huu] (Fig. 6A). This system is a dimer in its biological form (Fig. 6B). We initially simulated it as a monomer of 80 residues. At the start of that simulation, the hydrophobic dimer interface was exposed. The interface was buried in the first 10 ns by contortion of the domains second helix (Fig. 6A). This burial resulted in a drop in the total SASA of approximately 13%, a collapse in the monomers C
radius of gyration of nearly 20% and a large C
RMSD to the crystal structure of 7.5 ± 1.2 Å. In subsequent simulations of the dimer [PDB:1hue] (Fig. 6B), the equivalent monomer remained stable with a lower C
RMSD to crystal (3.2 ± 0.4 Å).
|
Where we observed a greater than two standard deviations of change in SASA or radius of gyration, it was exclusively the case that these targets had been excised from larger proteins. For these cases, our initial approach to excising target domains may not have been appropriate for all folds, particularly where it resulted in cutting peptide bonds and removing a covalently bound sub-chain from a larger protein or protein assembly. In one case, we identified 2 folds, glutathione S-tranferases N-terminal domain (fold rank 12) and its C-terminal domain (fold rank 27), that were amenable to simulation together as a single monomer with one continuous chain. As a proactive measure to avoid the problems related to excision of these domains, we simulated both of the target domains as a complete monomer (and subsequently the biological dimer) [PDB:1ev4]. We are now simulating all excised domains alone and in their biological units and eventually, their complete assemblies. The latter approach allows us to simulate multiple folds per target, thereby enhancing our understanding of the interplay between domains.
Comparison of Ramachandran maps from dynamic and static sources
For most residues, the location of the
-helical and β-strand peaks in the Ramachandran distribution may differ slightly depending on the origin of the data. This difference is illustrated by the Ramachandran maps for Ile residues given in Fig. 3. Relative to the ASTRAL-40 data from static structures, the Dynameomics data at 298 K are slightly more dispersed; however, there is little change in the location of the peaks. The difference in β-strand population between ASTRAL-40 and Dynameomics is primarily a result of the wider peak region in the dynamic dataset, not a significant change away from β-strand, as can be seen in the data from β-strand only residues in the Dynameomics dataset at 298 K. Similarly, from the Ramachandran maps of Ile residues starting in helical space, we see little change and a strong peak at the maximum observed in the ASTRAL-40 dataset. In contrast, the distribution for the Ile in the GGIGG peptide simulation shows marked shifts in the maxima for
-helix and β-strand, in addition to an overall more dispersed distribution when compared to those from Dynameomics and ASTRAL-40. In the well-hydrated sterically unhindered environment of GGIGG, the balance of energetic contributions for Ile is very different than when the residue resides in the context of folded proteins. Similar results are seen for the other 19 naturally occurring amino acids (Beck et al., 2008
).
A shift in contact preferences in the context of stable structures
We observed several general trends in contact preferences when comparing the simulations and the starting structures. First, the increased association of oppositely charged residues is indicative of the formation of salt bridges. Second, the increase in the favorability of like-charged interactions is due to formation of multi-body salt bridge networks, the presence of water-mediated hydrogen bond networks and the expansion of the structures compared with the static starting structures. Third, we found no significant change in the preferences for hydrophobic residues, suggesting that even in dynamic ensembles, the strength of the hydrophobic effect is similar to the effect in experimentally derived structures. The residue interactions derived here from a broad sample of dynamic native states provide a more complete view of the intermolecular interactions in proteins in solution and how they differ from those in static, averaged structures from experimental sources.
Native-state dynamics and implications for protein function
The movement of secondary structural elements can be important for protein function. For example, the breathing motion of the helices in myoglobin allows oxygen to diffuse through the protein to bind heme (Brunori, 2000
). Movement of helices is also important in ion channel opening and closing (Spencer and Rees, 2002
). Then, the fundamental questions are whether such motions are apparent in other proteins and whether they are important to function. The Dynameomics database [accompanying paper, Simms et al., 2008
)] provides us with a framework to investigate this question through the measurement of the angles of pairs of helices. In mining the database, we found that unphosphorylated CheY showed heightened dynamics of
2 and
3 causing them to move away from the rest of the protein, opening a cleft between helices
3 and
4 (Fig. 5). The movement of
3 leads to increased exposure of Asp 57, which oscillates with the helical motion. In contrast,
4 and
5 are stable over the course of the simulation.
To better understand the possible biological consequences of the heightened dynamics of
2 and
3 and the stability of
4 and
5 on its function in E. coli, we consider the chemotaxis cascade. CheY interacts with three proteins: chemotaxis protein A (CheA), chemotaxis protein Z (CheZ) and flagella motor switch protein M (FliM) (Fig. 5C) (McEvoy et al., 1998
; Lee et al., 2001
; Zhao et al., 2002
). In the absence of attractants, CheA is autophosphorylated and transfers the phosphoryl group to CheY at Asp 57. Phosphorylated CheY dissociates from CheA and binds FliM, a component of the flagella motor complex, causing the flagella to rotate clockwise and the cell to tumble randomly. Increasing amounts of attractants inhibit CheAs autophosphorylation, decreasing the cellular concentration of phosphorylated CheY. The CheY signal is terminated by the phosphatase CheZ. Unphosphorylated CheY does not bind the flagella complex, allowing the flagella to rotate counterclockwise and the cell to swim in a smooth, directed fashion towards increasing concentrations of attractants (Wadhams and Armitage, 2004
).
Structures of CheY in complex with CheA, CheZ and FliM have revealed a common binding site on the
4-β5-
5 face of CheY (McEvoy et al., 1998
; Lee et al., 2001
; Zhao et al., 2002
). The movements of
2 and
3 do not disrupt this binding site and the distance between
4 and
5 remains stable over the course of the simulation. We propose that the movements of
2 and
3 may serve as an entropy sink to help stabilize the protein and compensate for the relatively immobile
4 and
5, which serve as a scaffold for protein–protein interactions. Furthermore, the motion of
3 leads to heightened, but transient, exposure of the phosphorylation site: Asp 57.
Interestingly, previous MD simulations of another member of this fold, catechol O-methlytransferase (COMT), have shown movement of helices
6,
7 and
8 of COMT (Rutherford et al., 2006
), which correspond to
2,
3 and
4, respectively, of CheY (Fig. 7A), which suggests that motion in this region may be a conserved feature of this fold. The soluble form of COMT has 221 residues with a central β-sheet surrounded by 8
-helices (Fig. 7A). At position 108, the human COMT sequence has a common single nucleotide polymorphism substituting Val (108V) for Met (108M). This polymorphism is associated with a decrease in enzymatic activity (Spielman and Weinshilboum, 1981
; Boudikova et al., 1990
; Chen et al., 2004
) and an increased risk for diseases such as breast cancer (Goodman et al., 2002
) and obsessive-compulsive disorder (Karayiorgou et al., 1997
). The differences in activities between 108M and 108V COMT have been attributed to the lower thermostability of 108M COMT (Lotta et al., 1995
). However, the addition of S-adenosylmethionine (SAM) rescues its activity. The polymorphic site is located in a loop between
5 and β3, approximately 16 Å from the active site.
|
MD simulations of human COMT models revealed that there is heightened motion of
6 and
7 of COMT (corresponding to
2 and
3 in CheY) and that this motion becomes dramatic in the case of the 108M polymorph (Rutherford et al., 2006
3 and
4 of CheY (Fig. 5B). The simulations also showed that residue 108 had a greater variability of interactions within the polymorphic site and was more solvent exposed in 108M than in 108V COMT. The 108M COMT simulations sampled a broader distribution of structures with a larger average C
-RMSD to the starting structure than 108V COMT simulations at physiological temperature (Rutherford et al., 2006
6 in 108M COMT had a larger effect on the solvent accessibility of the SAM binding site than in 108V (Fig. 7B). Helix
6 moved away from the core of the protein in 108M simulations, increasing the angle between helices
6 and
7 by approximately 30° (Fig. 7C). The angle between helices
7 and
8 also changed by approximately 20° (Fig. 7C). Interestingly, the green helix, as depicted in Fig. 7A, went to the right in COMT and left in CheY. We hypothesize that a larger proportion of 108M COMT exists in inactive conformations at physiological temperature (Rutherford et al., 2006
MD simulations were also performed in the presence of SAM for both 108V and 108M COMT, which decreased the C
RMSF about the mean structure for
3 to
6. The presence of SAM in the simulations minimized the helix motions and stabilized the SAM binding site, thereby increasing the population of the catalytically productive conformer. Therefore, MD reveals conserved dynamical behavior within a fold family, reinforcing the proposal that a single representative of a fold can describe the dominant, general features of the dynamics of other members of the family, although this may not always be the case. In CheY, the dynamic behavior appears to increase the entropy and offset the cost of maintaining a scaffold for binding, as well as to facilitate or regulate phosphorylation. However, this is a fine balance and the same motion in COMT, in concert with a mutation, leads to dramatic effects on stability and activity. Knowledge of such deleterious dynamics may be useful for rescuing proteins, as illustrated through the protective effect of SAM on COMT.
| Conclusions |
|---|
|
|
|---|
The Dynameomics project has generated the largest collection of protein MD simulations to date. We are continuing to expand the dataset by simulation of new fold space representatives in their native state and along their unfolding pathways. The top 30 targets of the Dynameomics database are available via our web site: www.dynameomics.org. The database can be mined for various features of native-state dynamics, such as we have demonstrated in this work by our identification of helix motions. The results from the data mining show how heightened dynamics relevant to function and stability in one protein (CheY) are implicated in disease-related mutations for a fold-space neighbor: COMT. We expect such patterns of knowledge-based discovery to continue as more data mining is conducted on this database.
| Funding |
|---|
|
|
|---|
In the early phase of this project, seed financial support was provided by the Biomedical and Health Informatics Program at the University of Washington and the eScience program of Microsoft Research. We are also grateful for the financial support provided by Technical Computing @ Microsoft; in particular, we thank Dr Tony Hey and Dan Fay. R.D.S. and A.L.J. were supported by the NIH Molecular Biophysics Training Grant (National Research Service Award 5 T32 GM 08 268-17) at the beginning of this study.
| Footnotes |
|---|
4 Current address: Structural Bioinformatics and Computational Biochemistry Unit, University of Oxford, Oxford OX1 3QU, UK
5 Current address: Department of Physics, Applied Physics, and Astronomy, Renssalaer Polytechnic Institute, Troy, NY 12181, USA ![]()
Edited by Alan Fersht
| Acknowledgements |
|---|
|
|
|---|
Most simulations in this study were performed using Department of Energy grants of processor time at the National Energy Research Supercomputing Center (NERSC), initially through an Innovative and Novel Computational Impact on Theory and Experiment (INCITE) award and later through the Office of Biological and Environmental Research (OBER); we are grateful for the supp


3B/8
2 where B =B-factor of the C



