PEDS Advance Access originally published online on April 14, 2008
Protein Engineering Design and Selection 2008 21(6):369-377; doi:10.1093/protein/gzn012
Dynameomics: design of a computational lab workflow and scientific data repository for protein simulations
1Biomedical and Health Informatics Program, University of Washington, Seattle, WA 98195, USA 2Department of Bioengineering, University of Washington, Box 355013, Seattle, WA 98195-5013, USA
3 To whom correspondence should be addressed. E-mail: daggett{at}u.washington.edu
| Abstract |
|---|
Dynameomics is a project to investigate and catalog the native-state dynamics and thermal unfolding pathways of representatives of all protein folds using solvated molecular dynamics simulations, as described in the preceding paper. Here we introduce the design of the molecular dynamics data warehouse, a scalable, reliable repository that houses simulation data and vastly simplifies their management and access. In the succeeding paper, we describe the development of a complementary multidimensional database. A single protein unfolding or native-state simulation can take weeks to months to complete, and produces gigabytes of coordinate and analysis data. Mining information from over 3000 completed simulations is complicated and time-consuming. Even the simplest queries involve writing intricate programs that must be built from low-level file system access primitives and include significant logic to correctly locate and parse data of interest. As a result, programs to answer questions that require data from hundreds of simulations are very difficult to write. Thus, organization of and access to simulation data have been major obstacles to the discovery of new knowledge in the Dynameomics project. This repository is used internally and is the foundation of the Dynameomics portal site http://www.dynameomics.org. By organizing simulation data into a scalable, manageable and accessible form, we can begin to address substantial questions that move us closer to solving biomedical and bioengineering problems.
Keywords: data warehouse/database/Dynameomics/OLAP/protein dynamics
| Introduction |
|---|
Our fundamental mission is to study the dynamic nature of proteins and how sequence, structure and motion dictate fold and function. Over the past 15 years, we have developed a variety of methods and software tools to simulate the conformational behavior of proteins using molecular dynamics (MD) simulations. These methods have been applied in many projects, producing a significant volume of simulation data. The management of these data and the need to provide easy access for analysis, especially cross-simulation analysis, were the major motivations for developing a large-scale, reliable and manageable repository.
In particular, our Dynameomics project (Beck et al., 2008a) was chosen to drive the initial warehouse design because of the large number of simulations involved, the significant data management overhead and the requirement for analysis across many protein simulations. Over 3000 simulations representing more than 400 protein targets have been completed so far, constituting the largest collection of protein simulations in the world, with over 10³ times more structures than the Protein Data Bank (PDB) (Berman et al., 2002, 2003). The lab continues to simulate additional targets at a rate of at least 100 per year with typically six to eight simulations per target. This approach has already produced >52 terabytes (TB) of simulation data stored in flat files, and we estimate the generation of at least an additional 15 TB of data every year.
A protein's function is dictated by its three-dimensional structure and dynamic behavior. The high-resolution study of protein dynamics is important for understanding function, binding and recognition. Protein flexibility is important in rational drug design, and the absence of realistic dynamics is a weakness of current methods. MD simulations can be used to characterize protein dynamics in solution at atomic resolution. These simulations track the three-dimensional coordinates of all the atoms of one or more proteins and solvent, typically over periods of nanoseconds to microseconds. An average protein simulation from Dynameomics contains >100 residues comprising >1600 atoms, not including solvent. The choice of proteins simulated for this project is guided by population rank, which is an ordered list based on the frequency with which structural units, or folds, appear in known protein structures (Day et al., 2003). The goal of the project is to simulate a representative protein (called a target) of each non-redundant protein fold.
The protein-folding problem is one of the most important unsolved questions in molecular biology: how does a sequence of amino acids self-assemble into its three-dimensional, biologically active structure? It has been over 15 years since the method of performing unfolding simulations to probe the folding process was introduced (Daggett and Levitt, 1992). Since that time, we have expended considerable effort validating our methods and software by comparison against experiment. High temperature induces unfolding on timescales accessible by simulation and reveals important transition, intermediate and denatured state ensembles that can be experimentally verified. Agreement between simulation and experiment has been demonstrated for a number of proteins (Daggett et al., 1996; Daggett et al., 1998; Kazmirski and Daggett, 1998; Daggett and Fersht, 2003; Mayor et al., 2003; Ferguson et al., 2005; Day and Daggett, 2007). Unfolding pathways are equivalent to folding pathways under the same conditions, in agreement with the principle of microscopic reversibility (Day and Daggett, 2007; McCully et al., 2008). Consequently, for the selected targets, high-temperature unfolding simulations are also performed. Thus, Dynameomics presents a unique opportunity to investigate the dynamics of the native state as well as to search for the rules and patterns that govern protein folding. Organizing the simulation data into a data warehouse opens the door to questions that require synthesis of data from the entire set of simulated proteins.
We employ a process known as target selection and preparation to identify specific proteins of interest and importance. This process is described graphically in Fig. 1. After selection, simulations are prepared for execution at a supercomputer site or on local high performance computing hardware. All simulation data are produced by an in-house developed molecular modeling program called in lucem molecular mechanics (or ilmm) (Beck et al., 2000–2008; Beck and Daggett, 2004). Depending on the complexity of the protein system being simulated and the range of time needed, simulations can take from a few days to months to complete. The primary output of an ilmm simulation is a set of binary compressed files called Molecular Dynamics Compressed (MDC) files. The MDC files primarily contain three-dimensional coordinates for every atom in the system over time, i.e. both the protein or proteins and solvent (typically water). Coordinates are saved every 0.2–1 picosecond (ps). Simulations for the Dynameomics project are at least 21 nanoseconds (ns) long, which yields at least 21 000 (1 ps interval) to 105 000 (0.2 ps interval) sets of 3D coordinates for each atom in the system. The number of protein atoms in the current set of simulations ranges from 554 to 6584; the waters add another 7000–60 000 atoms. The protocol for the Dynameomics project calls for each system to be simulated at two primary temperatures (298 and 498 K). Three longer simulations are saved at 1 ps resolution: one 21 ns simulation at 298 K and two 31 ns simulations at 498 K. In addition, at least three shorter 2 ns simulations at 498 K are saved at 0.2 ps resolution.
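The frame counts quoted above follow directly from dividing the simulated time by the save interval. The short Python sketch below reproduces that arithmetic; the function name `frames_saved` is ours and is not part of ilmm or the warehouse code.

```python
from fractions import Fraction


def frames_saved(duration_ns, interval_ps):
    """Number of coordinate sets written for one simulation.

    duration_ns: simulated time in nanoseconds.
    interval_ps: time between saved frames in picoseconds.
    Fractions avoid floating-point rounding for intervals such as 0.2 ps.
    """
    duration_ps = Fraction(str(duration_ns)) * 1000  # 1 ns = 1000 ps
    return int(duration_ps / Fraction(str(interval_ps)))


coarse = frames_saved(21, 1)       # 21 ns saved every 1 ps
fine = frames_saved(21, 0.2)       # 21 ns saved every 0.2 ps
short_run = frames_saved(2, 0.2)   # 2 ns saved every 0.2 ps
print(coarse, fine, short_run)
```

Multiplying any of these counts by the number of atoms in the system (protein plus solvent) gives the number of coordinate rows a single simulation contributes.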
After simulations are complete, we perform a series of analyses on raw coordinate data and archive the results to a shared file system. Although the analysis data are typically smaller than the base coordinates, they still represent a significant storage challenge. Table I shows sizes for coordinate and analysis data in various formats.
Our approach to managing all these data was to build a data warehouse consisting of a relational database to serve as the store-of-record and to utilize both the relational database and a Multidimensional On-line Analytical Processing (MOLAP) database for analysis (see the following paper by Kehl et al., 2008).
We employed standard software development methods, including relational database modeling and multidimensional modeling, to design a novel hybrid database to accommodate the varied access to our large, complex data sets. This approach should be generally applicable in other areas of bioinformatics as well as in other disciplines experiencing a data explosion. We used a commercially available SQL and data warehousing suite, Microsoft SQL Server 2005 Enterprise Edition x64 (Microsoft, 2005) (SQL, Integration Services, and Analysis Services), as the software platform. Our primary result is a repository that can accommodate all of our current data and will scale to meet the demands of new simulation data as they are produced. Most importantly, the data warehouse will be used as the primary data access method by members of our lab and the scientific community.
| Materials and methods |
|---|
The data warehouse is implemented in a heterogeneous operating system environment running on a variety of networked 32-bit and 64-bit AMD Opteron and Intel-based computers. Database servers run Windows 2003 Server Enterprise Edition x64 (Microsoft); file servers run Red Hat Fedora Core Linux (Red Hat). All servers are accessed from both Windows XP (Microsoft) and Red Hat Linux clients. Database servers run SQL Server 2005 Enterprise x64 Edition (Microsoft, 2005) and utilize components from SQL, Integration Services and Analysis Services.
The prep database, short for target selection and preparation database, is a small transactional SQL database (under 250 MB) that tracks the workflow decisions related to picking targets as well as abstracting key entities from external data sources. A conceptual model for the database is shown in Fig. 2. The key tables are Fold, Target and Simulation.
The Fold table captures information about the set of known protein domains and provides a layer of abstraction between the three primary fold classification systems: SCOP (Murzin et al., 1995
The directory and simulation databases
The directory and simulation databases form the core of the data warehouse. The directory database serves two primary functions: it stores the location of every simulation (server and database name) and it is the master source of identifiers for entities across the entire warehouse. The table structure is relatively simple and the primary entities are illustrated in Fig. 3. With 2023 simulations loaded, the total size including data and log files is under 5 GB.
The simulation database houses individual simulations and associated analyses. The schema follows a relational design pattern typical of multi-dimensional modeling for data warehouse applications. Although originally implemented only for extraction, transformation and load (ETL) operations and building OLAP cubes, the SQL representation has also proved valuable for queries. Coordinate and analysis data from simulations are bulk loaded into this schema and constraints are applied to ensure data integrity. A snapshot representing a fraction of the coordinate data and analysis tables is shown in Fig. 4. We load 32 core analyses (Beck et al., 2008a) on currently generated data, although it will be easy to expand this model to accommodate additional data.
Although there is no software limit for the size of a table and no practical limit on database size (1 048 516 TB (Microsoft, 2005)), we have elected to limit the size of databases and tables for several administrative reasons. First, this allows us to take advantage of parallel loading across multiple servers. By loading simulations into separate tables, we can apply constraints and indices early in the load process yet allow new data to use heap structured tables that are optimized for bulk loading.
In addition to using multiple servers, we have also chosen to implement multiple databases per server. This approach allows us to control the number of tables (currently we store approximately 100 simulations per database) as well as the total size of each database. The simulation databases are currently distributed across two servers but are designed to be implemented cooperatively across many machines. Identity assignment functions are relegated to the Directory database so that all cooperating servers use compatible identifiers. The size of each simulation database is directly related to the size of the simulation and analysis data it contains. At the time of preparation of this article, the warehouse consists of 22 databases each allocated to 300 GB (2035 individual simulations covering 276 proteins). The warehouse has been constructed in this way to ensure scalability as we continue to perform simulations.
SQL Views are virtual tables that are constructed using queries. The import code automatically builds views at the database level. These views contain all the data in each individual simulation or analysis table as well as the server and database name in a single easy to query structure. Database level views are aggregated across the entire data warehouse in the Directory database. The linked server facility is used to implement a federated structure that spans servers and databases. The hierarchy of aggregation is presented in Fig. 5.
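The view construction described above can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so it shows only the general pattern (one table per simulation, combined by a UNION ALL view that also carries the server and database name) and not the linked-server federation. All table, column, server and database names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical per-simulation coordinate tables; names are illustrative only.
sim_ids = [1, 2, 3]
for sim_id in sim_ids:
    cur.execute(
        f"CREATE TABLE coord_{sim_id} ("
        "sim_id INTEGER NOT NULL, step INTEGER NOT NULL, "
        "atom INTEGER NOT NULL, x REAL, y REAL, z REAL)"
    )
    cur.executemany(
        f"INSERT INTO coord_{sim_id} VALUES (?, ?, ?, ?, ?, ?)",
        [(sim_id, step, 1, 0.0, 0.0, float(step)) for step in range(4)],
    )

# Database-level view: every per-simulation table, tagged with its location.
selects = [
    f"SELECT 'server_a' AS server_name, 'simdb_01' AS db_name, * "
    f"FROM coord_{sim_id}"
    for sim_id in sim_ids
]
cur.execute("CREATE VIEW coord_v AS " + " UNION ALL ".join(selects))

# The view is queried exactly like a single table.
total = cur.execute("SELECT COUNT(*) FROM coord_v").fetchone()[0]
one_sim = cur.execute(
    "SELECT COUNT(*) FROM coord_v WHERE sim_id = 2"
).fetchone()[0]
print(total, one_sim)
```

Because the view text is generated from the list of tables, an import script can regenerate it whenever a new simulation table is added, which is the role the import code plays in the warehouse.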
| Results |
|---|
We have developed a repository that accommodates over 25 TB of MD simulation data and metadata, and we have captured the decisions and workflow of the researchers producing these data. The sets of simulations and proteins currently loaded into the repository are described in Tables II and III, respectively. We anticipate that the current model will scale beyond 100 TB based on a two-tiered view hierarchy (100 simulations per database view, up to 256 databases per top level view). The repository consists of four major components: (i) the Prep database, a transactional database that models a view of protein fold space, a set of potential targets that are representative of folds of interest, and a set of simulations, which consist of one or more targets that are actually simulated; (ii) the directory database, which handles identifier assignment and maps simulation data to specific servers; (iii) the simulation database, a set of relational databases that hold coordinate and analysis data from simulations and implement data integrity rules; and (iv) a multi-dimensional OLAP database that removes the structural complexity of the relational store and presents a single easy-to-use view of all coordinate and analysis data. These components now play pivotal roles in supporting the overall lab data workflow, as shown in Fig. 6. They are also the foundation for lab software such as DBAnal, an interactive visualization tool (Fig. 7). DBAnal is a web-embeddable Java applet that lets users interactively browse graphical views of analyses and pass data directly to a variety of commercial software packages. Clicking on an analysis graph in DBAnal displays the structure at the selected time point in a Jmol (Jmol) viewer (Fig. 7). This structure can also be saved locally for further analysis.
We also have a number of applications in development. Adapting and creating visualization tools is a high priority for the coming year. One previously developed visualization tool written in Java was quickly adapted to access the database via JDBC and SQL; support for MDX queries using linked servers and OPENQUERYSET is in progress. A queuing system application has been developed to aid in the scheduling of local simulation jobs (Beck et al., 2007–2008). The queuing system updates simulation status in the prep database in real time and drives automatic import of data into the repository.
The repository runs on multiple servers and supports data access from a variety of client operating systems and tools. The current server operating system and database platforms are Windows 2003 Server Enterprise Edition x64 and SQL Server 2005 Enterprise Edition x64, respectively. The breakdown of source code is shown in Table IV.
Interfacing with other tools
To facilitate in-depth analysis and mining of the Dynameomics data set, we chose to adopt the software package Mathematica (Wolfram Research Inc., 2005). Mathematica is a mathematical processing and visualization tool with inherent support for JDBC, a Java protocol that can communicate with any data source that has a JDBC-compliant driver (Microsoft publishes a JDBC driver for use with SQL Server 2005). Using Mathematica, one can connect to SQL Server and issue SQL commands directly or use various abstraction operators to formulate queries. The results from either approach are returned as Mathematica lists. These lists can then be processed, exported or plotted in a variety of formats; Mathematica includes many sophisticated graphical tools that encompass virtually all of the visualization needs for our coordinate and analysis data.
Although the Mathematica front-end gives a nearly complete SQL interface, we initially encountered many problems with the JDBC libraries and Mathematica's Java link, which, by default, are not optimized for the quantity of data that we regularly deal with. Attempting to download all the atomic coordinates from an entire trajectory, for example, initially always caused the JDBC driver to run out of memory and fail. Two methods dealt with this efficiently. The first was to create special Mathematica functions that download trajectories and other large data sets in small chunks and then link them together at the end. This approach has a small performance cost, but the effect is minimal for our purposes. Alternatively, one can instruct the Java virtual machine (via Mathematica's interface) to allocate more memory. This approach has a smaller performance cost but may become problematic as the quantity of data needed for a single query grows (i.e. as the trajectories in the database become longer).
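The first method, downloading in small chunks and joining the pieces afterwards, is a general pattern that is not specific to Mathematica or JDBC. The sketch below illustrates it in Python against an in-memory SQLite table; it is not the lab's Mathematica code, and the table layout and the helper name `fetch_in_chunks` are assumptions made for the example.

```python
import sqlite3


def fetch_in_chunks(conn, query, params=(), chunk_size=1000):
    """Yield lists of rows, each holding at most chunk_size rows."""
    cur = conn.cursor()
    cur.execute(query, params)
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Stand-in for a coordinate table: 50 frames of 10 atoms each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coord (step INTEGER, atom INTEGER, x REAL, y REAL, z REAL)"
)
conn.executemany(
    "INSERT INTO coord VALUES (?, ?, ?, ?, ?)",
    [(step, atom, 0.1 * atom, 0.2 * atom, 0.3 * step)
     for step in range(50) for atom in range(10)],
)

# Pull the trajectory a piece at a time, then link the pieces together.
trajectory = []
chunk_sizes = []
query = "SELECT step, atom, x, y, z FROM coord ORDER BY step, atom"
for chunk in fetch_in_chunks(conn, query, chunk_size=128):
    chunk_sizes.append(len(chunk))
    trajectory.extend(chunk)

print(len(trajectory), chunk_sizes)
```

The client never holds more than one chunk of driver-side results at a time, which trades a few extra round trips for a bounded memory footprint.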
In addition to downloading the results of SQL queries, Mathematica has the ability to create tables and insert data back into the database. Data insertion is considerably slower than data retrieval and suffers from the same problems concerning Mathematica's Java link, but similar techniques were used to overcome these difficulties as well.
With additions to the standard Mathematica function library, the link to the Dynameomics database is almost transparent and, on our local network, it operates faster than reading the data from files. For example, we compared mining the data from the file system using Perl scripts with equivalent SQL queries. A representative script used in the lab to generate data for a Ramachandran plot (a visualization of dihedral angles) for 188 simulations consists of 246 lines of code and limited comments. About 15% of this code is devoted to looping over directories and identifying files of interest using regular expressions. Another 30% iterates over the identified files and calls subroutines. The remainder consists of header comments and usage documentation (10%), two subroutines (30%), and argument processing and declarations (15%). The execution time is 3–4 h. In comparison, the SQL code to generate the same result is 26 lines long and runs across the entire simulation set in just over an hour.
We also compared complex analyses performed in Mathematica against the database with equivalent analyses performed by C programs. For example, we wrote a C program that loads the coordinates for the Cα atoms in a protein at every picosecond, calculates the Fourier transform of each Cα atom along its first principal axis and saves the results to files. This program consisted of 90–100 lines of code (not counting external libraries such as those that read the flat files), depending on how the program was spaced and commented. An equivalent program in Mathematica ran in comparable time and required only 3–10 lines of code. Notably, the C program required approximately 3 h to write for an expert C programmer, while the Mathematica program required only 10 min to write for a novice Mathematica programmer.
SQL views for accessing data across the entire data set
Designing a SELECT query to gather rows from all coordinate simulation tables would involve a complex JOIN clause referencing over 2000 tables. The query would be further complicated by distribution of databases across multiple servers, and it would require revision whenever new coordinate simulation tables are created. As mentioned previously, SQL views are constructed during the import of data to abstract this complexity. We originally believed that these automatically generated views would only be useful for dumping the entire contents of a property (e.g. when populating an OLAP data cube) but would not be efficient for queries because of our distributed implementation. This belief turned out to be incorrect.
A key finding is that views are efficient for general purpose queries when the participating tables are configured with check constraints. Check constraints are declarative rules that can enforce restrictions on the data columns of a table (domain integrity checking). The rules are written as expressions that evaluate as true or false based on the value of a column (e.g. struct_id=1). If a column in a row about to be inserted would violate the expression, the row would be rejected. However, the real utility of check constraints is for the query optimizer, especially in distributed queries. As an example, a SELECT COUNT(*) FROM dbo.Master_Coord_v statement will count the rows in all the coordinate tables in the warehouse and return the total number of rows. This query by definition must hit every row of every coordinate table and in one case took over an hour to count in excess of 2 billion rows. However, if the count from just a few simulations of interest is needed, the optimizer can utilize the constraints and automatically exclude tables that would not contribute any results to the query based on the constraints (a similar query limited to three simulations but still run against the same view returned a total of 242 360 000 rows in under 1 min). Without constraints, the query would be forced to read all participating tables.
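The integrity-checking side of a check constraint can be shown with a minimal sketch. It again uses Python's sqlite3 module rather than SQL Server, with a hypothetical table named `coord_1` whose `struct_id` column must equal 1, mirroring the expression quoted above. The sketch demonstrates only that a violating row is rejected; it does not reproduce the SQL Server optimizer behaviour (skipping tables whose constraints rule them out), which is the performance benefit described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical per-simulation table: only rows for struct_id = 1 are allowed.
conn.execute(
    "CREATE TABLE coord_1 ("
    "struct_id INTEGER NOT NULL CHECK (struct_id = 1), "
    "step INTEGER NOT NULL, x REAL, y REAL, z REAL)"
)

# A row that satisfies the constraint is accepted.
conn.execute("INSERT INTO coord_1 VALUES (1, 0, 0.0, 0.0, 0.0)")

# A row that violates the constraint is rejected by the database engine.
rejected = False
try:
    conn.execute("INSERT INTO coord_1 VALUES (2, 0, 0.0, 0.0, 0.0)")
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM coord_1").fetchone()[0]
print(rejected, row_count)
```

Because each table declares which identifier it can contain, a query engine that understands the constraints can tell from the declaration alone whether a table could contribute rows to a filtered query.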
In addition to exploring large sets of data, we are using views to produce and share smaller, ad hoc result sets. Views are assembled from arbitrary queries that can involve tables, views or combinations of both. The result becomes a named object on the server, with the same access paradigm as a table. Similar to tables, views persist until they are explicitly dropped, and access can be restricted using SQL permissions. However, unlike tables, simple views do not incur storage overhead and stay up-to-date as data are updated or added to the underlying data structures. We used this approach in the study of relative motions of helices, which is described in Beck et al. (2008a). We are investigating the possibility of utilizing materialized views (which allow indexing) for future applications.
Although the utility of managing data in a relational database was never in doubt, we anticipated that, after all data stored in flat files were loaded, researchers in the lab would primarily access the data using Multidimensional Expressions (MDX) queries against one or more OLAP cubes. However, a key finding is that it is often useful to code some queries in SQL and others in MDX and, in some cases, it is helpful to pipeline the results of multiple SQL and MDX queries to retrieve data of interest. Although we plan to do more performance investigation in the future, our experience so far suggests that MDX is faster at filtering results based on fixed criteria. In other words, assembling a dataset of interest may be most efficient from MDX. This topic is discussed in greater detail in the accompanying paper (Kehl et al., 2008).
Declarative and procedural programming
Many calculations are easier to code using a procedural language such as C. This runs counter to the declarative access model supported by relational databases, a difficulty referred to as the impedance mismatch problem (Elmasri and Navathe, 2000). However, SQL Server 2005 supports the execution of procedural code in a SQL context using a feature known as common language runtime (CLR) support. This facility allows programmers to build stored procedures and table-valued functions using a procedural language of their choice, e.g. C#. These procedures and functions can be composed into more complex programs using SQL or C#, or simply used interactively.
| Discussion |
|---|
A simple transactional SQL database proved to be an ideal tool to organize the rich decision process used to select simulation targets for our Dynameomics project. The design process initially exposed inconsistencies and opportunities for improvement in the target selection and preparation process. As requirements were gathered, these inconsistencies were eliminated by capturing fundamental entity relationships in a Unified Modeling Language (UML) static structure diagram and documenting workflow with activity diagrams. As a result, the previously chaotic selection and archival process (lab books, post-its, readme files) was rigorously defined, improved and systematically logged. Implementing on top of a multi-user transactional database allowed lab members to work independently without fear of overwriting information in shared files.
Moving simulation data out of the file system and into a relational database greatly reduced the complexity of managing simulation and analysis data. As data were loaded for simulations, it became possible to easily check the data for issues using queries (e.g. gaps in time steps from premature stops). Declarative integrity and check constraints were implemented to prevent the loading of any inconsistent data in the future. Constraints have also helped to eliminate a problem of lost structures in simulations. Such lost structures would occur infrequently, but silently, when a disk, file system or program failure corrupted one or more coordinate files (we have over 250 disks). Data access became possible through a greater variety of tools and interfaces, and complex logic to navigate directory structures was no longer needed. However, as expected, the relational implementation for the dimensional model is too complex to query using SQL. For this reason, we developed distributed views in SQL and an OLAP database to simplify queries that span sets of simulations.
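As an example of the kind of consistency check mentioned above, the sketch below finds gaps in the saved time steps of each simulation with a single SQL query. It runs against an in-memory SQLite database from Python; the `frame` table, its columns and the assumption that consecutive frames differ by exactly one step are all hypothetical choices for illustration, not the warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frame (sim_id INTEGER, step INTEGER)")

# Simulation 1 is missing steps 4 and 5; simulation 2 is complete.
conn.executemany(
    "INSERT INTO frame VALUES (1, ?)",
    [(s,) for s in range(10) if s not in (4, 5)],
)
conn.executemany("INSERT INTO frame VALUES (2, ?)", [(s,) for s in range(10)])

# For every frame, find the next saved frame of the same simulation and
# report the pair when more than one step separates them.
gap_query = """
SELECT a.sim_id, a.step AS last_before_gap, MIN(b.step) AS first_after_gap
FROM frame AS a
JOIN frame AS b ON b.sim_id = a.sim_id AND b.step > a.step
GROUP BY a.sim_id, a.step
HAVING MIN(b.step) > a.step + 1
ORDER BY a.sim_id, a.step
"""
gaps = conn.execute(gap_query).fetchall()
print(gaps)
```

The self-join is written for clarity rather than speed; on tables the size of those described here, an indexed or window-function formulation would be the more practical choice.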
While the data warehouse approach has many benefits, it has also introduced many challenges. The first of these challenges was getting all the existing data into the warehouse. It took 4 months to load the coordinate data after arriving at a reasonably stable schema. Consequently, as we continue to load new trajectories and the dataset grows, we must plan schema revisions carefully. The sheer volume of the data dictates the use of very large disk arrays, which are expensive even when implemented using low-cost SATA disk technology. Database servers also require significantly more memory and CPU power than file servers. The SQL representation of simulation data is also larger than our binary compressed MDC format used previously. This effectively reduces the volume of data that can be stored per server. We also intend to implement clustering for load balancing and redundancy, which will effectively double our current set of servers and require pairs of servers to be installed as we scale out in the future.
Our choice of database software has also presented challenges. SQL Server 2005 is a relatively new product release (November 2005) and we have encountered and reported many bugs. The volumes of data we load effectively require us to use partitioning, a feature only available in the highest cost Enterprise Edition of Analysis Services 2005. We have also encountered a lack of detailed technical documentation for Analysis Services.
In summary, the decision to implement a data warehouse and migrate away from accessing simulation data through a shared file system has produced fundamental changes in the lab. First, it has provided a framework for documenting and streamlining manual group tasks. This has greatly reduced the need to maintain handwritten lists and obscure copies of files, and has eliminated many of the one-shot query programs previously used to extract information from the file system archive. Second, it has created a consistent abstraction layer between the physical layout of the data and access to the data. This has simplified access programs, which no longer need to code around different versions of directory structures and file formats. Third, the use of integrity constraints and check constraints in the load and prep databases has identified inconsistent data and prevented them from tainting the repository. We have designed a front-end interface to enable access to simulation data and analyses via the World Wide Web, and this first deployment contains data and metadata for the top 30 targets (http://www.dynameomics.org). With simulation data organized in a manageable and accessible form, we are now able to investigate classes of questions that are practically impossible to answer using flat files. The answers to these large-scale questions should eventually help to solve the protein folding problem and other biomedical and bioengineering problems.
| Funding |
|---|
A.M.S. and N.C.B. are supported by a National Library of Medicine fellowship (NIH grant 3 T15 LM007442-04S1). The Washington Research Foundation provided funds for the website. The MD trajectories contained in the data warehouse were produced using resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
| Footnotes |
|---|
Board Member: Jane Clarke
Edited by Alan Fersht
| Acknowledgements |
|---|
We are also grateful for financial support from the eScience initiative through Microsoft Research and more recently through Technical Computing @ Microsoft (www.microsoft.com/science). We thank Stuart Ozer of Microsoft for providing design feedback and general assistance with SQL Server 2005.
| References |
|---|
Beck D.A., Daggett V. Methods (2004) 34:112–120.
Beck D.A.C., Alonso D.O.V., Daggett V. in lucem molecular mechanics. Seattle, WA: University of Washington. (2000–2008).
Beck D.A.C., Scouras A., Daggett V. iq queuing system. Seattle, WA: University of Washington. (2007–2008).
Beck D.A.C., Jonsson A.L., Schaeffer R.D., Scott K.A., Day R., Alonso D.O.V. Protein Eng. Des. Sel. (2008) a 21:353–368.
Beck D.A.C., Alonso D.O.V., Inoyama D., Daggett V. Proc. Natl Acad. Sci. USA (2008) b. (in press).
Benson N.C., Daggett V. Proc. Natl Acad. Sci. USA (2008) in press.
Berman H.M., et al. Acta Crystallogr. D. Biol. Crystallogr. (2002) 58:899–907.
Berman H., Henrick K., Nakamura H. Nat. Struct. Biol. (2003) 10:980.
Berrar D., Stahl F., Silva C., Rodrigues J.R., Brito R.M., Dubitzky W. J. Clin. Monit. Comput. (2005) 19:307–317.
Daggett V., Fersht A. Nat. Rev. Mol. Cell Biol. (2003) 4:497–502.
Daggett V., Levitt M. Proc. Natl Acad. Sci. USA (1992) 89:5142–5146.
Daggett V., Li A., Itzhaki L.S., Otzen D.E., Fersht A.R. J. Mol. Biol. (1996) 257:430–440.
Daggett V., Li A., Fersht A.R. J. Am. Chem. Soc. (1998) 120:12740–12754.
Day R., Daggett V. J. Mol. Biol. (2007) 366:677–686.
Day R., Beck D.A., Armen R.S., Daggett V. Protein Sci. (2003) 12:2150–2160.
Elmasri R., Navathe S.B. Fundamentals of Database Systems (2000) 3rd edn. Addison Wesley.
Essex J., Sansom M.S.P., Biggin P.C., Pikunic J., Mulholland A.J., Claeyssens F., Laughton C.A., Smith L., Sherwood P. (2008).
Ferguson N., Day R., Johnson C.M., Allen M.D., Daggett V., Fersht A.R. J. Mol. Biol. (2005) 347:855–870.
Holm L., Sander C. Nucleic Acids Res. (1997) 25:231–234.
Hubbard T.J., Murzin A.G., Brenner S.E., Chothia C. Nucleic Acids Res. (1997) 25:236–239.
Jmol. Jmol: an open-source Java viewer for chemical structures in 3D.
Kazmirski S.L., Daggett V. J. Mol. Biol. (1998) 277:487–506.
Kehl C., Simms A.M., Toofanny R.D., Daggett V. Protein Eng. Des. Sel. (2008) 21:379–386.
Kimball R., Reeves L., Ross M., Thornthwaite W. The Data warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses (1998) John Wiley & Sons, Inc.
Lee T.J., Pouliot Y., Wagner V., Gupta P., Stringer-Calvert D.W., Tenenbaum J.D., Karp P.D. BMC Bioinform. (2006) 7:170.
Mayor U., Guydosh N.R., Johnson C.M., Grossmann J.G., Sato S., Jas G.S., Freund S.M., Alonso D.O., Daggett V., Fersht A.R. Nature (2003) 421:863–867.
McCully M., Beck D.A.C., Daggett V. Biochemistry (2008) (in press).
Microsoft Corporation. Microsoft Windows 2003 Server Enterprise x64 Edition, Microsoft Corporation.
Microsoft Corporation. Microsoft Windows XP Professional, Microsoft Corporation.
Microsoft. (2005) SQL Server 2005 Enterprise x64 Edition, Microsoft Corporation.
Murzin A.G., Brenner S.E., Hubbard T., Chothia C. J. Mol. Biol. (1995) 247:536–540.
Orengo C.A., Michie A.D., Jones S., Jones D.T., Swindells M.B., Thornton J.M. Structure (1997) 5:1093–1108.
Red Hat Inc. Fedora Core, Red Hat, Inc.
Rueda M., Ferrer-Costa C., Meyer T., Perez A., Camps J., Hospital A., Gelpi J.L., Orozco M. Proc. Natl Acad. Sci. USA (2007) 104:796–801.
Scott K.A., Alonso D.O.V., Sato S., Fersht A.R., Daggett V. Proc. Natl Acad. Sci. USA (2007) 104:2661–2666.
Wolfram Research Inc. Mathematica (2005) Champaign, IL: Wolfram Research, Inc.
Received March 3, 2008; revised March 3, 2008; accepted March 4, 2008.