PEDS Advance Access originally published online on June 8, 2007
Protein Engineering Design and Selection 2007 20(7):347-351; doi:10.1093/protein/gzm027
Amino acid quantitative structure property relationship database: a web-based platform for quantitative investigations of amino acids
Departments of 1Biological Sciences and 2Computer Science and Electrical Engineering, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
3 To whom correspondence should be addressed. E-mail: freeland{at}umbc.edu
Abstract
Here, we present the AA-QSPR Db (Amino Acid Quantitative Structure Property Relationship Database): a novel, freely available web-resource of data pertaining to amino acids, both engineered and naturally occurring. In addition to presenting fundamental molecular descriptors of size, charge and hydrophobicity, it also includes online visualization tools for users to perform instant, interactive analyses of amino acid sub-sets in which they are interested. The database has been designed with extensible markup language technology to provide a flexible structure, suitable for future development. In addition to providing easy access for queries by external computers, it also offers a user-friendly web-based interface that facilitates human interactions (submission, storage and retrieval of amino acid data) and an associated e-forum that encourages users to question and discuss current and future database contents.
Keywords: amino acids/database/QSPR/XML
Introduction
Beyond protein synthesis, amino acids play many significant roles in biology: as intermediates of metabolic pathways, neurotransmitters, antibiotics, etc. Furthermore, amino acids that have never been synthesized in nature are now routinely incorporated into biological systems to aid scientists who investigate fundamental questions of biology and medicinal chemistry. Thus, quantitative investigations of the relationship between amino acids, natural and engineered, are not only critical for biological research and bioengineering (such as predicting the biological activity of natural and engineered peptides, e.g. Guan et al., 2005
This disparity of information between the 20 biologically encoded amino acids and all others has been in large part attributable to the costly and time-consuming traditional experimental approach required to measure amino acid biophysical properties (Haidacher et al., 1996). However, recent developments in computational chemistry offer us an alternative method to quickly and reliably estimate values for key amino acid properties (for example, freely accessible web software can predict van der Waals volume with an accuracy, measured as coefficient of determination between predicted and experimentally determined values, of 0.955; Lu and Freeland, 2006b).
Therefore, here we introduce a novel and freely accessible online extensible markup language (XML) database, the AA-QSPR Db. Currently, the database comprises a total of 388 amino acids: the 20 amino acids found in the standard genetic code; 177 that have been found in biological systems, acting as intermediates in main metabolic pathways, neurotransmitters (Venton et al., 2006) and antibiotics (Czajgucki et al., 2006), but have never been incorporated into the genetic code, for example, ornithine and sarcosine (Garrett and Grisham, 1999); 69 that are thought to be synthesized abiotically (Cronin and Pizzarello, 1986); 108 that are products of post-translational modification (Uy and Wold, 1977); and 58 that have been engineered by scientists (Summerer et al., 2006).
The abiotic amino acids are mainly identified from analyses of the Murchison meteorite (Cronin et al., 1981, 1985; Cronin and Pizzarello, 1986), or described as products of pre-biotic simulation experiments (Miller, 1986) and are particularly relevant to thinking in exobiology and the origin of life (Cronin and Pizzarello, 1997; Glavin and Bada, 2001). Engineered amino acids are largely drawn from recent progress in the research of the incorporation of nonnatural amino acids into proteins (Link et al., 2003; Hendrickson et al., 2004; Wang et al., 2006).
Our AA-QSPR Db comes with an associated toolkit that provides broad utility within the field of biochemical ontology: possible applications range from research into protein structure and amino acid bioactivity to synthetic biology and evolutionary analyses.
Technical description of the database
Contents and tools of the database
Amino acids considered
A major criterion we used to determine the biochemical relevance of an amino acid (and thus its incorporation into our database) is its ability to form a peptide bond with another amino acid. Thus, we did not include 2,5-diaminopyrrole, an amino acid derivative found in the Murchison meteorite (Meierhenrich et al., 2004), because it has no free carboxyl group and is thus of limited interest to research concerning biological macromolecules. However, we did not limit database contents to amino acids that possess an alpha-amino group and a free alpha-hydrogen. Although these are two features shared by all the 20 standard amino acids (Weber and Miller, 1981), they may not be logical constraints on biochemistry (Qiu et al., 2006), either in terms of primordial evolution or in terms of protein engineering.
Biophysical properties considered
Within our database, each amino acid is currently associated with quantitative estimates of three fundamental biophysical properties: size, charge and hydrophobicity. These three amino acid properties have long been regarded as major determinants of amino acids' bioactivity (Grantham, 1974; Biro, 2006), influencing not only the biochemical roles played by amino acids, but also the patterns of molecular evolution that occur and hence the expectations of bioinformatics algorithms such as alignment and phylogenetic reconstruction software (Tomii and Kanehisa, 1996).
Choice of database strategy
There are generally two types of database employed to store and organize data: one is the relational database and the other is the XML database. Relational database technology is the older of these, and is therefore more common in current life science databases (e.g. GenBank, see Benson et al., 2006; PDB, see Berman et al., 2000). As a mature technology, relational databases are associated with sophisticated query languages (SQL) and development software (e.g. Oracle or MS Access). However, they suffer from inflexibility, requiring that all types of data be predefined: once the database has been created, subsequent changes are difficult to achieve and are considered poor programming practice, as they can easily disrupt database stability.
The newer technology of XML is based on a data description language (XML) designed to facilitate cross-platform data exchange of complex, non-homogeneous data types using customized, self-explanatory tags so that both computers and humans can understand the semantics. Thus, XML databases handle irregularity of data well and are highly suited to development where not only the number of data items, but also the properties of these data items, are likely to change and expand as time proceeds (e.g. see the new Gene Expression Omnibus database from the NCBI, Barrett et al., 2005).
A major, ongoing characteristic of amino acid research concerns the diverse biological and chemical roles that each can play: it is likely that new roles and associated measurements will continue to emerge as this science grows. In this context, an explicit goal of our database is to create a foundational resource that can expand in depth and breadth over time, as new research highlights new biophysical properties or new molecules pertinent to amino acid research. Thus, XML is a natural choice for our work. Since our initial database is relatively small (388 entries), the slower speed that can come as the cost of XML flexibility is unnoticeable: orders of magnitude more information would be needed to render the speed of XML problematic, and history suggests that improvements in computing speed are likely to outpace any perceptible reductions in speed caused by significant database growth.
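The flexibility argument can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names used here (`aminoacid`, `descriptor`, `name`, `value`) and the numeric values are hypothetical placeholders, not the AA-QSPR Db schema; the sketch only shows that one document can gain a new property without any change to the others.

```python
import xml.etree.ElementTree as ET

# Two hypothetical entries. Tag names, attribute names and values are
# illustrative placeholders, not the real AA-QSPR Db schema or data.
ENTRY_A = "<aminoacid name='A'><descriptor name='logP' value='0.0'/></aminoacid>"
ENTRY_B = "<aminoacid name='B'><descriptor name='logP' value='0.0'/></aminoacid>"


def add_descriptor(entry, name, value):
    """Append a new property to one entry; no other document is touched."""
    ET.SubElement(entry, "descriptor", {"name": name, "value": str(value)})


def get_descriptor(entry, name):
    """Return a property value as a float, or None if the entry lacks it."""
    for d in entry.findall("descriptor"):
        if d.get("name") == name:
            return float(d.get("value"))
    return None


entry_a = ET.fromstring(ENTRY_A)
entry_b = ET.fromstring(ENTRY_B)

# A newly measured property is attached to entry A only.
add_descriptor(entry_a, "new_property", 1.5)

print(get_descriptor(entry_a, "new_property"))
print(get_descriptor(entry_b, "new_property"))
```

In a relational design, the equivalent change would typically mean altering a table definition shared by every row; here, entries that lack the new property are simply reported as having no value for it.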
Data structures of the database
Since one XML document in the database corresponds to one amino acid, it is easy to add a new document/new amino acid containing considerable information. Figure 1 shows a schematic overview of the data structure we use to illustrate the relationships among XML elements in an AA-QSPR Db XML file. We use the common name and Chemical Abstracts Service (CAS) registry number (SciFinder: http://www.cas.org/SCIFINDER/) to identify an amino acid and use Simplified Molecular Input Line Entry System (SMILES: see Weininger, 1988), a linear structural representation, and molecular formula to provide general structure information. An XML tag created explicitly for this project (foundin) acts as a major classifier, defining biosynthetic, coded, abiotic and engineered amino acids by the source(s) from which an amino acid has been identified. For example, the standard amino acid alanine has been found in the Murchison meteorite (Cronin and Moore, 1971) and in prebiotic chemistry experiments (Miller, 1953), which suggests it can be synthesized abiotically; therefore, we label it as abiotic and biosynthetic inside foundin tags (see Supplementary XML file of amino acid alanine). In the GeneralInfo element, an attribute called coded is used to indicate whether the amino acid is one of the twenty standard amino acids. The biophysical properties of an amino acid are described in the XML element descriptors, which includes the biophysical property name (e.g. log P), associated values, whether these values are predicted or experimentally determined and a reference or source for the property value (those used here have been previously evaluated for accuracy, in Lu and Freeland, 2006b, showing >95% correlation for size and charge with experimentally determined equivalents).
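As a rough illustration of how such an entry might be read by an external program, the sketch below parses a hypothetical alanine document with Python's standard `xml.etree.ElementTree`. Only the names `foundin`, `GeneralInfo`, `coded` and `descriptors` come from the description above; the root element, the nesting, the remaining tag and attribute names and the numeric value (a placeholder, not a database value) are assumptions and may differ from the published schema in the Supplementary material.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry. Only foundin, GeneralInfo, coded and descriptors are
# named in the text; the layout, other names and the value are assumptions.
ALANINE_XML = """
<aminoacid>
  <GeneralInfo coded="yes">
    <name>alanine</name>
    <smiles>CC(N)C(=O)O</smiles>
    <formula>C3H7NO2</formula>
    <foundin>abiotic</foundin>
    <foundin>biosynthetic</foundin>
  </GeneralInfo>
  <descriptors>
    <descriptor name="log P" value="0.0" type="predicted" source="placeholder"/>
  </descriptors>
</aminoacid>
"""


def summarize(xml_text):
    """Extract identity, classification and descriptors from one entry."""
    root = ET.fromstring(xml_text)
    info = root.find("GeneralInfo")
    return {
        "name": info.findtext("name"),
        "coded": info.get("coded") == "yes",
        "foundin": [e.text for e in info.findall("foundin")],
        # property name -> (value, predicted or experimental)
        "descriptors": {
            d.get("name"): (float(d.get("value")), d.get("type"))
            for d in root.iter("descriptor")
        },
    }


entry = summarize(ALANINE_XML)
print(entry["name"], entry["coded"], entry["foundin"])
```

Keeping the classifier as repeated `foundin` elements, rather than a single field, is what lets one amino acid carry several labels at once, as in the alanine example.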
In Supplementary material, we include the XML schema for the AA-QSPR Db. A primary aim of our database is to encourage community development: in other words, to encourage all interested parties to contribute new molecules and new information for existing molecules. We therefore provide simple web forms in which members of the user community can submit new information. To prevent vandalism, only registered users can upload new XML files into the database. However, the registration process is simple. After entering a name, email address and affiliation, users receive an email with a randomly generated password, which can be changed later. A correct match of username and password is necessary for a user to upload his/her data. User-contributed amino acid XML files contain the username of the contributor for future reference. As a further safeguard, we back up the database once every 24 h, allowing us to restore it to a recent version if any major problems occur.
Database web interfaces and visualization/analysis tools
Our database includes user-friendly web functions developed to help non-computer scientists navigate the website and database. An online help manual, which includes a comprehensive tutorial, is readily accessible to users. Clearly marked on the main page (http://www.evolvingcode.net:8080/AA-QSPR/html/), links lead users to the following functions: viewing XML entries in both XML and HTML formats; searching the database with keywords (e.g. amino acid common name); downloading data (including individual XML entries or entire data sets of the database) in ASCII format or as an XML schema; creating an XML file using an online form and uploading newly created XML files to the database (registration required); and converting SMILES to Structure Data File (SDF) format (Dalby et al., 1992) on-the-fly, so as to view two-dimensional amino acid molecular structures with JMOL (http://jmol.sourceforge.net/) or to download them to the user's local machine. Furthermore, we implemented an online calculation of van der Waals volume and a connection to the ALOGPS (Tetko et al., 2005) Web Service for users to predict log P values. With the help of these web functions, our database serves as both a resource and a research platform that can enrich the knowledge base of the whole scientific community.
Our database is currently equipped with two visualization tools that help users investigate the relationships between the amino acids of interest to them. One is an implementation of the KING interactive three-dimensional vector graphics software (Davis et al., 2004). This allows users to produce and manipulate interactive three-dimensional plots of chemical space for user-defined sub-sets of amino acids according to any combination of van der Waals volume, pI and log P. The second visualization method is an incorporation of the open-source package TouchGraph (http://www.touchgraph.com), which builds a minimum spanning tree from amino acids selected by users. This powerful analysis and visualization method has been widely used in many research areas to help researchers gain an intuitive insight into the complex relationships of interest (Tomii and Kanehisa, 1996; Knight et al., 2006; see Bulka et al., 2006 for a more detailed description of the method). Essentially, a user-defined, quantitative measure of the distance between amino acids is used to connect them into a tree structure in which adjoining elements are most similar to one another. Thus, for example, Fig. 2 shows a wide distribution of the 20 proteinaceous amino acids on major branches of the minimum spanning tree that is made up of 102 naturally occurring amino acids.
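The tree-building idea can be sketched in a few lines of Python. This is not the code behind the web tool: it uses Prim's algorithm as one standard way to obtain a minimum spanning tree, assumes Euclidean distance as the user-defined measure, and runs on invented placeholder coordinates (the labels `aa1` to `aa4` and their numbers are not database values).

```python
import math

# Placeholder coordinates in (van der Waals volume, pI, log P) space.
# Labels and numbers are invented for illustration only.
POINTS = {
    "aa1": (0.0, 0.0, 0.0),
    "aa2": (1.0, 0.0, 0.0),
    "aa3": (1.0, 1.0, 0.0),
    "aa4": (5.0, 5.0, 5.0),
}


def euclidean(a, b):
    """Straight-line distance between two points in property space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points, dist=euclidean):
    """Prim's algorithm: grow the tree by repeatedly adding the cheapest
    edge linking a node already in the tree to one not yet in it."""
    names = sorted(points)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        u, v = min(
            ((a, b) for a in in_tree for b in names if b not in in_tree),
            key=lambda e: dist(points[e[0]], points[e[1]]),
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges


tree = minimum_spanning_tree(POINTS)
print(tree)
```

Because volume, pI and log P are measured on different scales, a practical distance measure would usually rescale each property first; any other function can be passed as `dist` to reflect a different notion of similarity.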
Both visualization tools were implemented in a way that can cope with the growing database, allowing users full control of which amino acids and which properties are incorporated into plots. Both facilitate easy interpretation of a visualization by allowing the user to define color-coded sets of amino acids for display (e.g. to contrast biologically coded amino acids within the super-set of those that are prebiotically plausible). Both tools further allow users to add, on-the-fly, any other molecules of specific interest so as to investigate the relationships between their own compounds and the amino acids of the database. Detailed instructions and examples of using these tools are available in the online tutorial of the AA-QSPR Db.
Example analyses
To illustrate the types of exploration that our database can support, here we present two simple, quick QSAR studies of peptides that include non-standard amino acids. Each recaptures the information reported from a more costly and laborious empirical study (Ufkes et al., 1982
In a previous study (Ufkes et al., 1978, 1982), 40 pentapeptides (Supplementary Table 1), including 10 that contain non-standard amino acids at one or two of five positions, were experimentally tested for their ability to potentiate bradykinin (a pharmacologically and physiologically active nine-mer peptide from the kinin group of proteins). Using the predicted amino acid size, charge and hydrophobicity values in our database, we created a matrix of 40 rows (one for each peptide) and 15 columns (for the three property values at each of five positions: see supplemental data). Using these as predictor (independent) variables, and the experimentally determined log RAI of each peptide (the logarithm of a relative potentiating activity index: Ufkes et al., 1978, 1982) as the dependent variable, we then performed partial least squares (PLS) regression analysis on the data set. This approach is well established in chemistry for multivariate linear regression (Hattotuwagama et al., 2006; Put et al., 2006), and although there exist several variants of the precise method (e.g. one alternative to the PLS that we use is SIMPLS), these variations are equivalent if the response is uni-dimensional (Boulesteix and Strimmer, 2007). The PLS (performed by SAS 9.0) gave two clearly significant PLS components that together explained 78.4% of the variance (65.2% and 13.2%, respectively) in the potentiating activity that previous empirical studies had reported. The weights that PLS assigned to the 15 predictors for the first extracted factor are in Supplementary Table 2. The correlation coefficient between PLS-predicted and experimentally measured peptide log RAI values is 0.89 (Fig. 3 and Supplementary Table 1): in other words, the database allowed us, as users, to recapture in seconds the findings of a laborious and expensive empirical study, with a correlation of 0.89 between predicted and measured activity.
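The two computational steps described above, building the 15-column predictor matrix and fitting a single-response PLS model, can be sketched in plain Python. The analysis itself was run in SAS 9.0; the code below is not that procedure but a minimal NIPALS-style PLS1 written for illustration, and the residue labels, property values, peptides and activities are all invented placeholders rather than data from the study.

```python
import math

# Placeholder property table: residue -> (size, charge, hydrophobicity).
# Residue labels and numbers are invented for illustration only.
PROPS = {"X": (1.0, 0.0, 0.5), "Y": (2.0, 1.0, -0.5), "Z": (3.0, -1.0, 1.5)}


def encode(peptide):
    """Concatenate the three property values at each position, so that a
    pentapeptide becomes one row of 5 x 3 = 15 predictors."""
    return [v for residue in peptide for v in PROPS[residue]]


def pls1(X, y, n_components):
    """PLS regression for a single response using NIPALS deflation.
    Returns fitted y values and the fraction of y variance per component."""
    n, m = len(X), len(X[0])
    col_mean = [sum(r[j] for r in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[r[j] - col_mean[j] for j in range(m)] for r in X]  # centred X
    f = [v - y_mean for v in y]                               # centred y
    ss_y = sum(v * v for v in f)
    fitted = [y_mean] * n
    explained = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the residual y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y, then accumulate the fitted response.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        fitted = [fitted[i] + q * t[i] for i in range(n)]
        explained.append(q * q * tt / ss_y)
    return fitted, explained


# Placeholder peptides and activities (not the Ufkes et al. data).
peptides = ["XXYZZ", "YZXXY", "ZYZYX", "XYZXY"]
activity = [0.1, 0.4, 0.3, 0.9]
X = [encode(p) for p in peptides]
fitted, explained = pls1(X, activity, 2)
print(len(X), len(X[0]), explained)
```

The per-component fractions returned in `explained` play the same role as the 65.2% and 13.2% figures quoted above: summing them gives the total share of response variance captured by the retained components.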
Next, we performed PLS on another data set of 48 dipeptides (Supplementary Table 3) for which a quantitative threshold of bitterness of taste had been determined (Asao et al., 1987
Since there exist many computational chemistry programs for predicting biophysical properties (especially log P), we repeated each of these analyses using the software which we previously found to have the second best accuracy (Lu and Freeland, 2006b
Overall, these analyses are especially significant for their simplicity when placed against the increasing diversity of molecular descriptors that are entering QSAR studies (Jonsson et al., 1989; Mei et al., 2005). Thus, these two tests illustrate how our database can, in a few minutes, generate strong predictions of the results of relatively complex, expensive and time-consuming experiments.
Discussion
The AA-QSPR Db is a research tool designed to facilitate quick and easy exploration of amino acid chemical space using modern web technology. Our aim is not, of course, to replace empirical studies; rather it is to offer rapid, safe and cheap explorations of amino acid chemical space which may act to focus the time and money associated with more detailed, empirical studies. Specifically, researchers anywhere in the world can now point and click to rapidly select and explore various biophysical properties of any collection of amino acids so as to guide their analyses and experiments in protein design, origin-of-life research or bioinformatics.
Footnotes
Edited by Philipp Holliger
Acknowledgements
We would like to thank Dr Michael New at NASA, Dr Gregurick at UMBC and Dr Boulesteix at Sylvia Lawry Centre for Multiple Sclerosis Research for their insightful input. This work is supported in part by grant NNG04GJ72G from Astrobiology: Exobiology and Evolutionary Biology.
References
Asao M., Iwamura H., Akamatsu M., Fujita T. J. Med. Chem. (1987) 30:1873–1879.
Barrett T., Suzek T.O., Troup D.B., Wilhite S.E., Ngau W.C., Ledoux P., Rudnev D., Lash A.E., Fujibuchi W., Edgar R. Nucleic Acids Res. (2005) 33:D562–D566.
Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Wheeler D.L. Nucleic Acids Res. (2006) 34:D16–D20.
Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. Nucleic Acids Res. (2000) 28:235–242.
Biro J.C. Theor. Biol. Med. Model (2006) 3:15.
Boulesteix A.L., Strimmer K. Brief Bioinform. (2007) 8:32–44.
Bulka B., desJardins M., Freeland S.J. BMC Bioinformatics (2006) 7:329.
Cronin J.R., Gandy W.E., Pizzarello S. J. Mol. Evol. (1981) 17:265–272.
Cronin J.R., Moore C.B. Science (1971) 172:1327–1329.
Cronin J.R., Pizzarello S., Yuen G.U. Geochim. Cosmochim. Acta (1985) 49:2259–2265.
Cronin J.R., Pizzarello S. Geochim. Cosmochim. Acta (1986) 50:2419–2427.
Cronin J.R., Pizzarello S. Science (1997) 275:951–955.
Czajgucki Z., Andruszkiewicz R., Kamysz W. J. Pept. Sci. (2006) 12:653–662.
Dalby A., Nourse J.G., Hounshell W.D., Gushurst A.K.I., Grier D.L., Leland B.A., Laufer J. J. Chem. Inf. Comput. Sci. (1992) 32:244–255.
Davis I.W., Murray L.W., Richardson J.S., Richardson D.C. Nucleic Acids Res. (2004) 32:W615–W619.
Garrett R.H., Grisham C.M. Biochemistry (1999) Orlando, Florida: Saunders College Publishing.
Glavin D.P., Bada J.L. Astrobiology (2001) 1:259–269.
Grantham R. Science (1974) 185:862–864.
Guan P., Doytchinova I.A., Walshe V.A., Borrow P., Flower D.R. J. Med. Chem. (2005) 48:7418–7425.
Haidacher D., Vailaya A., Horvath C. Proc. Natl Acad. Sci. USA (1996) 93:2290–2295.
Hattotuwagama C.K., Toseland C.P., Guan P., Taylor D.J., Hemsley S.L., Doytchinova I.A., Flower D.R. J. Chem. Inf. Model (2006) 46:1491–1502.
Hendrickson T.L., de Crecy-Lagard V., Schimmel P. Annu. Rev. Biochem. (2004) 73:147–176.
Jonsson J., Eriksson L., Hellberg S., Sjostrom M., Wold S. Quant. Struct. Act. Relat. (1989) 8:204–209.
Kawashima S., Ogata H., Kanehisa M. Nucleic Acids Res. (1999) 27:368–369.
Knight C.G., Zitzmann N., Prabhakar S., Antrobus R., Dwek R., Hebestreit H., Rainey P.B. Nat. Genet. (2006) 38:1015–1022.
Link A.J., Mock M.L., Tirrell D.A. Curr. Opin. Biotechnol. (2003) 14:603–609.
Lu Y., Freeland S.J. Genome Biol. (2006a) 7:102.
Lu Y., Freeland S.J. Astrobiology (2006b) 6:606–624.
Mei H., Liao Z.H., Zhou Y., Li S.Z. Biopolymers (2005) 80:775–786.
Meierhenrich U.J., Munoz Caro G.M., Bredehoft J.H., Jessberger E.K., Thiemann W.H. Proc. Natl Acad. Sci. USA (2004) 101:9182–9186.
Meylan W.M., Howard P.H. J. Pharm. Sci. (1995) 84:83–92.
Miller S.L. Science (1953) 117:528–529.
Miller S.L. Chem. Scr. (1986) 26B:5–11.
Put R., Daszykowski M., Baczek T., Vander Heyden Y. J. Proteome Res. (2006) 5:1618–1625.
Qiu J.X., Petersson E.J., Matthews E.E., Schepartz A. J. Am. Chem. Soc. (2006) 128:11338–11339.
Summerer D., Chen S., Wu N., Deiters A., Chin J.W., Schultz P.G. Proc. Natl Acad. Sci. USA (2006) 103:9785–9789.
Tetko I.V., et al. J. Comput. Aided Mol. Des. (2005) 19:453–463.
Tomii K., Kanehisa M. Protein Eng. (1996) 9:27–36.
Ufkes J.G., Visser B.J., Heuver G., Van der Meer C. Eur. J. Pharmacol. (1978) 50:119–122.
Ufkes J.G., Visser B.J., Heuver G., Wynne H.J., Van der Meer C. Eur. J. Pharmacol. (1982) 79:155–158.
Uy R., Wold F. Science (1977) 198:890–896.
Venton B.J., Robinson T.E., Kennedy R.T., Maren S. Eur. J. Neurosci. (2006) 23:3391–3398.
Weber A.L., Miller S.L. J. Mol. Evol. (1981) 17:273–284.
Wang L., Xie J., Schultz P.G. Annu. Rev. Biophys. Biomol. Struct. (2006) 35:225–249.
Weininger D. J. Chem. Inf. Comput. Sci. (1988) 28:31.
Received January 1, 2007; revised April 9, 2007; accepted May 17, 2007.