Protein Engineering, Vol. 12, No. 12, 1029-1030,
December 1999
© 1999 Oxford University Press
COMMUNICATIONS
Rational structural genomics: affirmative action for ORFans and the growth in our structural knowledge
Faculty of Natural Science, Department of Mathematics and Computer Science, Ben Gurion University, Beer-Sheva 84015, Israel
| Introduction |
|---|
The determination of the complete genome sequences of organisms is producing an avalanche of protein sequences awaiting further structural and functional interpretation. Only a small fraction of the proteins encoded in these genomes has been experimentally studied, but putative functions for roughly 70% of the ORFs can be assigned via homology with characterized proteins in the databases. Similarly, although only a very small number of structures have been determined for these proteins, putative three-dimensional (3D) structures can currently be assigned to roughly 30% of the ORFs using fold assignment computational methods. Here I address the following questions. How fast is our structural knowledge growing? What is the distribution of assigned folds in the different functional categories? How might structure determination efforts be prioritized for maximum information and impact?
I have analyzed the 3D fold assignments for the genome of Mycoplasma genitalium (Fraser et al., 1995), which, owing to its small size, has served as a minimal model organism for various studies. Several publications have reported different fractions of the genome for which 3D folds can be assigned. The earliest works reported fractions as low as 9 and 12% (Casari et al., 1996; Frishman and Mewes, 1997; Gerstein, 1997). Later works using methods aimed at detecting more distant relationships increased this fraction to 25% (Fischer and Eisenberg, 1997) and, more recently, to around 40% (Huynen et al., 1998; Rychlewski et al., 1998; Teichmann et al., 1998; Jones, 1999; Wolf et al., 1999 and others; for recent reviews on this topic see Fischer and Eisenberg, 1999a; Teichmann et al., 1999). The differences in the reported fractions depend mainly on (i) the methods' sensitivities (the rate of true positives) and their selectivities (the rate of false positives); (ii) whether assignments are counted for full structural domain matches or for only small sequence-structure segments; and (iii) the date on which the study was done (which determines the number of known sequences and structures and hence the number of sequences that can be assigned to known folds).
To evaluate how much the increase in the fraction of assignable ORFs depends on the number of available folds, I have compared the fold assignments of M.genitalium proteins obtained by one particular method using three different sets of structures. The method used in this comparison (Fischer and Eisenberg, 1997) is aimed at detecting full structural domain matches and uses rather conservative thresholds (the choice of method is not critical to this comparison; qualitatively similar results are likely to be obtained with any other method). When using only those structures available before 1996, only 20% of the genome could be assigned a fold. With structures from the PDB available in April 1997, 25% of the genome was assigned a fold (Fischer and Eisenberg, 1997). When using all the structures available in October 1998, the fraction of assigned proteins reached 32% (see http://www.doe-mbi.ucla.edu/people/frsvr/preds/MG/MG.html). This indicates that, because of the availability of more structures, the fraction of assignable ORFs has increased at an annual rate of roughly 18% (Fischer and Eisenberg, 1999a; see also Teichmann et al., 1999 and references therein).
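The text does not say how the annual rate was derived. One reading that reproduces a figure close to 18% is a compound (annualized) growth rate between the April 1997 and October 1998 snapshots, taken here as 1.5 years apart. The following Python sketch works under that assumption; the function name and the choice of compounding are illustrative, not taken from the cited papers.

```python
def annual_growth_rate(start_fraction: float, end_fraction: float, years: float) -> float:
    """Compound annual growth rate between two coverage fractions.

    Assumption: growth is compounded yearly; the source does not state
    which formula was used to obtain its 'roughly 18%'.
    """
    if start_fraction <= 0 or years <= 0:
        raise ValueError("start_fraction and years must be positive")
    return (end_fraction / start_fraction) ** (1.0 / years) - 1.0


# Coverage figures quoted in the text: 25% (April 1997) and 32% (October 1998).
# The 1.5-year interval is an approximation of the time between the two snapshots.
rate = annual_growth_rate(0.25, 0.32, 1.5)
print(f"Annualized growth in assignable ORFs: {rate:.1%}")
```

Under these assumptions the result is just under 18% per year, in line with the figure quoted above. A simple (non-compounded) rate over the same interval would give a slightly different number, so the value should be read as approximate.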
Will the rate of increase in fold assignment be sustained throughout the next few years? To address this question, I have analyzed the distribution of the fold assignments of M.genitalium among the various functional categories described by Fraser et al. (1995). Table I shows that the three categories with the largest percentages of folds assigned are purine metabolism, energy metabolism and translation-tRNA. For example, all but two ORFs in the first category have been assigned a fold. As expected, and mostly due to the difficulties in determining the structures of membrane proteins, the three least covered categories are cell envelope, unknown and transport. The last column in Table I shows that the largest numbers of non-membrane proteins with no assigned fold belong to the unknown and ribosomal categories (ORFs characterized as membrane proteins or as having putative transmembrane helices were excluded).
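A per-category tally of the kind summarized in Table I can be sketched as follows. The records below are toy values invented purely for illustration; they are not the M.genitalium data, and the function and field names are hypothetical.

```python
from collections import defaultdict


def coverage_by_category(orfs):
    """Tally fold-assignment coverage per functional category.

    orfs: iterable of (category, has_fold, is_membrane) tuples.
    Returns a dict mapping category -> counts of total ORFs, ORFs with an
    assigned fold, and unassigned non-membrane ORFs.
    """
    stats = defaultdict(lambda: {"total": 0, "assigned": 0, "unassigned_soluble": 0})
    for category, has_fold, is_membrane in orfs:
        entry = stats[category]
        entry["total"] += 1
        if has_fold:
            entry["assigned"] += 1
        elif not is_membrane:
            # Unassigned ORFs that are not membrane proteins
            entry["unassigned_soluble"] += 1
    return dict(stats)


# Toy records (hypothetical, NOT real genome annotations)
toy_orfs = [
    ("purine metabolism", True, False),
    ("purine metabolism", True, False),
    ("purine metabolism", False, False),
    ("transport", False, True),
    ("transport", False, True),
    ("transport", True, False),
    ("unknown", False, False),
    ("unknown", False, False),
    ("unknown", False, True),
]

stats = coverage_by_category(toy_orfs)
# Rank categories by the number of unassigned non-membrane ORFs, largest first
ranked = sorted(stats, key=lambda c: stats[c]["unassigned_soluble"], reverse=True)
for category in ranked:
    s = stats[category]
    print(f"{category}: {s['assigned']}/{s['total']} assigned, "
          f"{s['unassigned_soluble']} unassigned non-membrane")
```

Ranking by unassigned non-membrane ORFs is one possible way to surface the categories where new structures would add the most coverage; other weightings are equally plausible.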
The fraction of assignable ORFs will undoubtedly continue to grow in the next few years, because new structures will continue to be determined in most of the functional categories. However, because in several functional categories only a few ORFs lack structural assignments, if structure determination continues to concentrate on the best represented categories, the fraction of assignable ORFs will soon reach a plateau. A `rational' approach to structural genomics (Fischer and Eisenberg, 1997
| Notes |
|---|
1 To whom correspondence should be addressed; email: dfischer{at}cs.bgu.ac.il
| References |
|---|
Casari,G., Ouzounis,C., Valencia,A. and Sander,C. (1996) GeneQuiz II: Automatic Function Assignment for Genome Sequence Analysis. In First Annual Pacific Symposium on Biocomputing. World Scientific, Hawaii, pp. 707-709.
Dujon,B. et al. (1994) Nature, 369, 371-377.
Fischer,D. and Eisenberg,D. (1997) Proc. Natl Acad. Sci. USA, 94, 11929-11934.
Fischer,D. and Eisenberg,D. (1999a) Curr. Opin. Struct. Biol., 9, 208-211.
Fischer,D. and Eisenberg,D. (1999b) Bioinformatics, 15, 759-762.
Fraser,C. et al. (1995) Science, 270, 397-403.
Frishman,D. and Mewes,H.-W. (1997) Nature Struct. Biol., 4, 626-628.
Gerstein,M. (1997) J. Mol. Biol., 274, 562-576.
Goffeau,A. et al. (1996) Science, 274, 546-547.
Huynen,M., Doerks,T., Eisenhaber,F., Orengo,C., Sunyaev,S., Yuan,Y. and Bork,P. (1998) J. Mol. Biol., 280, 323-326.
Jones,D. (1999) J. Mol. Biol., 287, 797-815.
Kim,S.H. (1997) Nature Struct. Biol., 5, 643-645.
Rost,B. (1998) Structure, 6, 259-263.
Rychlewski,L., Zhang,B. and Godzik,A. (1998) Folding Des., 3, 229-236.
Teichmann,S., Park,J. and Chothia,C. (1998) Proc. Natl Acad. Sci. USA, 95, ???-???.
Teichmann,S., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390-399.
Wolf,Y., Brenner,S., Bash,P. and Koonin,E. (1999) Genom. Res., 9, 17-26.
Zarembinski,T. et al. (1998) Proc. Natl Acad. Sci. USA, 95, 15189-15193.
Received June 28, 1999; revised September 9, 1999; accepted September 9, 1999.