Skip Navigation

This Article
Right arrow Full Text Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow An erratum has been published
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (11)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by May, A. C. W.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by May, A. C. W.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Protein Engineering, Vol. 14, No. 4, 209-217, April 2001
© 2001 Oxford University Press

Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics

Alex C. W. May

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 lAA, UK.E-mail: amay{at}nimr.mrc.ac.uk

Hierarchical classification is probably the most popular approach to group-related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled with the need for subjective thresholds to define group membership (e.g. `significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to cluster-related proteins without the need for an a priori threshold, first, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a `core–periphery' structure within families. The `core' of a protein family comprises the most typical sequences while the `periphery' consists of the atypical ones. Further, a new sequence weighting scheme that combines the information in all the multiply aligned positions of an alignment in a novel way is put forward. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Protein Eng Des SelHome page
W. R. Taylor, J. Heringa, F. Baud, and T. P. Flores
A Fourier analysis of symmetry in protein structure
Protein Eng. Des. Sel., February 1, 2002; 15(2): 79 - 89.
[Abstract] [Full Text] [PDF]



Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.