Protein Engineering vol. 4 no. 2 pp. 149-154, 1990
© 1990 Oxford University Press
Tests for the statistical significance of protein sequence similarities in data-bank searches
Laboratory of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA and 2Department of Applied Statistics, University of Reading, Whiteknights, Reading RG6 2AN, UK
1To whom correspondence should be addressed
A suite of tests to evaluate the statistical significance of protein sequence similarities is developed for use in data-bank searches. The tests are based on the Wilbur-Lipman word-search algorithm, and take into account the sequence lengths and compositions, and optionally the weighting of amino acid matches. The method is extended to allow for the existence of a sequence insertion/deletion within the region of similarity. The accuracy of the statistical distributions underlying the tests is validated using randomly generated sequences and real sequences selected at random from the data banks. A computer program to perform the tests is briefly described.
Keywords: data-bank searches/sequence similarity/statistical significance/Wilbur-Lipman word-search algorithm
Received June 13, 1990; accepted September 26, 1990.
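As an informal illustration of the idea summarized in the abstract, the sketch below scores two sequences by counting matching k-letter words along each diagonal, in the spirit of a word-search comparison, and then estimates significance empirically by shuffling one sequence, which preserves its length and composition. This is a simplified Monte Carlo stand-in and not the tests developed in the paper; the function names, the word length `k`, and the number of trials are all assumptions made for the example.

```python
import random
from collections import defaultdict


def best_diagonal_word_count(a, b, k=2):
    """Largest number of matching k-letter words found on any one diagonal.

    Hypothetical helper: a simplified word-match score, not the paper's statistic.
    """
    # Index every k-letter word of sequence a by its start positions.
    positions = defaultdict(list)
    for i in range(len(a) - k + 1):
        positions[a[i:i + k]].append(i)

    # A match between a[i:i+k] and b[j:j+k] lies on diagonal i - j.
    diagonals = defaultdict(int)
    for j in range(len(b) - k + 1):
        for i in positions.get(b[j:j + k], ()):
            diagonals[i - j] += 1

    return max(diagonals.values(), default=0)


def shuffle_p_value(a, b, k=2, trials=200, seed=0):
    """Empirical p-value from shuffling b (keeps its length and composition).

    Assumption: shuffling is used here in place of an analytic distribution.
    """
    rng = random.Random(seed)
    observed = best_diagonal_word_count(a, b, k)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if best_diagonal_word_count(a, "".join(letters), k) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNPQRSTVWY"
    score, p = shuffle_p_value(seq, seq)
    print(score, p)
```

Shuffling keeps the example short, but it needs many trials to resolve small probabilities, which is one reason an analytic distribution is attractive when a whole data bank has to be searched.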