Andrew F.W. Coulson
Biocomputing Research Unit, ICMB, Darwin Building, University of Edinburgh, Kings Buildings, Mayfield Road, Edinburgh EH9 3JR, United Kingdom, a.coulson@ed.ac.uk

Fold recognition by sequence similarity

Two sequence databases were created. 'U' consisted of the Swissprot sequence entries corresponding to Sander's representative set of structures. 'EU' consisted of the Swissprot entries for all the sequences referenced in HSSP; in other words, it contains every sequence entry of which some part or parts are so strongly similar to a protein of known structure that those parts can be unequivocally aligned with a known fold. Two other databases were also used: NRL3D and Swissprot.

Searches were made with each unknown sequence against one or more of these databases, using the Smith and Waterman 'Best Local Similarity' algorithm, implemented on the AMT Distributed Array Processor or the MasPar MP-1. Scoring tables were derived by the Dayhoff prescription for various evolutionary distances, expressed in PAMs (1 PAM corresponds to 1 accepted point mutation per 100 residues of sequence). Conservative gap penalties were used, in a range in which experience shows that the appearance and length of gaps are not strongly dependent on the gap penalty value.
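The local similarity search described above can be illustrated with a minimal sketch of the Smith and Waterman recurrence. This is not the parallel implementation used on the DAP or the MasPar: it runs serially, uses a single linear gap penalty, and substitutes a hypothetical identity-based scoring function (toy_score) for a Dayhoff PAM table. The function names and the gap value of 8 are assumptions chosen only for illustration.

```python
def toy_score(x, y):
    # Hypothetical stand-in for a PAM table: reward identity, penalise mismatch.
    return 5 if x == y else -4


def smith_waterman(a, b, score=toy_score, gap=8):
    """Return (best_score, end_in_a, end_in_b) for the best local alignment.

    Positions are 1-based end coordinates; (0, 0, 0) means no positive-scoring
    local similarity was found. A linear gap penalty is assumed.
    """
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            h = max(
                0,                                      # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] - gap,                          # gap in b
                curr[j - 1] - gap,                      # gap in a
            )
            curr[j] = h
            if h > best[0]:
                best = (h, i, j)
        prev = curr
    return best
```

For example, smith_waterman("ACDEFG", "XXCDEFYY") finds the shared segment CDEF. A database search repeats this comparison for every database entry and ranks the entries by score.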

The output lists were scanned for significant similarities to proteins of known (or reliably inferred) structure, using the statistical criterion described in Methods in Enzymology 1990, 183, 474-486. Confirmation of potential positive hits was sought by repeated searches, using scoring tables with a varied range of PAM parameters, and by identification of a multiplicity of less significant hits on sequences of the same structure.
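The confirmation step can be sketched as a simple tally over the hit lists from repeated searches. The sketch below does not reproduce the statistical criterion of the cited paper; it assumes that each search has already been reduced to a list of (entry_id, fold) pairs judged worth a second look, and the function name, data layout, and thresholds (min_runs, min_entries) are all hypothetical.

```python
from collections import defaultdict


def corroborated_folds(runs, min_runs=2, min_entries=2):
    """Return folds supported by several PAM settings or several sequences.

    runs maps a PAM value to a list of (entry_id, fold) candidate hits.
    A fold is kept if it appears in at least min_runs searches, or is hit
    through at least min_entries distinct database entries.
    """
    pams_by_fold = defaultdict(set)      # fold -> PAM values that produced a hit
    entries_by_fold = defaultdict(set)   # fold -> distinct sequences hit
    for pam, hits in runs.items():
        for entry_id, fold in hits:
            pams_by_fold[fold].add(pam)
            entries_by_fold[fold].add(entry_id)
    return sorted(
        fold
        for fold in pams_by_fold
        if len(pams_by_fold[fold]) >= min_runs
        or len(entries_by_fold[fold]) >= min_entries
    )
```

A fold that turns up only once, under a single scoring table and through a single sequence, is dropped; one that recurs across tables or across related sequences is passed on for closer inspection.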

Finally, the sequence surroundings of a potential positive hit were examined by sequence comparison using a 'flat' comparison table (generated with a high PAM value), and the structure surroundings by molecular graphics and secondary structure prediction. Reported predictions are cases in which a significant local similarity could be plausibly extended over a substantial part of the unknown sequence.

Assessment of the performance will require access to all the 'threading' example structures, since not all predictions were submitted (either because they were negative, or because of pressure of time), but it is clear that none of the databases used was ideal. The most important improvement to the method would probably be the construction of a better database, in which sequences are flagged by a structural classification.

CONF-941241
Last modified on 1-11-95