Discovering Motifs in Scientific Databases

Project Summary

Scientific progress often results from reducing raw data to abstractions that one can reason about. In molecular biology, the raw data is sequence, tree, graph, or geometric information, and the abstract patterns may be sequence motifs that express certain functionality. Helping scientists discover such patterns is the goal of this work. Pattern discovery will entail systematically generating candidate patterns and testing them against the data. The tests will be based on approximate pattern matching algorithms that yield distance metrics. The main contribution of this research project will be a family of algorithms for pattern discovery, query processing, data organization, and index manipulation. Whereas some of our algorithms will be specific to the combinatorial structures present in biology, many of the techniques should generalize to any application that seeks to find patterns in databases.
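
To make the generate-and-test idea concrete for sequence data, the following is a minimal sketch and not the project's actual algorithms. It assumes that candidate patterns are the length-k substrings of the input sequences, that the distance is unit-cost edit distance between a candidate and its best-matching substring of a sequence, and that a candidate counts as a motif when enough sequences match it closely. The names approx_distance, discover_motifs, max_dist, and min_support are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def approx_distance(pattern: str, text: str) -> int:
    """Smallest unit-cost edit distance between pattern and any substring of text."""
    # Row 0 is all zeros so that a match may start anywhere in text.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            cost = 0 if p == t else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # pattern character left unmatched
                          curr[j - 1] + 1)     # extra text character inside the match
        prev = curr
    # Taking the minimum over the last row lets a match end anywhere in text.
    return min(prev)


def discover_motifs(seqs: List[str], k: int, max_dist: int,
                    min_support: int) -> Dict[str, int]:
    """Generate length-k candidates and keep those that enough sequences match."""
    # Assumption: candidates are drawn from the data itself (every k-length substring).
    candidates = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    motifs = {}
    for cand in sorted(candidates):
        # Test step: count sequences containing the candidate within max_dist edits.
        support = sum(1 for s in seqs if approx_distance(cand, s) <= max_dist)
        if support >= min_support:
            motifs[cand] = support
    return motifs
```

Calling discover_motifs(["AACGT", "TACGA", "GGGGG"], k=3, max_dist=0, min_support=2) keeps only the candidate "ACG", which occurs exactly in two of the three sequences. Raising max_dist admits approximate occurrences. This brute-force test scans every sequence for every candidate, which is the kind of cost that data organization and indexing would aim to reduce.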


This material is based upon work partly supported by the United States National Science Foundation under grant IRI-9224602 (1992-1997). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This support is greatly appreciated.