Pattern Discovery in Combinatorial Databases: Algorithms, Applications, and Software for the Scientific Community

Project Summary

Combinatorial databases store structures such as sequences, trees, and graphs as well as simple attributes. Such databases arise in biology, chemistry, linguistics, document retrieval, botany, neuroanatomy, and many other fields. An important set of queries on such databases concern the comparison and retrieval of similar combinatorial structures; for example, find the graph in the database that is most similar to a certain pattern graph according to a given pattern metric M. Such pattern matching queries are useful when comparing a new protein or chemical to an existing set. A less constrained but equally important problem is pattern discovery which is to find the pattern that approximately characterizes a set of structures in the database given a pattern metric. Pattern discovery is computationally expensive---sometimes requiring more than a day to complete an analysis of even a moderately sized database---the good news is that it is easily parallelizable.

The first and main goal of our proposed research is to provide software to the scientific community embodying pattern matching and pattern discovery algorithms of our own or others' invention. This will constitute a "discovery framework" whereby matching algorithms can semi-automatically be converted to pattern discovery algorithms in this domain.

As a second goal and to overcome the computational time barrier, our software will incorporate a tool developed for fault-tolerant parallel pattern discovery on idle workstations. This "Persistent Linda" software can apply to pattern discovery of all types (e.g., record-oriented data mining) not just for combinatorial structures. Recently an experiment with 20 machines at AT&T Bell Laboratories in Whippany has shown the technology to be very effective, though the system has some scalability problems.

A third goal is to continue our work on the discovery and exploitation of query processing heuristics for combinatorial data mining. The most important will be a new data structure for searching for objects based on an arbitrary (possibly infinite dimensional) distance metric.

A fourth goal is to explore query language issues, because there have been requests for such a language from our user community. In addition to the classic query language goals of selecting sets of items based on attribute values, such a language must be able to select substructures within items based on an approximation metric.


This material is based upon work partly supported by the United States National Science Foundation under grant IRI-9531548 (1996-2000). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This support is greatly appreciated.