ASES: An Approximate Search Engine for Structure
IIS-9988345 (PI: Dennis Shasha) and IIS-9988636 (PI: Jason T. L. Wang)
Dennis Shasha, Professor
Courant Institute of Mathematical Sciences
New York University
New York, New York 10012
Phone: (212) 998-3086
Fax: (212) 995-4123
E-mail: shasha@cs.nyu.edu
URL: http://cs.nyu.edu/cs/faculty/shasha
Jason T. L. Wang, Professor
College of Computing Sciences
New Jersey Institute of Technology
Newark, New Jersey 07102
Phone: (973) 596-3396
Fax: (973) 596-5777
E-mail: wangj@njit.edu
URL: http://web.njit.edu/~wangj
Keywords
Search methods and
software, structural data and XML documents, biology and chemistry, indexing
techniques and query languages, pattern matching and recognition, scientific
data mining
Project Summary
The goal of this research project is to make it possible to process approximate queries
on combinatorial structures such as trees and graphs in an efficient manner.
The ultimate goal is to be as fast as approximate keyword search in text documents.
The approach consists of: (1) discovering the best heuristics for answering these NP-complete problems
in polynomial time with high quality,
and (2) discovering data structures to make queries on thousands or millions of such objects fast.
The results of this project will provide techniques and software to search for
patterns among graph and tree structured data.
Possible applications of such a search engine include searches
among proteins, chemical compounds, neuroanatomical structures, Web/text filters and XML documents.
The techniques developed in this research are in particular suitable for bioinformatics and Web applications.
Publications and Products
Papers:
-
C. Di Pietro, A. Ferro, T. Maugeri, G. Pigola, A. Pulvirenti, D. Shasha,
V. Di Pietro, G. Emmanuele, E. Modica, M. Purrello, M. Ragusa, M. Scalia,
S. Travali and V. Zimmitti. ANTICLUSTAL: Multiple Sequence Alignment by
Antipole Clustering and Linear Approximate 1-Median Selection.
Proceedings of the IEEE Computational Systems Bioinformatics Conference,
Stanford, California, 2003.
-
Y. Zhu and D. Shasha.
Efficient Elastic Burst Detection in Data Streams.
Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2003.
-
A. Lerner and D. Shasha.
AQuery: Query Language for Ordered Data, Optimization Techniques, and Experiments.
Proceedings of the International Conference on Very Large Data Bases, 2003.
-
J. T. L. Wang, H. Shan, D. Shasha and W. H. Piel.
TreeRank: A Similarity Measure for Nearest Neighbor Searching
in Phylogenetic Databases.
Proceedings of the 15th International Conference on Scientific
and Statistical Database Management, 2003, pp. 171-180.
-
Y. Zhu and D. Shasha. Warping Indexes
with Envelope Transforms for Query by Humming.
Proceedings of the ACM SIGMOD International Conference
on Management of Data, San Diego, California, 2003, pp. 181-192.
-
Y. Zhu and D. Shasha.
Fast Approaches to Simple Problems in Financial Time Series Streams.
Proceedings of the ACM SIGMOD Workshop on Management and Processing of Data Streams,
San Diego, California, 2003.
-
X. Wang, J. T. L. Wang, D. Shasha, B. A. Shapiro, I. Rigoutsos and K. Zhang.
Finding Patterns in Three Dimensional Graphs:
Algorithms and Applications to Scientific Data Mining.
IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 4, 2002, pp. 731-749.
-
S. Zhang, J. T. L. Wang and K. G. Herbert.
XML Query by Example.
International Journal of Computational Intelligence and Applications,
Vol. 2, No. 3, 2002, pp. 329-337.
-
J. T. L. Wang, K. Zhang, G. Chang and D. Shasha.
Finding Approximate Patterns in Undirected Acyclic Graphs.
Pattern Recognition, Vol. 35, No. 2, 2002, pp. 473-483.
-
M. M. Yin and J. T. L. Wang.
Mining Genes in DNA Using GeneScout.
Proceedings
of the IEEE International Conference on Data Mining, Maebashi
City, Japan, 2002, pp. 733-736.
-
Y. Zhu and D. Shasha.
StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time.
Proceedings of the International Conference on Very Large Data Bases,
Hong Kong, China, 2002, pp. 358-369.
-
S. Zhang, L. Liao, J.-F. Tomb and J. T. L. Wang.
Clustering and Classifying Enzymes in Metabolic Pathways:
Some Preliminary Results.
Proceedings of the 2nd ACM SIGKDD Workshop on Data Mining
in Bioinformatics, Edmonton, Alberta, Canada, 2002, pp. 19-24.
-
D. Shasha, J. T. L. Wang, H. Shan and K. Zhang.
ATreeGrep: Approximate Searching in Unordered Trees.
Proceedings of the 14th International Conference on Scientific
and Statistical Database Management, Edinburgh, Scotland, 2002, pp. 89-98.
-
H. Shan, K. G. Herbert, W. Piel, D. Shasha and J. T. L. Wang.
A Structure-Based Search Engine for Phylogenetic Databases.
Proceedings of the 14th International Conference on
Scientific and Statistical Database Management,
Edinburgh, Scotland, 2002, pp. 7-10.
-
D. Shasha, J. T. L. Wang and R. Giugno.
Algorithmics and Applications of Tree and Graph Searching.
Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems,
Madison, Wisconsin, 2002, pp. 39-52.
-
Q. Ma, J. T. L. Wang and J. R. Gattiker.
Mining Biomolecular Data Using Background Knowledge and
Artificial Neural Networks.
In Handbook of Massive Data Sets,
(eds. J. Abello, P. M. Pardalos and M. G. C. Resende),
Chapter 30, Kluwer Academic, 2002, pp. 1141-1168.
-
M. M. Yin and J. T. L. Wang.
Effective Hidden Markov Models for Detecting Splicing Junction Sites
in DNA Sequences. Information Sciences, Vol. 139, 2001, pp. 139-163.
-
D. Shasha, A. Kouranov, L. Lejay, M. Chou and G. Corruzi.
Using Combinatorial Design to Study Regulation by Multiple Input Signals.
A Tool for Parsimony in the Post-Genomics Era.
Plant Physiology, Vol. 127, No. 4, 2001, pp. 1590-1594.
-
K. Birnbaum, P. N. Benfey and D. E. Shasha.
cis Element/Transcription Factor Analysis (cis/TF):
A Method for Discovering Transcription Factor/cis Element Relationships.
Genome Res., Vol. 11, 2001, pp. 1567-1573.
-
Q. Ma, J. T. L. Wang, D. Shasha and C. H. Wu.
DNA Sequence Classification via an Expectation Maximization Algorithm
and Neural Networks: A Case Study.
IEEE Transactions on Systems, Man, and Cybernetics,
Part C: Applications and Reviews, Vol. 31, No. 4, 2001, pp. 468-475.
-
J. T. L. Wang, Q. Ma, D. Shasha and C. H. Wu.
New Techniques for Extracting Features from Protein Sequences.
IBM Systems Journal,
Special Issue on Deep Computing for Life Sciences, Vol. 40, No. 2, 2001, pp. 426-441.
-
M. Cochinwala, V. Kurien, G. Lalk and D. Shasha.
Efficient Data Reconciliation. Information Sciences,
Vol. 137, 2001, pp. 1-15.
-
J. T. L. Wang and K. Zhang.
Finding Similar Consensus between Trees: An Algorithm and a Distance Hierarchy.
Pattern Recognition, Vol. 34, No. 1, 2001, pp. 127-137.
-
J. T. L. Wang, Q. Ma and K. G. Herbert.
Software Engineering and Knowledge Engineering
Issues in Bioinformatics.
In Handbook of Software Engineering and Knowledge Engineering,
Vol. 1, Fundamentals, (ed. S. K. Chang),
Chapter 30, World Scientific,
2001, pp. 719-732.
-
X. Wang, J. T. L. Wang, K.-I. Lin, D. Shasha, B. A. Shapiro and K. Zhang.
An Index Structure for Data Mining and Clustering.
Knowledge and Information Systems, Vol. 2, 2000, pp. 161-184.
-
J. T. L. Wang and K. Zhang.
Identifying Consensus of Trees through Alignment.
Information Sciences, Vol. 126, 2000, pp. 165-189.
-
J. T. L. Wang, Q. Ma, D. Shasha and C. H. Wu.
Application of Neural Networks to Biological Data Mining:
A Case Study in Protein Sequence Classification.
Proceedings of the 6th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, Boston, Massachusetts, 2000, pp. 305-309.
-
X. Wang and J. T. L. Wang.
Fast Similarity Search in Three-Dimensional Structure Databases.
Journal of Chemical Information and Computer Sciences, Vol. 40, 2000, pp. 442-451.
-
J. T. L. Wang, X. Wang, D. Shasha, B. A. Shapiro, K. Zhang, Q. Ma and Z. Weinberg.
An Approximate Search Engine for Structural Databases.
Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, 2000, pp. 584.
Books:
-
J. T. L. Wang, C. H. Wu and P. P. Wang, eds.
Computational Biology and Genome Informatics.
World Scientific, Singapore, 2003, ISBN 981-238-257-7.
-
D. Shasha and P. Bonnet.
Database Tuning: Principles, Experiments, and Troubleshooting Techniques.
Morgan Kaufmann, San Francisco, California, 2002, ISBN 1-55860-753-6.
-
G. Chang, M. J. Healey, J. A. M. McHugh and J. T. L. Wang.
Mining the World Wide Web: An Information Search Approach.
Kluwer Academic, Norwell, Massachusetts, 2001, ISBN 0-7923-7349-9.
Project Impact
As of today, users who have downloaded our graph or
tree comparison software or search tools have come
from academia and industry. Some sample affiliations include:
Arizona State University, Carnegie Mellon University,
Gateway, IBM, Indian Institute of Technology, McGill University,
Microsoft, MIT, National Cancer Institute, National University of
Singapore, Novartis Pharmaceuticals Corporation, University at
Buffalo, University of Calgary, University of California,
University of Michigan, agh.edu.pl, cse.unsw.edu.au,
gen-strategies.com, geometricsoftware.com, infoglide.com,
informatik.uni-tuebingen.de, jnana.com, law.kuleuven.ac.be,
legacy2web.com, lsi.upc.es, mcculloch.ing.unifi.it,
messagepharm.com, mrc-dunn.cam.ac.uk, scnsoft.com, si.ehu.es,
synopsys.com, syseca.thomson-csf.com, uea.ac.uk, uow.edu.au,
uq.edu.au, respectively. Our new tree searching algorithms
have been integrated into the TreeBase system at Harvard University.
This software allows the user to perform both structural and
keyword searches in an efficient manner. In summary,
we are providing a toolkit for discoveries at the frontiers,
for data analysis in science but also for commerce.
Goals, Objectives and Targeted
Activities
Our long-term goal is to provide scientists with efficient tools for
approximate searches in structural databases.
Our main applications and users so far have been in biochemistry and phylogenetics,
but we are currently exploring possible applications in XML document management.
In the coming year, we will focus on
(i) improving and parallelizing our current tools for tree and graph searching
so that they can be as fast as keyword search engines like google,
(ii) developing practically meaningful distance measures
on trees and graphs and approximate query processing algorithms to support inexact matching,
and (iii) developing a framework for turning searching to pattern discovery in trees and graphs.
Area Background
Our project is concerned with efficient search in structural databases.
Most of our work concentrates on combinatorial structures such as trees and graphs.
Typical questions are "Given a query structure and a set of combinatorial structures,
which combinatorial structure approximately contains the query structure?"
We have developed efficient algorithms to answer these questions
and are applying our techniques to phylogenetic trees and XML documents.
Potential Related Projects
Related projects on
XML databases and computational biology conducted in California, Michigan,
Pennsylvania, Texas, Washington, Wisconsin, etc. within the NSF IDM program can
be found at the web site
http://www.cise.nsf.gov/iis/.
Project Websites
http://web.njit.edu/~wangj/sigmod.html
(a general introduction to the project)
Illustrations
http://web.njit.edu/~wangj/SIGMOD/demoposter_2000.html
(a demo concerning 3D substructure search)
http://aria.njit.edu/biotool/search_index.html
(a demo of a search engine for phylogenetic trees)
Online Software
http://cs.nyu.edu/cs/faculty/shasha/papers/tree.html
(software for approximate tree comparison for ordered trees
where sibling order matters)
http://cs.nyu.edu/cs/faculty/shasha/papers/treesearch.html
(software for data structure setup for searching ordered and
unordered trees)
http://www.cs.nyu.edu/cs/faculty/shasha/papers/cousins.html
(software for finding frequent cousin patterns in unordered trees)
http://cs.nyu.edu/cs/faculty/shasha/papers/agm.html
(software for approximate graph comparison)
http://cs.nyu.edu/cs/faculty/shasha/papers/graphgrep
(software for graph searching)