IDM Workshop Template

ASES: An Approximate Search Engine for Structure

IIS-9988345 (PI: Dennis Shasha) and IIS-9988636 (PI: Jason T. L. Wang)

Dennis Shasha, Professor
Courant Institute of Mathematical Sciences
New York University
New York, New York 10012
Phone: (212) 998-3086
Fax: (212) 995-4123
E-mail: shasha@cs.nyu.edu
URL: http://cs.nyu.edu/cs/faculty/shasha

Jason T. L. Wang, Professor
College of Computing Sciences
New Jersey Institute of Technology
Newark, New Jersey 07102
Phone: (973) 596-3396
Fax: (973) 596-5777
E-mail: wangj@njit.edu
URL: http://web.njit.edu/~wangj

Keywords

Search methods and software, structural data and XML documents, biology and chemistry, indexing techniques and query languages, pattern matching and recognition, scientific data mining

Project Summary

The goal of this research project is to make it possible to process approximate queries on combinatorial structures such as trees and graphs in an efficient manner. The ultimate goal is to be as fast as approximate keyword search in text documents. The approach consists of: (1) discovering the best heuristics for answering these NP-complete problems in polynomial time with high quality, and (2) discovering data structures to make queries on thousands or millions of such objects fast. The results of this project will provide techniques and software to search for patterns among graph and tree structured data. Possible applications of such a search engine include searches among proteins, chemical compounds, neuroanatomical structures, Web/text filters and XML documents. The techniques developed in this research are in particular suitable for bioinformatics and Web applications.

Publications and Products

Papers:

C. Di Pietro, A. Ferro, T. Maugeri, G. Pigola, A. Pulvirenti, D. Shasha, V. Di Pietro, G. Emmanuele, E. Modica, M. Purrello, M. Ragusa, M. Scalia, S. Travali and V. Zimmitti. ANTICLUSTAL: Multiple Sequence Alignment by Antipole Clustering and Linear Approximate 1-Median Selection. Proceedings of the IEEE Computational Systems Bioinformatics Conference, Stanford, California, 2003.
Y. Zhu and D. Shasha. Efficient Elastic Burst Detection in Data Streams. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
A. Lerner and D. Shasha. AQuery: Query Language for Ordered Data, Optimization Techniques, and Experiments. Proceedings of the International Conference on Very Large Data Bases, 2003.
J. T. L. Wang, H. Shan, D. Shasha and W. H. Piel. TreeRank: A Similarity Measure for Nearest Neighbor Searching in Phylogenetic Databases. Proceedings of the 15th International Conference on Scientific and Statistical Database Management, 2003, pp. 171-180.
Y. Zhu and D. Shasha. Warping Indexes with Envelope Transforms for Query by Humming. Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, California, 2003, pp. 181-192.
Y. Zhu and D. Shasha. Fast Approaches to Simple Problems in Financial Time Series Streams. Proceedings of the ACM SIGMOD Workshop on Management and Processing of Data Streams, San Diego, California, 2003.
X. Wang, J. T. L. Wang, D. Shasha, B. A. Shapiro, I. Rigoutsos and K. Zhang. Finding Patterns in Three Dimensional Graphs: Algorithms and Applications to Scientific Data Mining. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 4, 2002, pp. 731-749.
S. Zhang, J. T. L. Wang and K. G. Herbert. XML Query by Example. International Journal of Computational Intelligence and Applications, Vol. 2, No. 3, 2002, pp. 329-337.
J. T. L. Wang, K. Zhang, G. Chang and D. Shasha. Finding Approximate Patterns in Undirected Acyclic Graphs. Pattern Recognition, Vol. 35, No. 2, 2002, pp. 473-483.
M. M. Yin and J. T. L. Wang. Mining Genes in DNA Using GeneScout. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, 2002, pp. 733-736.
Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. Proceedings of the International Conference on Very Large Data Bases, Hong Kong, China, 2002, pp. 358-369.
S. Zhang, L. Liao, J.-F. Tomb and J. T. L. Wang. Clustering and Classifying Enzymes in Metabolic Pathways: Some Preliminary Results. Proceedings of the 2nd ACM SIGKDD Workshop on Data Mining in Bioinformatics, Edmonton, Alberta, Canada, 2002, pp. 19-24.
D. Shasha, J. T. L. Wang, H. Shan and K. Zhang. ATreeGrep: Approximate Searching in Unordered Trees. Proceedings of the 14th International Conference on Scientific and Statistical Database Management, Edinburgh, Scotland, 2002, pp. 89-98.
H. Shan, K. G. Herbert, W. Piel, D. Shasha and J. T. L. Wang. A Structure-Based Search Engine for Phylogenetic Databases. Proceedings of the 14th International Conference on Scientific and Statistical Database Management, Edinburgh, Scotland, 2002, pp. 7-10.
D. Shasha, J. T. L. Wang and R. Giugno. Algorithmics and Applications of Tree and Graph Searching. Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Madison, Wisconsin, 2002, pp. 39-52.
Q. Ma, J. T. L. Wang and J. R. Gattiker. Mining Biomolecular Data Using Background Knowledge and Artificial Neural Networks. In Handbook of Massive Data Sets, (eds. J. Abello, P. M. Pardalos and M. G. C. Resende), Chapter 30, Kluwer Academic, 2002, pp. 1141-1168.
M. M. Yin and J. T. L. Wang. Effective Hidden Markov Models for Detecting Splicing Junction Sites in DNA Sequences. Information Sciences, Vol. 139, 2001, pp. 139-163.
D. Shasha, A. Kouranov, L. Lejay, M. Chou and G. Corruzi. Using Combinatorial Design to Study Regulation by Multiple Input Signals. A Tool for Parsimony in the Post-Genomics Era. Plant Physiology, Vol. 127, No. 4, 2001, pp. 1590-1594.
K. Birnbaum, P. N. Benfey and D. E. Shasha. cis Element/Transcription Factor Analysis (cis/TF): A Method for Discovering Transcription Factor/cis Element Relationships. Genome Res., Vol. 11, 2001, pp. 1567-1573.
Q. Ma, J. T. L. Wang, D. Shasha and C. H. Wu. DNA Sequence Classification via an Expectation Maximization Algorithm and Neural Networks: A Case Study. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 31, No. 4, 2001, pp. 468-475.
J. T. L. Wang, Q. Ma, D. Shasha and C. H. Wu. New Techniques for Extracting Features from Protein Sequences. IBM Systems Journal, Special Issue on Deep Computing for Life Sciences, Vol. 40, No. 2, 2001, pp. 426-441.
M. Cochinwala, V. Kurien, G. Lalk and D. Shasha. Efficient Data Reconciliation. Information Sciences, Vol. 137, 2001, pp. 1-15.
J. T. L. Wang and K. Zhang. Finding Similar Consensus between Trees: An Algorithm and a Distance Hierarchy. Pattern Recognition, Vol. 34, No. 1, 2001, pp. 127-137.
J. T. L. Wang, Q. Ma and K. G. Herbert. Software Engineering and Knowledge Engineering Issues in Bioinformatics. In Handbook of Software Engineering and Knowledge Engineering, Vol. 1, Fundamentals, (ed. S. K. Chang), Chapter 30, World Scientific, 2001, pp. 719-732.
X. Wang, J. T. L. Wang, K.-I. Lin, D. Shasha, B. A. Shapiro and K. Zhang. An Index Structure for Data Mining and Clustering. Knowledge and Information Systems, Vol. 2, 2000, pp. 161-184.
J. T. L. Wang and K. Zhang. Identifying Consensus of Trees through Alignment. Information Sciences, Vol. 126, 2000, pp. 165-189.
J. T. L. Wang, Q. Ma, D. Shasha and C. H. Wu. Application of Neural Networks to Biological Data Mining: A Case Study in Protein Sequence Classification. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, 2000, pp. 305-309.
X. Wang and J. T. L. Wang. Fast Similarity Search in Three-Dimensional Structure Databases. Journal of Chemical Information and Computer Sciences, Vol. 40, 2000, pp. 442-451.
J. T. L. Wang, X. Wang, D. Shasha, B. A. Shapiro, K. Zhang, Q. Ma and Z. Weinberg. An Approximate Search Engine for Structural Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, 2000, pp. 584.

Books:

J. T. L. Wang, C. H. Wu and P. P. Wang, eds. Computational Biology and Genome Informatics. World Scientific, Singapore, 2003, ISBN 981-238-257-7.
D. Shasha and P. Bonnet. Database Tuning: Principles, Experiments, and Troubleshooting Techniques. Morgan Kaufmann, San Francisco, California, 2002, ISBN 1-55860-753-6.
G. Chang, M. J. Healey, J. A. M. McHugh and J. T. L. Wang. Mining the World Wide Web: An Information Search Approach. Kluwer Academic, Norwell, Massachusetts, 2001, ISBN 0-7923-7349-9.

Project Impact

As of today, users who have downloaded our graph or tree comparison software or search tools have come from academia and industry. Some sample affiliations include: Arizona State University, Carnegie Mellon University, Gateway, IBM, Indian Institute of Technology, McGill University, Microsoft, MIT, National Cancer Institute, National University of Singapore, Novartis Pharmaceuticals Corporation, University at Buffalo, University of Calgary, University of California, University of Michigan, agh.edu.pl, cse.unsw.edu.au, gen-strategies.com, geometricsoftware.com, infoglide.com, informatik.uni-tuebingen.de, jnana.com, law.kuleuven.ac.be, legacy2web.com, lsi.upc.es, mcculloch.ing.unifi.it, messagepharm.com, mrc-dunn.cam.ac.uk, scnsoft.com, si.ehu.es, synopsys.com, syseca.thomson-csf.com, uea.ac.uk, uow.edu.au, uq.edu.au, respectively. Our new tree searching algorithms have been integrated into the TreeBase system at Harvard University. This software allows the user to perform both structural and keyword searches in an efficient manner. In summary, we are providing a toolkit for discoveries at the frontiers, for data analysis in science but also for commerce.

Goals, Objectives and Targeted Activities

Our long-term goal is to provide scientists with efficient tools for approximate searches in structural databases. Our main applications and users so far have been in biochemistry and phylogenetics, but we are currently exploring possible applications in XML document management. In the coming year, we will focus on (i) improving and parallelizing our current tools for tree and graph searching so that they can be as fast as keyword search engines like google, (ii) developing practically meaningful distance measures on trees and graphs and approximate query processing algorithms to support inexact matching, and (iii) developing a framework for turning searching to pattern discovery in trees and graphs.

Area Background

Our project is concerned with efficient search in structural databases. Most of our work concentrates on combinatorial structures such as trees and graphs. Typical questions are "Given a query structure and a set of combinatorial structures, which combinatorial structure approximately contains the query structure?" We have developed efficient algorithms to answer these questions and are applying our techniques to phylogenetic trees and XML documents.

Area References

Related websites include http://dtp.nci.nih.gov/docs/3d_database/dis3d.html (The National Cancer Institute 3D Structure Database), http://www.rcsb.org/pdb/ (The Protein Data Bank), http://www.ncbi.nlm.nih.gov/Structure (The Structure Group at the National Center for Biotechnology Information), http://www.treebase.org/treebase/index.html (TreeBase), and http://www.w3.org/TR/xpath (XPath).

Potential Related Projects

Related projects on XML databases and computational biology conducted in California, Michigan, Pennsylvania, Texas, Washington, Wisconsin, etc. within the NSF IDM program can be found at the web site http://www.cise.nsf.gov/iis/.

Project Websites

http://web.njit.edu/~wangj/sigmod.html (a general introduction to the project)

Illustrations

http://web.njit.edu/~wangj/SIGMOD/demoposter_2000.html (a demo concerning 3D substructure search)
http://aria.njit.edu/biotool/search_index.html (a demo of a search engine for phylogenetic trees)

Online Software

http://cs.nyu.edu/cs/faculty/shasha/papers/tree.html (software for approximate tree comparison for ordered trees where sibling order matters)
http://cs.nyu.edu/cs/faculty/shasha/papers/treesearch.html (software for data structure setup for searching ordered and unordered trees)
http://www.cs.nyu.edu/cs/faculty/shasha/papers/cousins.html (software for finding frequent cousin patterns in unordered trees)
http://cs.nyu.edu/cs/faculty/shasha/papers/agm.html (software for approximate graph comparison)
http://cs.nyu.edu/cs/faculty/shasha/papers/graphgrep (software for graph searching)