XSEEK : Intelligent Search Engine for Semi-Structured Data

Information search is an indispensable component of our lives. Web search engines, such as Google, Yahoo! and Bing, are widely used for searching textual documents, images, and video. However, there are also vast collections of structured and semi-structured data both on the Web and in enterprises, such as relational databases, XML data, etc. The classical way of accessing these data sources is through issuing structured queries, such as SQL/XPath/XQuery. However, this demands users to learn these query languages and comprehend the possibly complex and fast-evolving data schema, which is inconvenient or impossible for users in many applications. To relieve web and scientific users from the learning curve and enable them to easily access (semi-)structured and semi-structured data, supporting keyword search on such data is highly desirable.


We have been identifying a spectrum of problem space in the domain of supporting keyword search on semi-structured data, ranging from evaluation framework of various search strategies, generating high-quality results, helping users to analyze results, user behavior analysis, to scalability using parallel processing framework.

We have been developing techniques to address these problems for users to achieve better search quality and enhanced search experience than searching unstructured data (e.g. web pages), by exploiting the rich meta-information embedded in (semi-) structured data, as outlined below. In addition to developing techniques to address the above challenges of processing keyword queries on tree-structured and graph-structured data in general, we have adapted and extended the techniques for processing a wide variety of semi-structured data, including workflows, text-rich semi-structured data such as online forums, data with incomplete or dirty information, data with varying degree of trustworthy levels, as well as archived data with temporal information.  

  • Initiating an axiomatic framework to evaluate the quality of XML search engines (pdf).  Using axioms for quality evaluation is a cost-effective and easy to use complement to the traditional empirical evaluation through user-study which often involves a long time and extensive work from domain experts and users.

  • Identifying explicit relevant nodes (among all nodes that match keywords) (pdf) and implicit relevant nodes (among all nodes that do not match keywords) (pdf, pdf) for keyword search on XML. This is the first work on identifying implicit relevant nodes in structured data search.

  • Composing ranking-friendly results from explicit and implicit relevant nodes for keyword search on XML trees (pdf, pdf) and on graph data (link). We propose a result composition approach driven by inferring user search targets using the Information Theory. We propose that each result should be atomic and intact: having exactly one target instance along with all associated evidence, so that ranking and top-k query processing can be meaningfully based on target instances.

  • Initiating the first study of processing keyword searches on workflow hierarchies (pdf, pdf). We define an informative, self-contained and concise search result on workflows to be a projection of a workflow hierarchy on a two dimensional viewing plane inferred from user queries. We then design and develop an efficient keyword search engine for workflows. Experimental evaluation demonstrates the effectiveness of our approach.

  • Initiating the first study of differentiating search results on structured (relational or XML) data (pdf, pdf, link). Due to the absence of general tools that can effectively analyze and differentiate multiple results, a user has to manually read and comprehend potentially large results in an exploratory search. We proposed an efficient and effective way that helps users for result comparison and differentiation.

  • Initiating the first study of generating query-biased snippets for keyword search on XML data (pdf, pdf, link)

  • Initiating the first study of query-dependent result clustering on structured data. We propose algorithms for automatically generating a user-specified number of describable clusters given a set of query results (link). Furthermore, we developed query expansion techniques to assist users to issue queries based on clustered results (pdf).

  • Initiating the first study of answering keyword queries on XML data using materialized views (link, pdf). Materialized views, or generally, semantic caching, have been proved successful for performance optimization in evaluating structured queries on XML and databases. We investigated the feasibility and present a general framework for answering XML keyword searches using materialized views.

  • Investigating techniques for supporting keyword search on dirty web relational databases (pdf, pdf, link). For web data, a search engine sometimes may serve as a mediator of multiple web databases. Due to volume and velocity of data or access restrictions, it is impractical to create a local copy of the data and clean it offline. We studied the problem of how to rewrite a user keyword query to queries that can be answered by individual web databases, whose query results are expected to be highly relevant to the original user query taken into consideration of dirty data. To the best of our knowledge, our work is the first to study this problem.

  • Investigating techniques for processing keyword queries on large-scale data set. Existing research for keyword search on semi-structured focus on small data sets which can fit in memory. We are investigating how to leverage parallel processing on multiple computers to handle large data sets (pdf).

  • Investigating techniques for supporting keyword searches on temporal graphs. Archiving graph data over history is demanded in many applications, such as social network studies, collaborative projects, scientific graph databases, and bibliographies. Typically people are interested in querying temporal graphs beyond merely retrieving a snapshot. We addressed an open question of supporting keyword-based queries on temporal graph (link).





            Yi Chen < yi.chen at njit dot edu >

PhD Students:

             Mingda Li < ml456 at njit dot edu >


             Chong Wang, Yi Shan, Yunzhong Liu, Norman Hamilton, Brandon Ruggles, Ahmed Youssef, Nicholas Devlin, Brian Ackerman,Doug Stoeckmann, Stephen Booher, Yichuan Cai, Ziyang Liu, Tim Meehan, SivaramaKrishnan Natarajan, Peng Sun, Jeffrey Walker, Arthur Maciejewicz, Mike Citro, Ghislain Youdom, Viraj Bhalala.


This project is supported by NSF CAREER Award 1322406, a Google Research Award and a Google Award for Google Cloud Platform Credit.