Querying Incomplete and Inconsistent Web Databases

As more and more information from autonomous databases becomes available to lay users, integrating and querying these databases must adapt to deal with the imprecise nature of user queries as well as incompleteness and inconsistency in the data due to missing attribute values (aka "null values") and dirty values. In such scenarios, the query processor begins to acquire the role of a recommender system. Specifically, in addition to presenting answers which satisfy the user's query, the query processor is expected to provide highly relevant answers even though they do not exactly satisfy the query predicates.

This broadened view of query processing and autonomous nature of web databases pose many new challenges:

  • How to measure result relevance as a combined effect of query imprecision and data incompleteness and inconsistency?

  • How to win user's trust to the query results and their ranks generated by the mediator?

  • How to infer the relevance of an incomplete tuple with respect to a user query given the restricted data access privileges imposed on web data sources and the limited support for query patterns?

  • How to detect dirty values in the data and infer their relevance to a user query?

  • How to achieve efficiency given the bounded pool of database and network resources in the web environment?

 

 

To tackle these challenges, we have developed a suite of techniques as outlined below.

  • We introduce a novel query rewriting and optimization framework that retrieves relevant possible tuples with missing values or dirty values on the query predicates without modifying the web databases. Our technique involves reformulating the user query based on mined correlations among the database attributes.

  • We develop techniques to gauge the relevance of the rewritten queries allowing tradeoffs in reducing the costs of database query processing and answer transmission.

  • We develop techniques to infer the actual values from missing values. This involves our proposed methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naive Bayes Classifiers), and selectivity estimates.

  • We develop techniques to detect and infer dirty values by building data generative models and error models.

  • We propose a decision theoretic model for ranking answers in the in the order of their expected relevance to the user. This model combines a relevance function that reflects the relevance a user would associate with answer tuples and a density function which reflects the each tuple's distribution of missing data.

 

 

 

Faculty:

            Yi Chen < yi.chen at njit dot edu >

            Subbarao Kambhampati < rao at asu dot edu >

Students:

             Aravind Kalavagattu

             Bhaumik Chokshi

             Garrett Wolf

             Hemal Khatri1

             Jianchun Fan

             Raju Balakrishnan