Data Integration with Uncertainty

Dr. Luna Dong
AT&T Research


Abstract

Many data management applications, such as managing enterprise data, scientific data, and personal data and integrating data on the web, need to manage a multitude of data sources. These sources can be highly heterogeneous, describing the same domain using different schemas. To enable data sharing across heterogeneous sources, data integration systems specify a mediated schema, which provides an integrated, virtual view of the disparate sources, and build schema mappings from the source schemas to the mediated schema. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort and expertise to create the mediated schema and the semantic mappings between the schemas. We posit that data integration systems need to handle uncertainty about the semantics of data, and to do so in a principled fashion. Such uncertainty can arise because there are too many schema mappings to create and maintain, or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. For this purpose, we propose the new concepts of probabilistic schema mappings, probabilistic mediated schemas, and probabilistic functional dependencies. We analyze their formal foundations and describe how to create them automatically from data sources and use them in answering user queries. Based on these concepts, we have built the first completely self-configuring data integration system. Our experiments show that the system can produce high-quality answers with no human intervention.