Data Quality Management and Mining

Laure Berti-Équille
AT&T Labs Research


Data quality problems occur frequently and propagate easily through every database and warehousing system, affecting every application domain and decision-making process. Almost all real-life datasets contain missing, duplicate, out-of-range, inconsistent, or incorrect values. As a consequence, any data management, integration, or KDD task requires intensive and complex data preparation and cleaning to avoid misleading, incorrect, and biased results. However, current data preparation and cleaning practices are usually one-shot, ad hoc, rule-based, and programmatic. Most importantly, they usually focus on a single type of data glitch in isolation. In this talk, we show that data glitches of different types occur concomitantly in real-life datasets. Due to the structure of the processes that generate these data, the glitches have complex, multivariate interactions and can mask one another. The focus of the talk is thus to introduce data quality management and mining, along with recent techniques and methods for measuring and improving the quality of data in various application domains.