Changing the Tide: Efficient Summarization Techniques for Massive Data

Jeffrey Jestes
University of Utah


Abstract

Given the recent explosion of data we have witnessed within the past few years, it has become apparent that many algorithms designed for these applications simply will not scale to huge amounts of data being generated. Due to the massive nature of modern data, it is often times infeasible for computers to efficiently manage and query it exactly. An attractive path for the future are data summarization techniques, which can be used to make data analytic tasks orders of magnitude more efficient while still allowing approximation guarantees on results. We argue in order to be useful, the summary must be efficient to construct and should be highly parallelizable. We present two such techniques which efficiently construct summaries over massive data; the first constructs wavelet histograms and the second constructs k-nearest neighbor graphs. Our techniques are demonstrated and evaluated in MapReduce, but are applicable for any parallel and distributed compute platforms. In addition to being able to efficiently construct summaries, we also motivate a vision for interactive and mergeable summaries as a direction for future research.