Automated Testing and Debugging for Data-centric Software

Muhammad Ali Gulzar
University of California, Los Angeles


Abstract

Data-intensive scalable computing (DISC) systems such as MapReduce, Google FlumeJava, and Apache Spark are commonly used today to process terabytes of data. At this scale, rare, buggy corner cases frequently show up in production, causing a job to crash after running for days or, worse, to silently produce corrupted output. Unfortunately, in this domain, “testing on a random sample” rarely guarantees reliability, and “printf” debugging methods are expensive. In this talk, I will describe the insights behind techniques that make automated debugging and testing feasible for data-centric software. First, I will present BigDebug and BigSift, which redesign interactive and automated debugging primitives for data-centric software. I will show how we leverage ideas from systems and database research to cut debugging time in half and perform precise root-cause analysis in a fraction of the job execution time. Second, I will discuss BigTest, which systematically explores dataflow program paths and automatically generates test data that is orders of magnitude smaller yet several times more effective at revealing critical bugs. Finally, I will conclude with a broader vision of designing productivity toolkits to support the growing needs of data-centric software in ML, AI, and data science.