The 2nd International Workshop on Data Reduction for Big Scientific Data (DRBSD-2)

 

Abstract

A growing disparity between simulation speeds and I/O rates makes it increasingly infeasible for high-performance applications to save all results for offline analysis. By 2024, computers are expected to compute at 10^18 ops/sec but write to disk only at 10^12 bytes/sec: a compute-to-output ratio 200 times worse than on the first petascale systems. In this new world, applications must increasingly perform online data analysis and reduction, tasks that introduce algorithmic, implementation, and programming model challenges that are unfamiliar to many scientists and that have major implications for the design of various elements of exascale systems.
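The ratio quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it reads the quoted rates as 10^18 ops/sec and 10^12 bytes/sec, and it assumes a petascale compute rate of 10^15 ops/sec, which the text does not state. The variable names are hypothetical.

```python
# Figures quoted in the abstract for a projected 2024-era system.
exa_compute_ops_per_s = 10**18   # operations per second
exa_io_bytes_per_s = 10**12      # bytes written to disk per second

# Compute-to-output ratio: operations performed per byte that can be saved.
exa_ratio = exa_compute_ops_per_s // exa_io_bytes_per_s

# The abstract calls this 200 times worse than on the first petascale
# systems, so the implied petascale ratio is 200 times smaller.
peta_ratio = exa_ratio // 200

# Assumption (not stated in the source): a petascale system computes
# at 10**15 ops/s. The I/O rate below follows from that assumption.
peta_compute_ops_per_s = 10**15
implied_peta_io_bytes_per_s = peta_compute_ops_per_s // peta_ratio

print("exascale ops per byte written:", exa_ratio)
print("implied petascale ops per byte written:", peta_ratio)
print("implied petascale I/O rate (bytes/s):", implied_peta_io_bytes_per_s)
```

Under these assumptions, an exascale application performs about a million operations for every byte it can write, against a few thousand on an early petascale system.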

This trend has spurred interest in high-performance online data analysis and reduction methods, motivated by a desire to conserve I/O bandwidth, storage, and/or power; increase accuracy of data analysis results; and/or make optimal use of parallel platforms, among other factors. This requires our community to understand the clear yet complex relationships between application design, data analysis and reduction methods, programming models, system software, hardware, and other elements of a next-generation high-performance computer, particularly given constraints such as applicability, fidelity, performance portability, and power efficiency.

There are at least three important questions that our community is striving to answer: (1) whether several orders of magnitude of data reduction is possible for exascale sciences; (2) how performance and accuracy trade off in data reduction; and (3) how to effectively reduce data while preserving the information hidden in large scientific data. Tackling these challenges requires expertise from computer science, mathematics, and application domains to study the problem holistically and to develop solutions and hardened software tools that can be used by production applications.

The goal of this workshop is to provide a focused venue for researchers in all aspects of data reduction and analysis to present their research results, exchange ideas, identify new research directions, and foster new collaborations within the community.

Topics of interest include but are not limited to:

• Application use-cases which can drive the community to develop MiniApps

• Data reduction methods for scientific data including:

  • Data deduplication methods
  • Motif-specific methods (structured and unstructured meshes, particles, tensors, …)
  • Optimal design of data reduction methods
  • Methods with accuracy guarantees

• Metrics to measure reduction quality and provide feedback

• Data analysis and visualization techniques that take advantage of the reduced data

• Hardware and data co-design

• Accuracy and performance trade-offs on current and emerging hardware

• New programming models for managing reduced data

• Runtime systems for data reduction
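
As an illustration of what a method "with accuracy guarantees" and a reduction-quality metric can look like, the following minimal Python sketch applies uniform scalar quantization with a user-chosen absolute error bound and then measures the maximum pointwise error. It is a generic textbook technique, not the method of any paper presented at the workshop. The function names (`quantize`, `reconstruct`, `max_abs_error`) and the sample data are hypothetical.

```python
def quantize(values, abs_error_bound):
    """Map each float to the index of a bin of width 2 * abs_error_bound."""
    step = 2.0 * abs_error_bound
    return [round(v / step) for v in values]


def reconstruct(codes, abs_error_bound):
    """Recover approximate values as the centres of the quantization bins."""
    step = 2.0 * abs_error_bound
    return [c * step for c in codes]


def max_abs_error(original, restored):
    """Quality metric: largest pointwise absolute error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical sample data and error bound.
data = [0.0, 0.13, 1.27, -3.14159, 2.71828]
error_bound = 0.01

codes = quantize(data, error_bound)          # small integers, cheap to encode
restored = reconstruct(codes, error_bound)
observed_error = max_abs_error(data, restored)

# Each value lies within half a bin width of its bin centre, so the
# error stays within the bound (up to floating-point rounding).
print("codes:", codes)
print("max abs error:", observed_error)
```

The integer codes would normally be passed to a lossless entropy coder. The point of the sketch is that the error bound is fixed before compression and can be verified afterwards.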

Keynote Talks

1. SKA: The Data Domino Enabled by DALiuGE, Andreas Wicenec, the University of Western Australia

Abstract:

The Square Kilometre Array (SKA) will pose interesting new challenges for the way scientific computing is carried out. The processing will require connecting the antenna arrays in South Africa and Australia to dedicated 200 PF-scale HPC centres over WAN connections of some 700 km. Part of the on-line calibration and transient detection will be carried out at a (sub-)second cadence on data streams of about 1 TB/s. Further processing will first collect all the data from a single 6-12 hour observation and then perform an iterative image reconstruction and ‘cleaning’ on that data set. With current algorithms the bottleneck seems to be memory bandwidth, but in addition the level of data parallelism and inherent concurrency reaches extreme levels, with several tens of millions of tasks and data items to be scheduled and managed during a single image reconstruction run. The design of the SKA processing system thus includes an execution framework detailing the baseline concepts of an architecture that enables processing at SKA scale. Along with working on the architecture and detailed design of this execution framework, we have also implemented a prototype to prove the viability of the proposed design decisions and extract the actual requirements for the ‘final’, operational execution framework system. This agile process naturally exposed a number of potential candidate frameworks, technologies and concepts that are well established in the Big Data and HPC communities. We have carefully analysed these candidate technologies, but deliberately stayed independent of any of the complete frameworks in order to arrive at a ‘vendor’-neutral design and set of requirements. The result of the prototyping work is called DALiuGE, which stands for ‘Data Activated Flow Graph Engine’. DALiuGE implements most of the concepts required to perform the various radio astronomy workflows, while almost completely avoiding unnecessary features. While fully driven by radio astronomy, DALiuGE is still completely generic and can be adapted to any similar workflow problem. This talk will highlight the key concepts and solutions of DALiuGE and also present the results of test runs at scale.

Short Bio:

Andreas Wicenec has been a Professor at the University of Western Australia since 2010, leading the Data Intensive Astronomy Program of the International Centre for Radio Astronomy Research, which designs and implements data flows and high-performance scientific computing for large-scale astronomical facilities and surveys. During his career he has had the privilege of being involved in the software development, data management and reduction, and operation of several large-scale astronomical facilities, including the ESA cornerstone HIPPARCOS satellite, the Very Large Telescope (VLT) and the Atacama Large Millimetre and Submillimetre Array (ALMA) in Chile, the Murchison Widefield Array (MWA), the Five-hundred-metre Aperture Spherical Telescope (FAST) and the Square Kilometre Array (SKA). Prof. Wicenec is also involved in the International Virtual Observatory Alliance (IVOA). His scientific interests in astronomy include precision global astrometry, optical background radiation, stellar photometry, dynamics and evolution of planetary nebulae, and observational survey astronomy. In computer science he conducts research in workflow construction and execution as well as scheduling and the related computational concepts.

2. Facing the Big Data Challenge in the Fusion Code XGC, CS Chang, Princeton Plasma Physics Laboratory

Abstract:

The boundary plasma of a magnetic fusion reactor is far from thermodynamic equilibrium, with the physics dominated by nonlinear multiscale multiphysics interactions in a complicated geometry, and requires extreme-scale computing for a first-principles-based understanding. The modern scalable particle-in-cell code XGC has been developed for this purpose, in partnership with the computer science and applied mathematics communities over the last decade. The bigger the computer, the more complete the physics that XGC can contain. XGC’s extreme-scale capability has been recognized by its being awarded a few hundred million hours of computing time on all US leadership-class computers, and by its selection into all three pre-exascale or exascale programs: CAAR at OLCF, NESAP at NERSC, and AURORA ESP at ALCF. The physics data produced by a 1-day XGC run of ITER plasma on a present ~20 PF computer amount to ~100 PB, well above the limit imposed by present technology. We are losing most of the valuable physics data in order to keep the data flow within the limits imposed by the I/O rate and the file system size. Since the problem size will increase in proportion to the parallel computer capability, the challenge will grow at least 100-fold as exascale computers arrive. The data size must be reduced by several orders of magnitude while still preserving enough accuracy to enable various levels of scientific discovery. On-the-fly in-memory data analysis and visualization must occur at the same time. These issues, as well as the necessity to collaborate tightly with the applied mathematics and computer science communities, will be discussed from the application driver point of view.

Short Bio:

C.S. Chang has extensive experience in successfully leading large-scale, multi-institutional, multi-disciplinary teams composed of fusion energy scientists, computer scientists, and applied mathematicians, including the Proto-Type Fusion Simulation Project for Plasma Edge Simulation, the SciDAC-2 Center for Plasma Edge Simulation (CPES), the SciDAC-3 Center for Edge Plasma Simulation (EPSI), and the new SciDAC-4 Partnership Center for High-fidelity Boundary Plasma Simulation (XBP). C.S. Chang is a Fellow of the American Physical Society and has served in many national and international leadership roles, including chairing the recent DOE ASCR/FES Exascale Requirement Review activities. He has given numerous invited and plenary talks, keynote speeches, and tutorial lectures at major international conferences, and has supervised more than 20 Ph.D. dissertations.

 

Tentative Workshop Agenda

8:30 Welcome and opening remarks

8:30 - 9:10 Keynote Talk

SKA: The Data Domino Enabled by DALiuGE, Andreas Wicenec, the University of Western Australia

9:10 - 10:10 Papers (20 mins each)

Sheng Di, Dingwen Tao and Franck Cappello. An Efficient Approach to Lossy Compression with Pointwise Relative Error Bound

Benjamin Welton and Barton Miller. Data Reduction and Partitioning in an Extreme Scale GPU-Based Clustering Algorithm

Mark Ainsworth, Ozan Tugluk and Ben Whitney. MGARD: A Multilevel Technique for Compression of Floating-Point Data

 

10:10 - 10:20 Break

10:20 - 11:00 Keynote Talk

Facing the Big Data Challenge in the Fusion Code XGC, CS Chang, Princeton Plasma Physics Laboratory

11:00 - 12:00 Papers (20 mins each)

Swati Singhal and Alan Sussman. Adaptive Compression to Improve I/O Performance for Climate Simulations

Guénolé Harel, Jacques-Bernard Lekien and Philippe Pébaÿ. Lean Visualization of Large Scale Tree-Based AMR Meshes

Kenny Gruchalla, Nicholas Brunhart-Lupo, Kristin Potter and John Clyne. Contextual Compression of Large-Scale Wind Turbine Array Simulations

 

Organizing Committee

Scott Klasky, Oak Ridge National Laboratory

Gary Liu, New Jersey Institute of Technology

Mark Ainsworth, Brown University/Oak Ridge National Laboratory

Ian Foster, Argonne National Laboratory/University of Chicago

 

Technical Program Committee

Franck Cappello, Argonne National Laboratory

Peter Lindstrom, Lawrence Livermore National Laboratory

Todd Munson, Argonne National Laboratory

Kerstin Van Dam, Brookhaven National Laboratory

George Ostrouchov, Oak Ridge National Laboratory

Scott Klasky, Oak Ridge National Laboratory

Mark Ainsworth, Brown University/Oak Ridge National Laboratory

John Wu, Lawrence Berkeley National Laboratory

Eric Suchyta, Oak Ridge National Laboratory

Martin Burtscher, Texas State University

 

Call for Papers

The 2nd International Workshop on Data Reduction for Big Scientific Data (DRBSD-2)

in Conjunction with SC’17

Nov 17th, 2017

Denver, CO

 

https://web.njit.edu/~qliu/drbsd2.html

As the speed gap between compute and storage continues to widen, increasing data volume and velocity pose major challenges for big data applications in terms of storage and analysis. This demands new research and software tools that can further reduce data by several orders of magnitude, taking advantage of new architectures and hardware available on next-generation systems. This international workshop on data reduction is a response to this renewed research direction and will provide a focused venue for researchers in this area to present their research results, exchange ideas, identify new research directions, and foster new collaborations within the community.

 

Important Dates

Paper Deadline: September 15th, 2017 (AoE)

Author Notification: September 30th, 2017

 

Submissions

Papers should be submitted electronically on Easychair (https://easychair.org/conferences/?conf=drbsd2).

• Papers must be in IEEE format.

http://www.ieee.org/conferences_events/conferences/publishing/templates.html

• Papers must not exceed 5 pages, excluding references.

Submitted papers will be evaluated by at least 3 reviewers on technical merit.