The Aurora Management Workbench (AMW): A Toolkit for Building and Managing Distributed Software Applications

Ren Yansong
Lucent Technologies


Abstract

The Aurora Management Workbench (AMW) provides a software framework and tools for building and managing reliable distributed software applications. A unique combination of software libraries and code generation tools provides reusable, customized "reliability components". These components are combined to implement many application-management functions, such as initialization and fault tolerance, that are traditionally hand-crafted from scratch each time a new distributed software application is developed. Based on a high-level description of an application's management needs, the AMW tools automatically piece together and correctly configure the appropriate reliability components. Application components simply "plug in" to the AMW framework to acquire the management functionality they need. Using AMW, application developers implement only a small fraction of the management software traditionally required for distributed software applications; AMW provides the rest.

While most documented efforts in fault-tolerant computing address recovery from failures that occur during normal system operation, an often-overlooked problem is dependably bringing a (distributed) system to the point where it can begin performing its duties. This is the task of initialization. Large-scale distributed systems may take hours to initialize. For such systems, a key challenge is tolerating failures that occur during initialization while still completing initialization in a timely manner. We have developed a dependable initialization model that captures the architecture of the system to be initialized, as well as the interdependencies among system components. We show that overall system initialization may sometimes complete more quickly if recovery actions are deferred rather than started as soon as a failure is detected.
This observation leads us to introduce a recovery decision function that dynamically assesses when to take recovery actions. Experimental results show that our algorithm incurs lower initialization overhead than a conventional initialization algorithm. To our knowledge, this work is the first to formally study the challenges of initializing a distributed system in the presence of failures.

In this talk, we first give a brief overview of the AMW framework and tools, and then discuss the dependable initialization problem and our solution in more detail.
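
To make the idea of a recovery decision function concrete, the following is a minimal sketch under a deliberately simple cost model. The model is an assumption for illustration only and is not the model or algorithm developed in this work. It assumes that recovering a failed component immediately interrupts other components that are mid-initialization, forcing them to start over once recovery completes, and that components depending on the failed one can only start after it has been recovered. All names (FailureSnapshot, time_if_immediate, time_if_deferred, should_defer) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureSnapshot:
    """Hypothetical view of an initialization at the moment a failure is detected."""
    elapsed: float              # init work other components have already done
    remaining: float            # init work those components still have to do
    recovery_time: float        # time needed to recover the failed component
    dependent_init_time: float  # init time of components that depend on the failed one


def time_if_immediate(s: FailureSnapshot) -> float:
    # Assumption: recovery starts now and interrupts in-progress work.
    # Afterwards, the interrupted components restart from scratch in parallel
    # with the dependents of the recovered component.
    return s.recovery_time + max(s.elapsed + s.remaining, s.dependent_init_time)


def time_if_deferred(s: FailureSnapshot) -> float:
    # Assumption: in-progress work is allowed to finish first, then the failed
    # component is recovered, then its dependents initialize.
    return s.remaining + s.recovery_time + s.dependent_init_time


def should_defer(s: FailureSnapshot) -> bool:
    # Toy decision function: defer only if that is estimated to finish sooner.
    return time_if_deferred(s) < time_if_immediate(s)


if __name__ == "__main__":
    late_failure = FailureSnapshot(elapsed=50, remaining=10,
                                   recovery_time=5, dependent_init_time=20)
    early_failure = FailureSnapshot(elapsed=5, remaining=55,
                                    recovery_time=5, dependent_init_time=20)
    for name, snap in [("late", late_failure), ("early", early_failure)]:
        print(name, time_if_immediate(snap), time_if_deferred(snap), should_defer(snap))
```

In this toy model, deferral wins exactly when the work that immediate recovery would discard (elapsed) exceeds the initialization time of the blocked dependents. A failure detected late in initialization therefore favors deferral, while one detected early favors immediate recovery, which is consistent with the observation that deferral helps only sometimes.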