Filecules and Small Worlds in Scientific Communities: Characteristics and Significance

Dr. Adriana Iamnitchi, Assistant Professor
Department of Computer Science and Engineering, University of South Florida


Abstract

Most of today's science depends on processing massive amounts of data in multi-institutional and even international collaborations. Grid computing focuses on enabling resource sharing for wide-area collaborations and has now reached the stage where deployments are mature and many collaborations run in production mode. As with any growing technology, Grid usage characteristics, which inherently affect performance, could not have been predicted before or during design and implementation. This lack of evidence about usage characteristics has three significant consequences: (1) resource management solutions are evaluated on irrelevant traces; (2) quantitative comparison of alternative solutions to the same problem becomes impossible due to different experimental assumptions and synthetically generated workloads; and (3) solutions are designed in isolation, to fit the particular and possibly transitory needs of specific groups. These concerns led us to analyze more than two years of workloads from a high-energy physics collaboration. In addition to contradicting previously accepted models, our analysis revealed two novel data-usage patterns. First, a data-centric analysis reveals the existence of "filecules", groups of files that are always processed together. Second, a user-centric analysis uncovers small-world properties in data sharing that indicate emergent, interest-based grouping of users. We show that exploiting these patterns when designing resource management solutions leads to better scalability, lower costs, and increased adaptability to changing environments.
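
To make the notion of a filecule concrete, the following minimal Python sketch shows one way to read the definition "groups of files that are always processed together": two files fall in the same group when exactly the same set of jobs requested them. The function name find_filecules, the trace layout (a mapping from job identifier to the set of files it reads), and the sample data are illustrative assumptions, not the method or the workload used in the study.

```python
from collections import defaultdict


def find_filecules(jobs):
    """Group files that were requested by exactly the same set of jobs.

    `jobs` is a hypothetical trace: a dict mapping a job id to the set of
    file names that job read. Returns a sorted list of sorted file groups.
    """
    # Step 1: for every file, collect the ids of the jobs that read it.
    jobs_by_file = defaultdict(set)
    for job_id, files in jobs.items():
        for name in files:
            jobs_by_file[name].add(job_id)

    # Step 2: files with an identical job set are always used together,
    # so they land in the same group.
    groups = defaultdict(set)
    for name, job_ids in jobs_by_file.items():
        groups[frozenset(job_ids)].add(name)

    return sorted(sorted(group) for group in groups.values())


# Made-up example trace, for illustration only.
trace = {
    "job1": {"a", "b", "c"},
    "job2": {"a", "b"},
    "job3": {"c", "d"},
}

for filecule in find_filecules(trace):
    print(filecule)
```

In this toy trace, files "a" and "b" are requested by the same two jobs and so form one group, while "c" and "d" each stand alone. A group like this could then be treated as a single unit for caching or replication, which is the kind of exploitation the abstract alludes to.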