Domain Knowledge based Mining of Large Network and Sequential Data

Bijaya Adhikari
Virginia Tech


Abstract

Which locations and staff should we monitor in order to detect pathogen outbreaks in hospitals? How do we predict the peak intensity of the influenza incidence in an interpretable fashion? How do we infer the states of all nodes in a critical infrastructure network where failures have occurred? Leveraging domain-based information should make it possible to answer these questions. However, several new challenges arise. First challenge is (a) presence of more complex dynamics. For example, some hospital pathogen spread via both people-to-people and surface-to-people interactions and correlations between failures in critical infrastructures go beyond the network structure and depend on the geography as well. Traditional approaches rely on restrictive propagation models which cannot capture the complex nature of the dynamics, resulting in sub-optimality. The next challenge pertains to (b) data sparsity. Although, there has been a significant increase in data collection in domains like public health and urban computing, the data sparsity still persists. For example, CDC has been collecting and releasing influenza incidence data each week since 1997. However, that results in only a handful of time-series. The third challenge is (c) mismatch between data and the process. In many situations, the underlying dynamical process is either unknown or actually depends on a mixture of several models and the data collected is just a proxy to the actual process. In such cases, methods which generalize well to unobserved (or unknown) models are required. Current approaches often fail in tackling these challenges as they either rely on restrictive models, or require large volumes of data, or work only for predefined models. Here I propose to leverage domain-based frameworks, which include novel models and analysis techniques, and domain-based low dimensional representation learning to tackle the challenges above for networks and time-series mining tasks. By developing novel frameworks, one can capture the complex dynamics accurately and analyze them more efficiently. At the same time, learning low-dimensional domain-aware embeddings capture domain specific properties more efficiently from sparse data, which is useful for subsequent tasks. Similarly, since domain-aware embeddings capture the model information directly from the data without any modeling assumptions they generalize better to new models. The domain-aware frameworks and embeddings I develop enable many applications in critical domains. For example, our domain-aware framework for C. Difficile outbreak detection has more than 95% accuracy. Similarly, our framework for product recommendation in E-commerce for queries with sparse engagement data results in a 34% improvement over the current Walmart.com search engine. Additionally, by exploiting domain-aware embeddings, we outperform non-trivial competitors by up to 40% for influenza forecasting. Similarly, domain-aware representations of subgraphs helped us outperform non-trivial baselines by up to 68% in the graph classification task.