A Common Framework for Mining Motifs in Diverse Data

Laxmi Parida, Computational Biology Center
IBM T J Watson Research Center


Abstract

The research community is inundated with data of biological origin, such as the genome sequences of various organisms and microarray data. The volume of this data is increasing rapidly, and our understanding of it lags behind its acquisition. The sheer scale calls for a systematic, computational approach to understanding the data. As a first step towards making sense of it, we study regularities in the data, on the hypothesis that these regularities reveal information vital to a greater understanding of biological systems. The talk will focus on the various kinds of regularities that we identify in data and on the methods we devise for their unsupervised (automatic) discovery. For genomic data that is one-dimensional (genome or protein sequences), we identify string and permutation patterns; more generally, we define 2D patterns, association patterns for microarray data, and network motifs for metabolic pathways. We focus on a few of these problems: we will give the mathematical definitions and present some interesting theoretical results along with their implications in practice. I will conclude the talk with a brief discussion of our work applying the general approach to (1) gene proximity analysis and (2) protein folding trajectory analysis.