Machine Learning for Computational Biology

Dr. Yanjun Qi
NEC Labs America


The field of computational biology has grown rapidly in recent years, both in newly available data and in new scientific questions. This growth poses fresh challenges for machine learning, since biological data are often relationally structured and highly diverse. In this talk, I will present two of our efforts to meet these challenges.

(1) The first half concerns protein-protein interactions (PPIs), which are critical for virtually every biological function. Researchers have recently proposed supervised information integration for classifying pairs of proteins as interacting or not. However, the performance of such methods is largely restricted by the scarcity of confirmed interacting pairs (labeled examples). Meanwhile, there exist many protein pairs for which an association between the two partners is observed, but without enough experimental evidence to confirm a direct interaction (partially labeled examples). We propose a semi-supervised multi-task framework for predicting PPIs from both labeled and partially labeled reference sets, and describe multiple ways of exploiting the semi-supervision within this framework. Our method is shown to improve the identification of interacting pairs between HIV-1 and human proteins.

(2) The second half of the talk presents a unified architecture for predicting local protein properties from sequence alone. A variety of functionally important protein properties, such as secondary structure, transmembrane topology, and solvent accessibility, can be encoded as a labeling of amino acids. Most previous approaches solve a single task at a time. Motivated by recent, successful work in natural language processing, we propose multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. By training a deep neural network architecture in a multitask fashion, our model obviates the need for task-specific feature engineering. We demonstrate that, for every task we considered, our approach yields statistically significant improvements over a single-task neural network, and that the resulting model achieves state-of-the-art performance.
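To make the joint-model idea concrete, the sketch below shows a shared encoder whose features feed several task-specific prediction heads, so that all labeling tasks are served by one set of learned representations. This is only a minimal illustration of the multitask pattern, not the architecture from the talk: the per-residue one-hot input, the hidden size, and the two example heads (3-state secondary structure, 2-state solvent accessibility) are assumptions chosen for brevity.

```python
import math
import random

random.seed(0)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """Encode one amino acid as a 20-dimensional one-hot vector."""
    v = [0.0] * len(AMINO_ACIDS)
    v[AMINO_ACIDS.index(residue)] = 1.0
    return v

def linear(x, W, b):
    """Dense layer: y = W x + b, with W given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    """Turn raw scores into a probability distribution."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init(rows, cols):
    """Small random weight matrix (illustrative initialization)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

HIDDEN = 16  # hypothetical hidden-layer width

# Shared encoder parameters: in multitask training these are updated
# by the gradients of *all* task losses, which is what couples the tasks.
W_shared, b_shared = init(HIDDEN, len(AMINO_ACIDS)), [0.0] * HIDDEN

# Task-specific output heads; task names and class counts are illustrative.
heads = {
    "secondary_structure": (init(3, HIDDEN), [0.0] * 3),   # helix/sheet/coil
    "solvent_accessibility": (init(2, HIDDEN), [0.0] * 2),  # buried/exposed
}

def predict(residue):
    """Run the shared encoder once, then every task head on the same features."""
    h = [math.tanh(v) for v in linear(one_hot(residue), W_shared, b_shared)]
    return {task: softmax(linear(h, W, b)) for task, (W, b) in heads.items()}

out = predict("M")  # per-task class distributions for one residue
```

In a full model the single-residue input would be replaced by a window or convolution over the sequence, and the shared and head parameters would be trained jointly on a weighted sum of the task losses; the sketch only captures the shared-representation structure that makes task-specific feature engineering unnecessary.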