Discriminative Learning Methods For Analyzing Genome-wide Association Studies

Usman Roshan
Computer Science Department, NJIT


Abstract

Genome-wide association studies aim to identify genes and regions of the genome that are associated with a disease or other phenotype of interest. Such studies pose several challenging problems that require high-performance computing and machine learning methods. To list a few: (1) how does one detect genes that are truly associated with the disease or phenotype and reject false positives? (2) how does one select the genes that best predict disease risk, and which method predicts that risk most accurately? and (3) how does one detect genes that interact with each other to form a complex pathway? A standard approach to detecting associated regions is to rank them by chi-square test p-value and consider the top-ranked ones for further study. As a discriminative alternative, we have studied a ranking strategy based on the popular support vector machine (SVM). We demonstrate empirically on simulated data that this strategy ranks causal regions higher than the chi-square statistic does, provided that sufficient false positives are removed in advance. We also show that the top-ranked regions given by the SVM yield higher disease risk prediction accuracy on both simulated and real data. In this talk I will present our experimental results and highlight the advantages and limitations of the SVM for ranking genomic regions. I will also give an overview of our other work on the problems stated above.
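
To make the two ranking strategies concrete, here is a minimal sketch on toy data. It is not the procedure used in the talk: the abstract does not say how the SVM ranking is derived, so ranking SNPs by the absolute weight of a linear SVM is an assumption, as are the Pegasos-style trainer, the simulation settings (400 samples, 20 SNPs, SNP 0 causal), and every function name below. Because every genotype table here has the same degrees of freedom, sorting by the chi-square statistic gives the same order as sorting by p-value.

```python
import random

def simulate(n_cases=200, n_controls=200, n_snps=20, seed=1):
    """Toy case/control data (hypothetical settings): SNP 0 is causal."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in ((1, n_cases), (0, n_controls)):
        for _ in range(n):
            row = []
            for j in range(n_snps):
                # The risk allele is more frequent in cases at SNP 0 only.
                freq = 0.6 if (j == 0 and label == 1) else 0.3
                # Genotype = number of risk alleles (0, 1 or 2).
                row.append(int(rng.random() < freq) + int(rng.random() < freq))
            X.append(row)
            y.append(label)
    return X, y

def chi_square_stat(genotypes, labels):
    """Pearson chi-square statistic of the 2 x 3 status-by-genotype table."""
    observed = [[0, 0, 0], [0, 0, 0]]
    for g, lab in zip(genotypes, labels):
        observed[lab][g] += 1
    n = len(labels)
    row_tot = [sum(r) for r in observed]
    col_tot = [observed[0][k] + observed[1][k] for k in range(3)]
    stat = 0.0
    for i in range(2):
        for k in range(3):
            expected = row_tot[i] * col_tot[k] / n
            if expected > 0:
                stat += (observed[i][k] - expected) ** 2 / expected
    return stat

def train_linear_svm(X, y, lam=0.01, epochs=30, seed=0):
    """Pegasos-style stochastic subgradient descent; labels must be -1/+1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

X, y = simulate()
n_snps = len(X[0])

# Chi-square ranking: largest statistic first (same order as smallest p-value
# here, since every table has the same degrees of freedom).
chi_scores = [chi_square_stat([row[j] for row in X], y) for j in range(n_snps)]
chi_rank = sorted(range(n_snps), key=lambda j: -chi_scores[j])

# SVM ranking (assumed scheme): center the features because the model has no
# bias term, train, then sort SNPs by absolute weight.
means = [sum(col) / len(X) for col in zip(*X)]
Xc = [[x - m for x, m in zip(row, means)] for row in X]
w = train_linear_svm(Xc, [2 * lab - 1 for lab in y])
svm_rank = sorted(range(n_snps), key=lambda j: -abs(w[j]))

print("chi-square top 5:", chi_rank[:5])
print("SVM top 5:       ", svm_rank[:5])
```

With a single strong causal SNP and no confounding, both rankings should place it near the top, so this toy setting only illustrates the mechanics and says nothing about which method ranks better on realistic data.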