Ranking causal SNPs and disease associated regions in genome wide
association studies by the support vector machine and random forest
Abstract:
We study the number of causal variants and associated regions
identified by top SNPs in rankings given by the popular 1 df
chi-square statistic, support vector machine (SVM), and the random
forest (RF) on simulated and real data. If we apply the SVM
and RF to the top 2r chi-square ranked SNPs, where r is the
number of SNPs whose p-values pass the Bonferroni-corrected threshold,
we find that both improve the ranks of causal variants and
associated regions and achieve higher power on simulated data.
These improvements, however, as well as stability of the SVM
and RF rankings, progressively decrease as the cutoff increases
to 5r and 10r. As applications, we compare the ranks of previously
replicated SNPs in real data, associated regions in type 1
diabetes (as provided by the Type 1 Diabetes Consortium), and
disease risk prediction accuracies given by the top-ranked SNPs
of each of the three methods. Software and a webserver are available
at http://svmsnps.njit.edu.
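
The screening step described above (count the r SNPs that pass the Bonferroni correction, then keep the top 2r chi-square ranked SNPs for re-ranking) can be illustrated with a minimal sketch. It is not the authors' software. It assumes the 1 df statistic is a Pearson chi-square on a 2x2 table of allele counts (cases vs. controls), which the abstract does not specify, and the names `chi2_1df`, `p_value_1df`, `screen_top_snps`, `alpha`, and `multiplier` are hypothetical. The SVM and RF re-ranking of the selected SNPs is not shown.

```python
import math

def chi2_1df(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]].

    Assumption: rows are cases/controls, columns are allele counts.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / denom

def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def screen_top_snps(tables, alpha=0.05, multiplier=2):
    """Return (r, indices of the top multiplier*r chi-square ranked SNPs).

    r is the number of SNPs with p-value below alpha / (number of SNPs).
    """
    stats = [chi2_1df(t) for t in tables]
    pvals = [p_value_1df(s) for s in stats]
    m = len(tables)
    r = sum(p < alpha / m for p in pvals)
    ranked = sorted(range(m), key=lambda i: stats[i], reverse=True)
    return r, ranked[: multiplier * r]

# Toy allele-count tables (made-up numbers, for illustration only).
tables = [
    [[150, 50], [100, 100]],   # strong association
    [[100, 100], [100, 100]],  # no association
    [[120, 80], [100, 100]],   # moderate association
    [[105, 95], [100, 100]],   # weak association
]
r, selected = screen_top_snps(tables)
print(r, selected)  # 1 [0, 2]
```

Setting `multiplier` to 5 or 10 gives the larger 5r and 10r cutoffs mentioned above.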
Citation: U. Roshan, S. Chikkagoudar, Z. Wei, K. Wang, H. Hakonarson,
Ranking causal variants and disease associated regions in genome wide
association studies by the support vector machine and random forest,
Nucleic Acids Research, 2011 (PDF)