SVM and random forest SNPs

Ranking causal SNPs and disease associated regions in genome wide association studies by the support vector machine and random forest
Abstract: We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-square statistic, the support vector machine (SVM), and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-square ranked SNPs, where r is the number of SNPs with p-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as the stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications, we compare the ranks of previously replicated SNPs in real data, associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and disease risk prediction accuracies as given by top ranked SNPs by the three methods. Software and a webserver are available at http://svmsnps.njit.edu.
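The pre-selection step described in the abstract (rank SNPs by the 1 df chi-square statistic, count the r SNPs within the Bonferroni correction, keep the top 2r) can be illustrated with a minimal sketch. This is not the authors' software. It assumes an allelic 2x2 test on allele counts, a family-wise level of 0.05, and an input dictionary layout invented for this example; the function names `allelic_chi2`, `chi2_pvalue_1df`, and `select_top_snps` are hypothetical. The SVM or RF re-ranking of the selected SNPs is not shown.

```python
import math


def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """1 df chi-square statistic for a 2x2 table of allele counts.

    Assumption: an allelic test (alternate vs. reference allele counts
    in cases and controls); the page does not state which 1 df test is used.
    """
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # monomorphic SNP or empty group: no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def select_top_snps(counts, alpha=0.05, multiplier=2):
    """Return (r, top SNP ids), where r is the number of SNPs whose p-value
    is at most alpha / (number of SNPs) and the list holds the
    multiplier * r best chi-square ranked SNPs.

    counts: dict mapping SNP id -> (case_alt, case_total, ctrl_alt, ctrl_total)
    (a hypothetical layout chosen for this sketch).
    """
    if not counts:
        return 0, []
    stats = {snp: allelic_chi2(*c) for snp, c in counts.items()}
    ranked = sorted(stats, key=stats.get, reverse=True)
    threshold = alpha / len(ranked)  # Bonferroni-corrected threshold
    r = sum(1 for snp in ranked if chi2_pvalue_1df(stats[snp]) <= threshold)
    return r, ranked[: multiplier * r]
```

Passing `multiplier=5` or `multiplier=10` gives the 5r and 10r cutoffs mentioned in the abstract. The returned SNP ids would then be handed to an SVM or RF implementation for re-ranking.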

Online Supplementary Material

Webserver
Software
Simulated data:
- General performance on different relative risks
- Performance as a function of sample size
  - data_RR1.25_2000.tgz (relative risk 1.25; 2000 cases and 2000 controls)
  - data_RR1.25_4000.tgz (relative risk 1.25; 4000 cases and 4000 controls)
- Performance on causal allele frequencies at most 5%
- Power study
- Disease risk prediction on simulated data
- HapMap CEU phased genotypes used in simulating above data: Illumina300KCEU.tgz
Previously replicated SNPs in WTCCC studies (from Evans et al. 2009, Hum. Mol. Gen. 18(18), 3525-3531)
Citation: U. Roshan, S. Chikkagoudar, Z. Wei, K. Wang, H. Hakonarson, Ranking causal variants and disease associated regions in genome wide association studies by the support vector machine and random forest, Nucleic Acids Research, 2011 (PDF)