Semi-supervised feature extraction for population structure identification using the Laplacian linear discriminant

Abstract: The identification of population structure from genome-wide SNP data is of significant interest in the population and medical genetics community. A popular solution is to perform unsupervised feature extraction using principal component analysis. Principal component analysis, however, relies only on global properties of the data. The Laplacian linear discriminant also takes local properties of the data into consideration and, as we show in this study, it can be extended to the semi-supervised setting. The extension can then be applied to extract features for identifying population structure when the ancestry of some individuals in some sub-populations of the admixture is known. Using real data, we simulate such semi-supervised scenarios and extract features using the Laplacian linear discriminant, kernel principal component analysis, and two recent semi-supervised feature extractors. We show that there is a statistically significant improvement in accuracy when the nearest mean classifier or k-means clustering is applied to the Laplacian linear discriminant features, compared to kernel principal component analysis and the other methods.
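
The evaluation setting described in the abstract can be illustrated with a minimal sketch. It is not the Laplacian linear discriminant and not the study's code: it uses plain linear principal component analysis as the feature extractor, synthetic genotypes rather than real data, and only two hypothetical sub-populations. All function names, parameters, and the allele-frequency model are assumptions made for illustration. The sketch shows only the overall shape of the pipeline: extract features from all individuals, then apply a nearest mean classifier whose class means come from the few individuals with known ancestry.

```python
import numpy as np

def pca_features(X, n_components=2):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def nearest_mean_predict(Z, labeled_idx, labeled_y):
    """Assign every row of Z to the class whose labeled mean is closest."""
    classes = np.unique(labeled_y)
    means = np.stack([Z[labeled_idx[labeled_y == c]].mean(axis=0) for c in classes])
    # Squared Euclidean distance from each individual to each class mean.
    dists = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def simulate(n_per_pop=50, n_snps=1000, n_labeled=5, seed=0):
    """Simulate a semi-supervised scenario on synthetic genotypes (assumed model)."""
    rng = np.random.default_rng(seed)
    # Hypothetical allele frequencies: population 1 is a perturbed copy of population 0.
    base = rng.uniform(0.1, 0.9, size=n_snps)
    drifted = np.clip(base + rng.normal(0.0, 0.15, size=n_snps), 0.05, 0.95)
    freqs = np.stack([base, drifted])
    y = np.repeat([0, 1], n_per_pop)
    # Genotypes coded as 0/1/2 copies of the reference allele.
    X = rng.binomial(2, freqs[y]).astype(float)
    # Ancestry is treated as known for only a few individuals per population.
    labeled_idx = np.concatenate(
        [rng.choice(np.where(y == c)[0], n_labeled, replace=False) for c in (0, 1)]
    )
    Z = pca_features(X, n_components=2)
    pred = nearest_mean_predict(Z, labeled_idx, y[labeled_idx])
    return y, pred, labeled_idx

if __name__ == "__main__":
    y, pred, labeled_idx = simulate()
    unlabeled = np.ones(len(y), dtype=bool)
    unlabeled[labeled_idx] = False
    print("accuracy on unlabeled individuals:", (pred[unlabeled] == y[unlabeled]).mean())
```

Replacing `pca_features` with another feature extractor while keeping the classifier fixed is the kind of comparison the abstract reports, although the extractors and data used there differ from this sketch.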

U. Roshan. Semi-supervised feature extraction for population structure identification using the Laplacian linear discriminant. Under revision.