Semi-supervised feature extraction for population structure identification using the
Laplacian linear discriminant
Abstract:
The identification of population structure from genome-wide SNP data is of significant
interest in the population and medical genetics community. A popular solution is to
perform unsupervised feature extraction using principal component analysis. Principal
component analysis, however, relies only on global properties of the data. The
Laplacian linear discriminant takes into consideration local properties of the data as
well and, as we show in this study, it can be extended to the semi-supervised setting.
This can then be applied to extract features for identifying population structure when
the ancestry of some individuals in some sub-populations of the admixture is known.
Using real data we simulate such semi-supervised scenarios and extract features using
the Laplacian linear discriminant, kernel principal component analysis, and two recent
semi-supervised feature extractors. We show that there is a statistically significant
improvement in accuracy when the nearest mean classifier or k-means clustering is
applied to the Laplacian linear discriminant features rather than to features from
kernel principal component analysis and the other methods.
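
As a rough illustration of the classification step mentioned in the abstract, the following is a minimal sketch of a nearest mean classifier applied to already-extracted features. It is not the paper's implementation: the function names `fit_class_means` and `predict_nearest_mean`, the two-dimensional toy feature vectors, and the population labels "A" and "B" are all assumptions made for the example, and the feature extraction itself is not shown.

```python
import math
from collections import defaultdict


def fit_class_means(features, labels):
    """Average the feature vectors of labeled individuals, per population."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict_nearest_mean(means, x):
    """Assign x to the population whose mean is closest in Euclidean distance."""
    return min(means, key=lambda y: math.dist(means[y], x))


# Hypothetical extracted features for individuals of known ancestry.
labeled_features = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.1]]
labeled_ancestry = ["A", "A", "B", "B"]

# Hypothetical extracted features for individuals of unknown ancestry.
unlabeled_features = [[0.1, 0.2], [0.9, 1.0]]

means = fit_class_means(labeled_features, labeled_ancestry)
predictions = [predict_nearest_mean(means, x) for x in unlabeled_features]
print(predictions)
```

In the semi-supervised scenario described above, the class means would be estimated only from the individuals whose ancestry is known, and the remaining individuals would be assigned to the nearest mean in the extracted feature space.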
U. Roshan, Semi-supervised feature extraction for population structure identification using the
Laplacian linear discriminant
Under revision