K-means-based Feature Learning for Protein Sequence Classification
Abstract:
Protein sequence classification is a long-standing challenge in bioinformatics and related fields.
Given the complexity and volume of protein data, algorithmic techniques such as sequence alignment
are often unsuitable because of time and memory constraints. Heuristic methods based on machine
learning are the dominant approach to classifying large sets of protein data. In recent years,
unsupervised deep learning techniques have garnered significant attention across many
classification domains, especially for image data. In this study, we
adapt a k-means-based deep learning approach that was originally developed for image classification to
classify protein sequence data. We use this unsupervised learning method to preprocess the data and
create new feature vectors, which are then classified by a traditional supervised learning algorithm
such as a support vector machine (SVM). We find the performance of this technique to be superior to
that of the spectrum kernel and the empirical kernel map, and comparable to that of slower distance
matrix-based approaches.
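
To make the pipeline above concrete, here is a minimal sketch of k-means feature learning on protein sequences. It is an illustration under stated assumptions, not the method from the paper: the window width, the one-hot encoding, the "triangle" activation (a common choice in k-means feature learning for images), the average pooling, the toy sequences, and all function names are hypothetical choices made for this example. The resulting fixed-length vectors are what would be passed to a supervised classifier such as an SVM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_windows(seq, width):
    """Slide a window of `width` residues over `seq` and one-hot encode each window."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    windows = []
    for start in range(len(seq) - width + 1):
        vec = [0.0] * (width * len(AMINO_ACIDS))
        for offset, aa in enumerate(seq[start:start + width]):
            if aa in index:  # non-standard residues are left as all zeros
                vec[offset * len(AMINO_ACIDS) + index[aa]] = 1.0
        windows.append(vec)
    return windows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids learned without labels."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def encode(seq, centroids, width):
    """Map a sequence to one feature per centroid.

    Assumed encoding: triangle activation max(0, mean_distance - distance)
    per window, averaged over all windows of the sequence.
    """
    feats = [0.0] * len(centroids)
    windows = one_hot_windows(seq, width)
    for w in windows:
        dists = [sq_dist(w, c) ** 0.5 for c in centroids]
        mean = sum(dists) / len(dists)
        for j, d in enumerate(dists):
            feats[j] += max(0.0, mean - d)
    return [f / len(windows) for f in feats] if windows else feats


# Toy, made-up sequences purely for illustration.
seqs = ["MKVLAAGIVGLLLA", "MKVLGAGIVALLLA", "GSHMDEEKRRDEEK", "GSHMDEQKRKDEEK"]
width = 3
all_windows = [w for s in seqs for w in one_hot_windows(s, width)]
centroids = kmeans(all_windows, k=4)
features = [encode(s, centroids, width) for s in seqs]  # one vector per sequence
```

Because every sequence is reduced to a vector with one entry per centroid, sequences of different lengths become directly comparable, which is what allows a standard classifier to be trained on them.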
Citation:
Paul Melman and Usman Roshan, A k-means based feature learning method for protein sequence classification,
accepted to BICOB 2018 (local link to paper)