Feature Selection Algorithm for Classification of Simulated Genomic Dataset

    1. Simulated Genomic Dataset

    The dataset is a simulated genomic dataset containing about 30,000 single nucleotide polymorphisms (SNPs).
    Total Samples - 10000
    Number of classes - 2
    Train Size - 8000 (4000 controls and 4000 cases)
    Test Size - 2000 (1000 controls and 1000 cases)
    Feature Dimension - 32678

    2. Strategy


    Figure: Framework of the proposed method.

    (i) Apply the mRMR (Minimum Redundancy Maximum Relevance) algorithm to the training set to retain features that are relevant to the classification variable while being minimally redundant with one another. mRMR approximates maximizing the dependency between the joint distribution of the selected features and the classification variable.
    (ii) Apply Recursive Feature Elimination (RFE) to the features retained by mRMR to eliminate those that are still redundant.
    (iii) After both methods are applied, the feature dimension is reduced from 32678 to 10.
    (iv) The reduced training data is then used to train an SVM classifier with an RBF kernel.
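
    Step (i) can be illustrated with a minimal greedy mRMR sketch. It assumes discrete genotype features and the "difference" form of the criterion (relevance minus mean redundancy); the function names and the toy data are hypothetical and are not taken from the project's source code.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedy mRMR (difference form): at each step pick the feature with the
    largest relevance minus mean redundancy with the features chosen so far.
    `columns` is a list of feature columns; returns the chosen column indices."""
    relevance = [mutual_information(col, labels) for col in columns]
    # Start from the single most relevant feature.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    redundancy_sum = [0.0] * len(columns)
    while len(selected) < min(k, len(columns)):
        last = columns[selected[-1]]
        best, best_score = None, -math.inf
        for j, col in enumerate(columns):
            if j in selected:
                continue
            # Accumulate redundancy against the most recently selected feature.
            redundancy_sum[j] += mutual_information(col, last)
            score = relevance[j] - redundancy_sum[j] / len(selected)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


# Toy genotype data (0/1/2 minor-allele counts), made up for illustration.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 1, 0, 2, 2, 1, 2],  # informative
    [0, 0, 1, 0, 2, 2, 1, 2],  # exact copy of column 0 (redundant)
    [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the label
    [0, 0, 0, 1, 1, 1, 1, 2],  # informative, differs from column 0
]
print(mrmr_select(columns, labels, k=2))  # prints [0, 3]
```

    In this toy example the duplicate column is skipped even though it is as relevant as the first pick, because its redundancy with the already selected feature outweighs its relevance.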

    3. Experiments

    (i) 10-Fold Cross Validation Classification Accuracy on Train Set
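
    A 10-fold protocol of this kind could be set up as in the sketch below, assuming scikit-learn. The data is a random stand-in, not the project's dataset. The linear SVM used to rank features inside RFE and all hyperparameters are assumptions, since the source does not state them. RFE sits inside the pipeline so that it is refit on each training fold.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the mRMR-filtered training set (NOT the real data):
# 400 samples x 50 SNPs coded 0/1/2, balanced classes.
y = np.repeat([0, 1], 200)
X = rng.integers(0, 3, size=(400, 50)).astype(float)
# Make the first 5 SNPs weakly informative by shifting genotypes in cases.
shift = rng.integers(0, 2, size=(200, 5))
X[y == 1, :5] = np.clip(X[y == 1, :5] + shift, 0, 2)

pipeline = Pipeline([
    # RFE needs coefficients to rank features; a linear SVM is assumed here.
    ("rfe", RFE(LinearSVC(C=0.1, dual=False), n_features_to_select=10, step=5)),
    # Final classifier: SVM with an RBF kernel, as in step (iv).
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])

# With an integer cv and a classifier, scikit-learn uses stratified folds.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {100 * scores.mean():.1f}%")
```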


    (ii) Classification Accuracy on Test Set

    NOTE: The metric used for comparison is classification accuracy (%).
    DISR - Double Input Symmetrical Relevance
    CMIM - Conditional Mutual Information Maximization
    mRMR - Minimum Redundancy Maximum Relevance
    RFE - Recursive Feature Elimination

    View Source Code (GitHub)

    References

    [1] mRMR
    H. Peng, F. Long and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
    [2] DISR
    P. E. Meyer, C. Schretter and G. Bontempi, "Information-Theoretic Feature Selection in Microarray Data Using Variable Complementarity," in IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, pp. 261-274, June 2008.
    [3] CMIM
    F. Fleuret, "Fast binary feature selection with conditional mutual information," in Journal of Machine Learning Research, vol. 5, pp. 1531-1555, Nov. 2004.