Perform feature selection on a simulated dataset of single nucleotide 
polymorphism (SNP) genotype data containing 29623 SNPs (total features). 
Amongst all SNPs are 15 causal ones which means they and neighboring ones 
discriminate between case and controls while remainder are noise.

In the training are 4000 cases and 4000 controls. Your task is to predict 
the labels of 2000 test individuals.

Both datasets and labels are immediately following the link for this
project file. The training dataset is called traindata.gz (in gzipped
format), training labels are in trueclass, and test dataset is called
testdata.gz (also in gzipped format).

You may use cross-validation to evaluate the accuracy of your method and for 
parameter estimation. 

Your project must be in Python. You cannot use numpy or scipy except for 
numpy arrays as given below. You may use the support vector machine, 
logistic regression, naive bayes, linear regression and dimensionality 
reduction modules but not the feature selection ones. These classes are 
available by importing the respective module. For example to use 
LinearSVC we do

from sklearn import LinearSVC

You may also make system calls to external C programs for classification
such as svmlight, liblinear, fest, and bmrm.

Memory issues:

One challenge with this project is the size of the data and loading it into 
RAM. Floats and numbers take up more than 4 bytes in Python because 
everything is really an object (a struct in C) that contain other 
information besides the value of the number. To reduce the space we can use 
the array class of Python.

Type "from array import array" in the beginning of your program. 
Suppose we have a list of n float called l. This will take more space 
than 4l bytes. To make it space efficient create a new array called l2 
= array('f', l). The new array l2 can be treated pretty much like a 
normal list except that it will take 4l bytes (as is done in C or C++).

You may also use numpy arrays for efficient storage but no other numpy 
methods for array manipulations.

Your program would take as input the training dataset, the 
trueclass label file for training points, and the test dataset. 
The output would be a prediction of the labels of the test dataset in the 
same format as in the class assignments. 

The score of your output is measured by accuracy/(#number of features). 
In order to qualify for full points you would need to achieve an accuracy
of at least 63%.