Perform feature selection on a simulated dataset of single nucleotide polymorphism (SNP) genotype data containing 29623 SNPs (total features). Among these SNPs, 15 are causal: they and their neighboring SNPs discriminate between cases and controls, while the remainder are noise. The training set contains 4000 cases and 4000 controls. Your task is to predict the labels of 2000 test individuals.

Both datasets and the labels are available immediately following the link for this project file. The training dataset is called traindata.gz (in gzipped format), the training labels are in trueclass, and the test dataset is called testdata.gz (also in gzipped format). You may use cross-validation to evaluate the accuracy of your method and for parameter estimation.

Your project must be in Python. You cannot use numpy or scipy except for numpy arrays as given below. You may use the support vector machine, logistic regression, naive Bayes, linear regression and dimensionality reduction modules of sklearn, but not the feature selection ones. These classes are available by importing the respective module. For example, to use LinearSVC we do "from sklearn.svm import LinearSVC". You may also make system calls to external C programs for classification such as svmlight, liblinear, fest, and bmrm.

Memory issues: One challenge with this project is the size of the data and loading it into RAM. Floats and other numbers take up more than 4 bytes in Python because every value is really an object (a struct in C) that contains other information besides the value of the number. To reduce the space we can use the array class of Python. Type "from array import array" at the beginning of your program. Suppose we have a list of n floats called l. This will take more than 4n bytes. To make it space efficient, create a new array called l2 = array('f', l). The new array l2 can be treated pretty much like a normal list except that its values take only 4n bytes (as in C or C++).
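The saving described above can be checked with a short sketch. The list size n = 1000 and the variable names array_bytes and list_bytes are chosen here only for illustration, and the list figure is a rough estimate from sys.getsizeof rather than an exact measurement.

```python
import sys
from array import array

n = 1000  # illustrative size, not taken from the project data
l = [float(i) for i in range(n)]  # ordinary list of Python float objects
l2 = array('f', l)                # compact array of C floats

# Bytes used by the values stored in the array: itemsize * length
array_bytes = l2.itemsize * len(l2)

# Rough estimate for the list: the list's own pointer storage
# plus the size of every float object it refers to
list_bytes = sys.getsizeof(l) + sum(sys.getsizeof(x) for x in l)

print("array:", array_bytes, "bytes")
print("list (estimate):", list_bytes, "bytes")

# The array still supports indexing, slicing and iteration like a list
print(l2[10], len(l2))
```

If the genotypes turn out to be small integers (an assumption, since the encoding is not specified here), the type code 'b' stores each value in a single byte instead of four.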
You may also use numpy arrays for efficient storage, but no other numpy methods for array manipulation.

Your program should take as input the training dataset, the trueclass label file for the training points, and the test dataset. The output should be a prediction of the labels of the test dataset in the same format as in the class assignments. The score of your output is measured by accuracy / (number of features). To qualify for full points you need to achieve an accuracy of at least 63%.
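Since the sklearn feature selection modules are off limits, the ranking of SNPs has to be written by hand. The following is a minimal sketch of one possible approach, a chi-square statistic per SNP, and not a prescribed method. It assumes genotypes are coded as the integers 0, 1, 2 and labels as 0/1, neither of which is stated above. The function names chi_square_score and top_k_features are hypothetical.

```python
def chi_square_score(column, labels, n_genotypes=3):
    """Chi-square statistic of one SNP column against binary labels.

    Assumes genotypes are integers in range(n_genotypes) and labels are 0 or 1.
    """
    # observed[c][g] = number of individuals with class c and genotype g
    observed = [[0] * n_genotypes for _ in range(2)]
    for g, c in zip(column, labels):
        observed[c][g] += 1

    total = len(labels)
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][g] + observed[1][g] for g in range(n_genotypes)]

    score = 0.0
    for c in range(2):
        for g in range(n_genotypes):
            expected = row_sums[c] * col_sums[g] / total
            if expected > 0:  # skip genotypes that never occur
                score += (observed[c][g] - expected) ** 2 / expected
    return score


def top_k_features(columns, labels, k):
    """Return the indices of the k columns with the highest chi-square score."""
    scores = [(chi_square_score(col, labels), j) for j, col in enumerate(columns)]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


# Toy data: SNP 0 separates the classes perfectly, SNP 1 is noise
toy_labels = [0, 0, 1, 1]
toy_columns = [[0, 0, 2, 2], [0, 1, 0, 1]]
print(top_k_features(toy_columns, toy_labels, 1))
```

Because the score divides accuracy by the number of features, the value of k is a natural parameter to tune by cross-validation, subject to the 63% accuracy requirement.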