CS786 Machine Learning Projects

Administrative

Due date: March 28, 2013. I may meet with individual students to discuss their results if I have any questions about the submitted work.

Project 1

Data analysis + short report

Given 1000 features (X₁ to X₁₀₀₀), the label y is generated using a subset of them.

Each student receives one training dataset of 400 samples × 1000 features (X₁ to X₁₀₀₀) with label information y, and one testing dataset of 10,000 samples × 1000 features (X₁ to X₁₀₀₀) without label information.
Training Data

Testing Data

Some tools for your reference:
1. Weka, data mining software in Java that implements many algorithms, including SVM, PCA, and boosting
http://www.cs.waikato.ac.nz/~ml/weka/

2. SVM (available in various languages)
http://www.support-vector-machines.org/SVM_soft.html
http://www.kernel-machines.org/software
SVM in R: http://cran.r-project.org/src/contrib/Descriptions/e1071.html
Regularized SVM in R (SCAD & L1 SVM):
http://cran.r-project.org/web/packages/penalizedSVM/index.html

3. Logistic regression

Logistic regression: the glm(…, family="binomial", …) function in R

Regularized logistic regression in R: the glmnet package
Lasso (L1) in R and MATLAB:
http://www-stat.stanford.edu/~tibs/lasso.html

4. Dimension reduction: PCA package in R
http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/pca.html
In MATLAB, use the functions princomp and wmspca.

5. Boosting program
http://www.cs.princeton.edu/~schapire/boost.html
Boosting methods in R
http://rss.acs.unt.edu/Rdoc/library/boost/html/00Index.html

Requirements

Data analysis:

a) Select the features Xᵢ that cause y, based on the training dataset

b) Build a classifier using the training dataset and make predictions on the testing dataset

c) You can use any available tools or choose to implement your own analysis algorithms
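
For illustration only, steps a) and b) could be sketched as below in Python. This assumes y is a binary 0/1 label, which the assignment does not state, and it swaps in two deliberately simple stand-ins: a per-feature Welch t-statistic for screening and a nearest-centroid classifier for prediction. The function names and the toy data are hypothetical. A strong marginal association does not by itself show that a feature is causal, so a real analysis will likely need one of the tools listed above.

```python
import math


def welch_t_scores(X, y):
    """Return |Welch t-statistic| for each feature, assuming labels are 0/1."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        # Unbiased sample variances of the feature within each class.
        var_pos = sum((v - mean_pos) ** 2 for v in pos) / (len(pos) - 1)
        var_neg = sum((v - mean_neg) ** 2 for v in neg) / (len(neg) - 1)
        se = math.sqrt(var_pos / len(pos) + var_neg / len(neg))
        scores.append(abs(mean_pos - mean_neg) / se if se > 0 else 0.0)
    return scores


def select_top_k(scores, k):
    """Return the 1-based indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(j + 1 for j in ranked[:k])


def nearest_centroid_predict(X_train, y_train, X_test, selected):
    """Predict 0/1 labels using class centroids over the selected features."""
    cols = [j - 1 for j in selected]  # convert 1-based indices to 0-based
    centroids = {}
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    predictions = []
    for row in X_test:
        dist = {
            label: sum((row[c] - m) ** 2 for c, m in zip(cols, centroid))
            for label, centroid in centroids.items()
        }
        predictions.append(min(dist, key=dist.get))
    return predictions


# Hypothetical toy data: 6 samples, 3 features; only feature 1 tracks y.
X_train = [
    [0.1, 5.0, 1.0],
    [0.2, 4.0, 1.2],
    [0.0, 6.0, 0.9],
    [1.1, 5.5, 1.1],
    [0.9, 4.5, 1.0],
    [1.0, 5.0, 0.8],
]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.05, 5.0, 1.0], [0.95, 5.0, 1.0]]

selected = select_top_k(welch_t_scores(X_train, y_train), k=1)
predicted = nearest_centroid_predict(X_train, y_train, X_test, selected)
```

On the real data the number of features to keep (k here) is itself a choice to justify in the report, for example by cross-validation on the 400 training samples.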

Short Report:

Describe how you conducted your analysis and the results you obtained

State the features you selected as causal

Make one figure or table

Make one argument, comment or opinion

List three references

No more than 500 words

Evaluation:

Recall and precision of feature selection

Recall = (# correctly selected features) / (# total causal features)
Precision = (# correctly selected features) / (# total selected features)
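
As a quick check of the two definitions, here is a minimal Python sketch. The function name is hypothetical, and the list of causal features in the example is invented purely for the arithmetic; the true causal features are not given in this assignment.

```python
def selection_recall_precision(selected, causal):
    """Return (recall, precision) of a feature selection.

    selected: indices of the features a student reports as causal
    causal:   indices of the features that actually generate y
    """
    selected, causal = set(selected), set(causal)
    correct = len(selected & causal)  # correctly selected features
    recall = correct / len(causal) if causal else 0.0
    precision = correct / len(selected) if selected else 0.0
    return recall, precision


# Selecting X4, X6, X8, X11, X16 when the (made-up) causal set is
# X4, X8, X16, X20: 3 correct, so recall = 3/4 and precision = 3/5.
recall, precision = selection_recall_precision([4, 6, 8, 11, 16], [4, 8, 16, 20])
```

Selecting more features can only keep recall the same or raise it, but it tends to lower precision, so the two numbers should be read together.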

Prediction accuracy on the testing dataset
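
Prediction accuracy is taken here to mean the fraction of testing samples whose predicted label matches the true label. That is the usual definition, but it is an assumption about how the grading program works. A minimal Python sketch with a hypothetical function name and made-up labels:

```python
def accuracy(predicted, actual):
    """Return the fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up labels for four samples: three of the four predictions match.
example_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Because the testing labels are withheld, students can only estimate this number themselves, for instance by holding out part of the training data.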

Report writing quality

What to submit:

Analysis result files containing the feature selection and the sample predictions.

File 1, named YourLastName_SelectedXs.txt (see a sample file here; the sample means that you selected X4, X6, X8, X11, and X16).
File 2, named YourLastName_PredictionY.txt (see a sample file here).
This part is graded by a computer program. It is your responsibility to put your files in the correct format so that they are parsed and graded correctly.

Report file (hard copy and electronic copy)