CS786 Machine Learning Projects


Administrative

Due date: March 28, 2013. I may meet with individual students to discuss their results if I have any questions about the submitted work.

Project 1

Data analysis + short report

Given 1000 features (X1~X1000), the label y is generated using a subset of them.

Each student receives one training dataset of 400 samples * 1000 features (X1~X1000) with label information y, and one testing dataset of 10,000 samples * 1000 features (X1~X1000) without label information.

Training Data

Testing Data

 

Some tools for your reference:

1. Weka: data mining software in Java that implements many algorithms, including SVM, PCA, and boosting.
http://www.cs.waikato.ac.nz/~ml/weka/

2. SVM (available in various languages)
http://www.support-vector-machines.org/SVM_soft.html
http://www.kernel-machines.org/software
SVM in R: http://cran.r-project.org/src/contrib/Descriptions/e1071.html
Regularized SVM in R (SCAD & L1 SVM):
http://cran.r-project.org/web/packages/penalizedSVM/index.html

3. Logistic Regression

Logistic regression: glm(…,family="binomial",…) function in R

Regularized logistic regression in R: the glmnet package
Lasso (L1) in R and MATLAB
http://www-stat.stanford.edu/~tibs/lasso.html

4. Dimension reduction: PCA package in R
http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/pca.html
In MATLAB, use the functions princomp and wmspca.

5. Boosting program
http://www.cs.princeton.edu/~schapire/boost.html
Boosting methods in R
http://rss.acs.unt.edu/Rdoc/library/boost/html/00Index.html

 

Requirements

 Data analysis:

a) Select the features Xi that cause y, based on the training dataset

b) Build a classifier using the training dataset and make predictions on the testing dataset

c) You can use any available tools or choose to implement your own analysis algorithms
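As one possible starting point for steps a) and b), here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the labels are assumed to be binary (0/1), the toy data generator (where y depends on X3 and X7) is hypothetical, and the techniques shown (ranking features by absolute Pearson correlation with y, then a nearest-centroid classifier) are simple stand-ins rather than a recommended method. Note that correlation with y does not by itself establish that a feature is causal.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(X, y, top_k):
    """Return the 1-based indices of the top_k features ranked by |correlation with y|."""
    scored = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scored.append((abs(pearson(column, y)), j + 1))
    scored.sort(reverse=True)
    return sorted(index for _, index in scored[:top_k])


def fit_centroids(X, y, selected):
    """Compute the per-class mean of the selected features."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[i - 1] for r in rows) / len(rows) for i in selected]
    return centroids


def predict(X, selected, centroids):
    """Assign each sample to the class whose centroid is nearest (squared Euclidean)."""
    predictions = []
    for row in X:
        point = [row[i - 1] for i in selected]
        best = min(
            centroids,
            key=lambda label: sum((p - c) ** 2 for p, c in zip(point, centroids[label])),
        )
        predictions.append(best)
    return predictions


def make_data(rng, n_samples, n_features):
    """Hypothetical toy data: y depends only on X3 and X7 (not the course data)."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [1 if row[2] + row[6] > 0 else 0 for row in X]
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(rng, 400, 50)   # 400 training samples, as in the project
X_test, y_test = make_data(rng, 1000, 50)    # toy held-out set WITH labels, for checking

selected = select_features(X_train, y_train, top_k=2)
centroids = fit_centroids(X_train, y_train, selected)
y_pred = predict(X_test, selected, centroids)
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)

print("selected features:", ["X%d" % i for i in selected])
print("held-out accuracy: %.3f" % accuracy)
```

In the actual project the number of causal features is unknown, so a fixed top_k is itself an assumption. Choosing it by cross-validation on the training dataset, or using one of the regularized methods listed above, may be a better fit.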

Short Report:

  • Describe how you conducted your analysis and the results you obtained

  • State the features you selected as causal

  • Make one figure or table

  • Make one argument, comment or opinion

  • List three references

  • No more than 500 words

 

Evaluation:

  1. Recall and precision of feature selection

    • Recall = (# correctly selected features) / (# total causal features)
    • Precision = (# correctly selected features) / (# total selected features)
  2. Prediction accuracy of the testing dataset

  3. Report writing quality
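The recall and precision definitions above can be sketched in a few lines of Python. The function names are hypothetical, the causal set used here is made up purely for illustration, and prediction accuracy is assumed to mean the fraction of test samples whose predicted label matches the true label. The actual grading program may differ in its details.

```python
def feature_selection_scores(selected, causal):
    """Return (recall, precision) for a set of selected feature indices."""
    selected_set = set(selected)
    causal_set = set(causal)
    correct = len(selected_set & causal_set)  # correctly selected features
    recall = correct / len(causal_set) if causal_set else 0.0
    precision = correct / len(selected_set) if selected_set else 0.0
    return recall, precision


def prediction_accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label (assumed definition)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Selected features X4, X6, X8, X11, X16; the causal set is a made-up example.
selected_example = [4, 6, 8, 11, 16]
causal_example = [4, 8, 16, 23]

recall, precision = feature_selection_scores(selected_example, causal_example)
print("recall:", recall)        # 3 of 4 causal features found -> 0.75
print("precision:", precision)  # 3 of 5 selected features are causal -> 0.6

print("accuracy:", prediction_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 0.75
```

Selecting more features tends to raise recall but lower precision, so the two scores are best read together.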

 
What to submit:
  • Analysis result files, including the feature selection and the sample predictions.
    • File 1, named YourLastName_SelectedXs.txt; see a sample file here, which indicates that X4, X6, X8, X11, and X16 were selected.
    • File 2, named YourLastName_PredictionY.txt; see a sample file here.
    • This part is graded by a computer program. It is your responsibility to ensure that your files are in the correct format so that they are parsed and graded correctly.
  • Report file (hard copy and electronic copy)