CS698 Machine Learning Final Project (Spring 2011)


Administrative

Due date: May 3, 2011

Evaluation time & date: 2pm-4pm May 9 (Monday), 2011. I may select students to discuss their results if I have any questions about the submitted work.

Projects

You have two options for the final project. Please choose one of them.

Option 1: Data analysis + short report

There are 1000 features (X1~X1000), and y is generated using a subset of them.

All students share one common training dataset of 400 samples * 1000 features (X1~X1000) with label information y. Each student is also assigned a testing dataset of 100 samples * 1000 features (X1~X1000) without label information.

 

Some tools for your reference:

1.      Weka Data Mining Software in Java, which implements many algorithms,
including SVM, PCA, boosting, etc.
http://www.cs.waikato.ac.nz/~ml/weka/

2.      SVM (available in various languages)
http://www.support-vector-machines.org/SVM_soft.html
http://www.kernel-machines.org/software
SVM in R: http://cran.r-project.org/src/contrib/Descriptions/e1071.html
Regularized SVM in R (SCAD & L1 SVM):
http://cran.r-project.org/web/packages/penalizedSVM/index.html

3.      Logistic Regression

Logistic regression: glm(…,family="binomial",…) function in R

Regularized Logistic Regression written in R
http://cran.r-project.org/web/packages/penalized/index.html
Lasso (L1) in R and MATLAB
http://www-stat.stanford.edu/~tibs/lasso.html

4.      Dimension reduction: PCA package in R
http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/pca.html
In MATLAB, use the functions princomp and wmspca

5.      Boosting program
http://www.cs.princeton.edu/~schapire/boost.html
Boosting methods in R
http://rss.acs.unt.edu/Rdoc/library/boost/html/00Index.html

 

Requirement

 Data analysis:

a) Select the features Xi that cause y, based on the training dataset

b) Build a classifier using the training dataset and make predictions on the testing dataset

c) You can use any available tools or choose to implement your own analysis algorithms
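
Steps a) and b) can be approached in many ways. The following is only a minimal sketch using the Python standard library, assuming binary 0/1 labels. It ranks features by absolute Pearson correlation with y (a simple univariate filter that measures association, not causation) and classifies with a nearest-centroid rule on the selected features. The synthetic data, the function names and the cutoff k are illustrative assumptions, not part of the assignment.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(X, y, k):
    """Return the 0-based indices of the k features most correlated with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(X, y, features):
    """Compute the per-class mean of the selected features."""
    centroids = {}
    for label in set(y):
        rows = [row for row, target in zip(X, y) if target == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return centroids


def predict(X, features, centroids):
    """Assign each sample to the class whose centroid is nearest."""
    labels = []
    for row in X:
        point = [row[j] for j in features]
        labels.append(min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        ))
    return labels


# Synthetic stand-in for the course data (an assumption for illustration):
# 50 features, of which only indices 0, 1 and 2 (i.e. X1, X2, X3) generate y.
rng = random.Random(0)


def make_data(n_samples, n_features=50):
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [1 if row[0] + row[1] + row[2] > 0 else 0 for row in X]
    return X, y


X_train, y_train = make_data(400)
X_test, y_test = make_data(100)

selected = select_features(X_train, y_train, k=3)
model = fit_centroids(X_train, y_train, selected)
predictions = predict(X_test, selected, model)
```

In the real task the number of causal features is unknown, so k would have to be chosen from the data, for example by cross-validation on the training set. The regularized methods listed above (L1 or SCAD penalties) are an alternative that selects features and fits the classifier in one step.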

Short Report:

  • Describe how you conducted your analysis and the results you obtained

  • State the features you selected as causal

  • Make one figure or table

  • Make one argument, comment or opinion

  • List three references

  • <=500 words

 

Evaluation:

  1. Recall and precision of feature selection as measured by F score

    • Recall = #correctly selected features/#total causal features
    • Precision = #correctly selected features/#total selected features
    • F score = 2*Recall*Precision/(Recall + Precision)
  2. Prediction accuracy of the testing dataset

  3. Report writing quality
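
To make the first two criteria concrete, here is a minimal sketch of how they could be computed. The function names and the example feature sets are hypothetical, and the actual grading program may differ in its details (for instance, in how it treats an empty selection).

```python
def f_score(selected, causal):
    """Return (recall, precision, F score) for two collections of feature indices."""
    selected, causal = set(selected), set(causal)
    correct = len(selected & causal)  # correctly selected features
    recall = correct / len(causal) if causal else 0.0
    precision = correct / len(selected) if selected else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)


def accuracy(predicted, actual):
    """Fraction of testing samples whose predicted label equals the true label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical example: 5 features selected, 4 truly causal, 3 in common.
recall, precision, f = f_score({1, 4, 5, 6, 11}, {1, 4, 5, 20})
```

Because the F score is the harmonic mean of recall and precision, selecting all 1000 features (perfect recall, poor precision) or only a single sure feature (good precision, poor recall) both score low.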

 
What to submit:
  • Analysis result file including feature selection and sample prediction.
    • Format: plain text file with the 1st line for feature selection and 2nd line for sample prediction
    • 1st line: a sequence of 1000 1s and/or 0s separated by spaces for X1~X1000, with 1 and 0 denoting selected and not selected, respectively.
    • 2nd line: a sequence of 100 predicted sample labels separated by spaces for the 100 testing samples, in the given order.
    • Name your file by adding cs698_ as the prefix to the given testing file name (cs698_#.txt), e.g. cs698_1.txt.
    • Example: suppose you analyze 3.txt, select features X1, X4, X5, X6 and X11, and predict Samples 1~10 and Samples 91~100 to be 1 and all the others to be 0. The corresponding sample solution file is cs698_3.txt.
    • This part is graded by a computer program. It is your responsibility to put your file in the correct format so that it is parsed and graded correctly.
  • Report file
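
Since the result file is parsed by a program, it may help to generate it with a script rather than by hand. The sketch below builds the two lines described above for the 3.txt example. The function names are hypothetical, and the trailing newline after the second line is an assumption about what the grader accepts.

```python
def format_submission(selected_features, predicted_labels,
                      n_features=1000, n_samples=100):
    """Build the two-line submission text.

    selected_features: 1-based feature indices (X1 is 1, X1000 is 1000).
    predicted_labels: one label per testing sample, in the given order.
    """
    chosen = set(selected_features)
    if any(i < 1 or i > n_features for i in chosen):
        raise ValueError("feature indices must lie between 1 and n_features")
    if len(predicted_labels) != n_samples:
        raise ValueError("expected exactly one label per testing sample")
    line1 = " ".join("1" if i in chosen else "0" for i in range(1, n_features + 1))
    line2 = " ".join(str(label) for label in predicted_labels)
    return line1 + "\n" + line2 + "\n"


def submission_name(testing_file_name):
    """Add the cs698_ prefix to the given testing file name."""
    return "cs698_" + testing_file_name


def write_submission(testing_file_name, selected_features, predicted_labels):
    """Write the submission file to the current directory and return its name."""
    name = submission_name(testing_file_name)
    with open(name, "w") as handle:
        handle.write(format_submission(selected_features, predicted_labels))
    return name


# The example from the text: features X1, X4, X5, X6, X11 selected;
# Samples 1~10 and 91~100 predicted as 1, all others as 0.
labels = [1 if (s <= 10 or s >= 91) else 0 for s in range(1, 101)]
text = format_submission([1, 4, 5, 6, 11], labels)
```

Calling write_submission("3.txt", [1, 4, 5, 6, 11], labels) would then create cs698_3.txt in the current directory.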

Common Training Data

Testing Data

Last Name    First Name                Testing Data
Aunsri       Nattapol                  1
Boston       Daniel                    2
Boyd         Justin                    3
Fay          Brendan                   4
Fei          Yi                        5
Guo          Wen                       6
Hu           Qingyang                  7
Hu           Weicheng                  8
Khan         Mohammad Ashrafuzzaman    9
Lin          Yuan                      10
Ma           Xiguo                     11
Morrell      Robert                    12
Poling       David                     13
Roberts      Andrew                    14
Wang         Wei                       15
Xiong        Wei                       16
XUE          LONG                      17
YAN          Zihua                     18
Ye           Luhua                     19

Option 2: Write a review paper

Requirement

  • Write a review paper to summarize recent developments in machine learning

  • Focus on one topic or sub-field, e.g., feature selection, dimension reduction, tree-based methods, kernel-based methods (SVM), etc.

  • It should be related to the techniques that have been covered in class

  • 2000 ~ 3000 words.

 
Note

You can find papers from recent machine learning conferences/journals: ICML, KDD, IJCAI, AAAI, NIPS, ICDM, etc. See the first lecture slides for a more complete list.

Evaluation:

  1. Paper writing quality

What to submit:
  • Review paper file