CS698 Machine Learning Final Project (Spring 2011)

Administrative

Due date: May 3, 2011

Evaluation time & date: 2pm-4pm, May 9 (Monday), 2011. I may ask selected students to discuss their results if I have any questions about the submitted work.

Projects

You have two options for the final project. Please choose one of them.

Option 1: Data analysis + short report

Given 1000 features (X₁~ X₁₀₀₀), the label y is generated using a subset of them.

All students share one common training dataset of 400 samples × 1000 features (X₁~ X₁₀₀₀) with label information y. Each student is also assigned a testing dataset of 100 samples × 1000 features (X₁~ X₁₀₀₀), without label information.

Some tools for your reference:
1. Weka data mining software in Java, which implements many algorithms, including SVM, PCA, boosting, etc.
http://www.cs.waikato.ac.nz/~ml/weka/

2. SVM (available in various languages)
http://www.support-vector-machines.org/SVM_soft.html
http://www.kernel-machines.org/software
SVM in R: http://cran.r-project.org/src/contrib/Descriptions/e1071.html
Regularized SVM in R (SCAD & L1 SVM):
http://cran.r-project.org/web/packages/penalizedSVM/index.html

3. Logistic Regression

Logistic regression: glm(…,family="binomial",…) function in R

Regularized Logistic Regression written in R
http://cran.r-project.org/web/packages/penalized/index.html
Lasso (L1) in R and MATLAB
http://www-stat.stanford.edu/~tibs/lasso.html

4. Dimension reduction: PCA package in R
http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/pca.html
In MATLAB, use the functions princomp and wmspca

5. Boosting program
http://www.cs.princeton.edu/~schapire/boost.html
Boosting methods in R
http://rss.acs.unt.edu/Rdoc/library/boost/html/00Index.html

Requirement

Data analysis:

a) Select the features X_i that cause y, based on the training dataset

b) Build a classifier using the training dataset and make predictions on the testing dataset

c) You can use any available tools or choose to implement your own analysis algorithms
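As an illustration of steps a) and b), here is a minimal sketch of one possible baseline: a univariate filter that ranks features by how far apart their class means are, followed by a nearest-centroid classifier on the selected features. It is not the required or recommended method. The function names, the synthetic data, the 0/1 label coding, and the choice of k are all assumptions made for the example, and feature indices in the code are 0-based.

```python
import random
import statistics


def score_features(X, y):
    """Score each feature by the absolute difference of its class means,
    scaled by its overall standard deviation (a simple univariate filter)."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        spread = statistics.pstdev(col) or 1.0  # guard against constant columns
        scores.append(abs(statistics.mean(pos) - statistics.mean(neg)) / spread)
    return scores


def select_top_k(scores, k):
    """Return the sorted 0-based indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(X, y, selected):
    """Compute the per-class mean vector over the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.mean(r[j] for r in rows) for j in selected]
    return centroids


def predict(X, selected, centroids):
    """Assign each sample to the class with the nearest centroid."""
    preds = []
    for row in X:
        dist = {
            label: sum((row[j] - c) ** 2 for j, c in zip(selected, centre))
            for label, centre in centroids.items()
        }
        preds.append(min(dist, key=dist.get))
    return preds


# Synthetic stand-in for the course data (assumption: labels are 0/1).
rng = random.Random(0)
n_samples, n_features = 200, 50
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [1 if row[3] + row[7] > 0 else 0 for row in X]  # features 3 and 7 are causal

scores = score_features(X, y)
selected = select_top_k(scores, 2)  # k = 2 is known here; on real data it is not
centroids = fit_centroids(X, y, selected)
train_preds = predict(X, selected, centroids)
accuracy = sum(p == t for p, t in zip(train_preds, y)) / len(y)
print("selected:", selected, "training accuracy:", accuracy)
```

On the real data the number of causal features is unknown, so k (or a score threshold) would have to be chosen by something like cross-validation, and training accuracy should not be read as an estimate of test accuracy.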

Short Report:

Describe how you conducted your analysis and the results you obtained

State the features you select to be causal

Make one figure or table

Make one argument, comment or opinion

List three references

<=500 words

Evaluation:

Recall and precision of feature selection as measured by F score

Recall = #correctly selected features/#total causal features
Precision = #correctly selected features/#total selected features
F score = 2*Recall*Precision/(Recall + Precision)
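The three formulas above can be sketched as a small helper. This is only an illustration of the arithmetic, not the actual grading program; the function name and the example sets of causal features are hypothetical, and returning 0 when a denominator is zero is an assumption.

```python
def f_score(selected, causal):
    """Return (recall, precision, F score) for a set of selected features
    against the set of truly causal features."""
    selected, causal = set(selected), set(causal)
    correct = len(selected & causal)  # correctly selected features
    recall = correct / len(causal) if causal else 0.0
    precision = correct / len(selected) if selected else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0  # assumption: define F as 0 here
    return recall, precision, 2 * recall * precision / (recall + precision)


# Hypothetical example: 5 features selected, 4 truly causal, 3 in common.
recall, precision, f = f_score({1, 4, 5, 6, 11}, {1, 4, 5, 9})
print(recall, precision, f)
```

In this example Recall = 3/4, Precision = 3/5, and F score = 2 × 0.75 × 0.6 / 1.35 = 2/3.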

Prediction accuracy of the testing dataset

Report writing quality

What to submit:

Analysis result file including feature selection and sample prediction.

Format: plain text file with the 1st line for feature selection and 2nd line for sample prediction
1st line: a sequence of 1000 1s and/or 0s separated by space for X₁~ X₁₀₀₀, with 1 and 0 denoting selected and not-selected, respectively.
2nd line: a sequence of 100 predicted sample labels separated by space for the 100 testing samples in the given order.
Name your file by adding cs698_ as the prefix to the given testing file name (cs698_#.txt), e.g. cs698_1.txt.
For example, suppose you analyze 3.txt, select features X1, X4, X5, X6 and X11, and predict Samples 1~10 and 91~100 to be 1 and all the others to be 0. A sample solution file for this case is cs698_3.txt.
This part is graded by a computer program. It is your responsibility to put your file in the correct format so that it is parsed and graded correctly.
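The two-line format described above could be produced with a short script such as the following sketch. The function names are hypothetical, labels are assumed to be 0/1 as in the example, and the trailing newline after the second line is an assumption about what the grading program accepts.

```python
def format_submission(selected_features, predictions, n_features=1000):
    """Build the two-line result text.

    selected_features: 1-based feature indices (X1 is index 1).
    predictions: predicted labels for the testing samples, in the given order.
    """
    chosen = set(selected_features)
    line1 = " ".join("1" if i in chosen else "0" for i in range(1, n_features + 1))
    line2 = " ".join(str(int(p)) for p in predictions)
    return line1 + "\n" + line2 + "\n"


def write_submission(path, text):
    """Write the result text to a plain text file."""
    with open(path, "w") as fh:
        fh.write(text)


# The example from the text: features X1, X4, X5, X6, X11 selected;
# Samples 1~10 and 91~100 predicted as 1, all the others as 0.
predictions = [1 if (s <= 10 or s >= 91) else 0 for s in range(1, 101)]
text = format_submission([1, 4, 5, 6, 11], predictions)
lines = text.splitlines()
print(lines[0][:21])  # the first 11 feature flags
# write_submission("cs698_3.txt", text)  # uncomment to create the file
```

Checking the line count and the number of entries per line before submitting is a cheap way to catch formatting mistakes.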

Report file

Common Training Data

Testing Data

Last Name	First Name	Testing Data
Aunsri	Nattapol	1
Boston	Daniel	2
Boyd	Justin	3
Fay	Brendan	4
Fei	Yi	5
Guo	Wen	6
Hu	Qingyang	7
Hu	Weicheng	8
Khan	Mohammad Ashrafuzzaman	9
Lin	Yuan	10
Ma	Xiguo	11
Morrell	Robert	12
Poling	David	13
Roberts	Andrew	14
Wang	Wei	15
Xiong	Wei	16
XUE	LONG	17
YAN	Zihua	18
Ye	Luhua	19

Option 2: Write a review paper

Requirement

Write a review paper to summarize recent developments in machine learning

Focus on one topic or sub-field, e.g., feature selection, dimension reduction, tree-based methods, kernel-based methods (SVM), etc.

It should be related to the techniques that have been covered in class

2000 ~ 3000 words.

Note
You can find papers from recent machine learning conferences/journals: ICML, KDD, IJCAI, AAAI, NIPS, ICDM, etc. See the first lecture slides for a more complete list.

Evaluation:

Paper writing quality

What to submit:

Review paper file