CS698 Machine Learning Final Project (Spring 2011)
Administrative
Due date: May 3, 2011
Evaluation time & date: 2pm-4pm May 9 (Monday), 2011.
I may ask selected students to discuss their results if I have any questions about
the submitted work.
Projects
You have two options for the final project. Please choose one of them.
Option 1: Data analysis + short report
Given 1000 features (X1~X1000), y is generated using a subset of them.
All students share one common training dataset with 400 samples * 1000 features (X1~X1000) and label information y. Each student is assigned a testing dataset with 100 samples * 1000 features (X1~X1000), without label information.
Some tools for your reference:
1. Weka Data Mining Software in Java, which implements many algorithms, including SVM, PCA, boosting, etc.
http://www.cs.waikato.ac.nz/~ml/weka/
2. SVM (available in various languages)
http://www.support-vector-machines.org/SVM_soft.html
http://www.kernel-machines.org/software
SVM in R:
http://cran.r-project.org/src/contrib/Descriptions/e1071.html
Regularized SVM in R (SCAD & L1 SVM):
http://cran.r-project.org/web/packages/penalizedSVM/index.html
3. Logistic Regression
Logistic regression: glm(…,family="binomial",…) function in R
Regularized logistic regression written in R
http://cran.r-project.org/web/packages/penalized/index.html
Lasso (L1) in R and MATLAB
http://www-stat.stanford.edu/~tibs/lasso.html
4. Dimension reduction: PCA package in R
http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/pca.html
In MATLAB, use the functions princomp and wmspca
5. Boosting program
http://www.cs.princeton.edu/~schapire/boost.html
Boosting methods in R
http://rss.acs.unt.edu/Rdoc/library/boost/html/00Index.html
Requirement
Data analysis:
a) Select the features Xi that cause y, based on the training dataset
b) Build a classifier using the training dataset and make predictions on the testing dataset
c) You can use any available tools or implement your own analysis algorithms
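As a rough illustration of steps a) and b), the sketch below ranks features by their absolute Pearson correlation with y, keeps the top k, and classifies with a nearest-centroid rule on the selected features. It is only a simple baseline, not a method prescribed by the course. It assumes labels in {0, 1} and data stored as lists of rows, and the function names, the value of k, and the toy data are all hypothetical. Note also that correlation with y does not by itself establish that a feature is causal.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_top_k(X, y, k):
    """Return sorted 0-based indices of the k features most correlated with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(X, y, selected):
    """Mean vector of each class (0 and 1), restricted to the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in selected]
    return centroids


def predict(X, selected, centroids):
    """Assign each sample to the class with the nearest centroid (squared Euclidean)."""
    preds = []
    for row in X:
        point = [row[j] for j in selected]
        dist = {lab: sum((p - c) ** 2 for p, c in zip(point, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


# Toy stand-in for the real data (an assumption for illustration only):
# 200 samples, 20 features, y depends on features at 0-based indices 2 and 7.
rng = random.Random(0)
X_train = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(200)]
y_train = [1 if row[2] + row[7] > 0 else 0 for row in X_train]

selected = select_top_k(X_train, y_train, k=2)
centroids = fit_centroids(X_train, y_train, selected)
train_preds = predict(X_train, selected, centroids)
accuracy = sum(p == t for p, t in zip(train_preds, y_train)) / len(y_train)
print(selected, round(accuracy, 2))
```

On the real data, k would have to be chosen (for example by cross-validation on the training set), and the accuracy printed here is training accuracy, which overstates performance on unseen samples.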
Short Report:
- Describe how you conducted your analysis and the results you obtained
- State the features you selected as causal
- Make one figure or table
- Make one argument, comment or opinion
- List three references
- <=500 words
Evaluation:
- Recall and precision of feature selection, as measured by F score
  - Recall = #correctly selected features/#total causal features
  - Precision = #correctly selected features/#total selected features
  - F score = 2*Recall*Precision/(Recall + Precision)
- Prediction accuracy of the testing dataset
- Report writing quality
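The feature selection score defined above can be computed as in the following sketch. The function name and the example feature sets are hypothetical, and returning 0 when a denominator is zero is an assumption made here, not something the grading program is known to do.

```python
def f_score(selected, causal):
    """Return (recall, precision, F score) for selected vs. truly causal features."""
    selected, causal = set(selected), set(causal)
    correct = len(selected & causal)  # number of correctly selected features
    recall = correct / len(causal) if causal else 0.0
    precision = correct / len(selected) if selected else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)


# Hypothetical example: the causal features are X1, X4, X5, X6,
# and the analysis selected X1, X4, X11 (two of three selections are correct).
recall, precision, f = f_score(selected={1, 4, 11}, causal={1, 4, 5, 6})
print(recall, round(precision, 3), round(f, 3))
```

Because the F score is the harmonic mean of recall and precision, selecting very many features to boost recall, or very few to boost precision, both lower the score.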
What to submit:
- Analysis result file, including feature selection and sample prediction.
  - Format: plain text file, with the 1st line for feature selection and the 2nd line for sample prediction.
  - 1st line: a sequence of 1000 1s and/or 0s separated by spaces for X1~X1000, with 1 and 0 denoting selected and not selected, respectively.
  - 2nd line: a sequence of 100 predicted sample labels separated by spaces for the 100 testing samples, in the given order.
  - Name your file by adding cs698_ as the prefix to the given testing file name (cs698_#.txt), e.g. cs698_1.txt.
  - Suppose you analyze 3.txt. A sample solution file that selects features X1, X4, X5, X6 and X11, and predicts Samples 1~10 and 91~100 to be 1 and all the others to be 0, is cs698_3.txt.
  - This part is graded by a computer program. It is your responsibility to put your file in the correct format so that it is parsed and graded correctly.
- Report file
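For the analysis result file, the two-line format described above can be produced with a short script such as the sketch below. The function names are hypothetical, and ending the file with a trailing newline is an assumption, since the exact expectations of the grading program are not stated here.

```python
def format_submission(selected_features, predictions, n_features=1000):
    """Build the two-line submission text.

    selected_features: 1-based feature numbers (1 means X1).
    predictions: predicted labels for the testing samples, in the given order.
    """
    selected = set(selected_features)
    line1 = " ".join("1" if j in selected else "0" for j in range(1, n_features + 1))
    line2 = " ".join(str(p) for p in predictions)
    return line1 + "\n" + line2 + "\n"


def write_submission(test_file_name, selected_features, predictions):
    """Write the submission next to the script, prefixing the name with cs698_."""
    out_name = "cs698_" + test_file_name
    with open(out_name, "w") as fh:
        fh.write(format_submission(selected_features, predictions))
    return out_name


# The example from the text: select X1, X4, X5, X6 and X11; predict 1 for
# Samples 1~10 and 91~100, and 0 for all the others.
predictions = [1 if (i <= 10 or i >= 91) else 0 for i in range(1, 101)]
text = format_submission([1, 4, 5, 6, 11], predictions)
```

Calling write_submission("3.txt", [1, 4, 5, 6, 11], predictions) would then create cs698_3.txt. Before submitting, it is worth checking that the first line has exactly 1000 entries and the second exactly 100.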
Common Training Data
Testing Data
Last Name | First Name | Testing Data
Aunsri | Nattapol | 1
Boston | Daniel | 2
Boyd | Justin | 3
Fay | Brendan | 4
Fei | Yi | 5
Guo | Wen | 6
Hu | Qingyang | 7
Hu | Weicheng | 8
Khan | Mohammad Ashrafuzzaman | 9
Lin | Yuan | 10
Ma | Xiguo | 11
Morrell | Robert | 12
Poling | David | 13
Roberts | Andrew | 14
Wang | Wei | 15
Xiong | Wei | 16
XUE | LONG | 17
YAN | Zihua | 18
Ye | Luhua | 19
Option 2: Write a review paper
Requirement
- Write a review paper to summarize recent developments in machine learning
- Focus on one topic or sub-field, e.g., feature selection, dimension reduction, tree-based methods, kernel-based methods (SVM), etc.
- It should be related to the techniques that have been covered in class
- 2000 ~ 3000 words.
Note: You can find papers in recent machine learning conferences/journals: ICML, KDD, IJCAI, AAAI, NIPS, ICDM, etc. See the first lecture slides for a more complete list.
Evaluation:
- Paper writing quality
What to submit: