Given 1000 features (X1 to X1000), the label y is generated using a subset of them.
Each student has one training dataset of 400 samples × 1000 features (X1 to X1000) with label information y, and one testing dataset of 10,000 samples × 1000 features (X1 to X1000) without label information.
Training Data
Testing Data
Some tools for your reference:
1. Weka: data mining software in Java that implements many algorithms, including SVM, PCA, boosting, etc.
http://www.cs.waikato.ac.nz/~ml/weka/
2. SVM (available in various languages)
http://www.support-vector-machines.org/SVM_soft.html
http://www.kernel-machines.org/software
SVM in R:
http://cran.r-project.org/src/contrib/Descriptions/e1071.html
Regularized SVM in R (SCAD & L1 SVM):
http://cran.r-project.org/web/packages/penalizedSVM/index.html
3. Logistic regression
Logistic regression in R: the glm(…, family="binomial", …) function
Regularized logistic regression in R: glmnet
Lasso (L1) in R and MATLAB:
http://www-stat.stanford.edu/~tibs/lasso.html
4. Dimension reduction
PCA package in R:
http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/pca.html
In MATLAB, use the functions princomp and wmspca.
5. Boosting program
http://www.cs.princeton.edu/~schapire/boost.html
Boosting methods in R
http://rss.acs.unt.edu/Rdoc/library/boost/html/00Index.html
Requirement
Data analysis:
a) Select the features Xi that cause y, based on the training dataset
b) Build a classifier using the training dataset and make predictions on the testing dataset
c) You can use any available tools or choose to implement your own analysis algorithms
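Steps a) and b) can be illustrated with a minimal sketch in Python, using only the standard library. It is not the expected solution and does not use any of the tools listed above: it ranks features by absolute Pearson correlation with y (a simple univariate filter, which measures association rather than causation) and then fits a nearest-centroid classifier on the selected features. The data is a small synthetic stand-in, not the course data; the helper names (pearson, select_features, fit_centroids, predict, make_toy_data), the number of selected features k, and the 0-based feature indices are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def select_features(X, y, k):
    """Return the sorted 0-based indices of the k features most correlated with y."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    top = sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def fit_centroids(X, y, features):
    """Compute the per-class mean of the selected features."""
    centroids = {}
    for label in set(y):
        rows = [[row[j] for j in features] for row, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(X, features, centroids):
    """Assign each sample to the class whose centroid is nearest."""
    preds = []
    for row in X:
        point = [row[j] for j in features]
        nearest = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        preds.append(nearest)
    return preds


def make_toy_data(n, p, causal, rng):
    """Synthetic stand-in: y is 1 when the sum of the causal features is positive."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [1 if sum(row[j] for j in causal) > 0 else 0 for row in X]
    return X, y


rng = random.Random(0)
causal = [2, 5]  # hypothetical causal features (0-based) of the toy data
X_train, y_train = make_toy_data(400, 20, causal, rng)
X_test, y_test = make_toy_data(1000, 20, causal, rng)

selected = select_features(X_train, y_train, k=2)
centroids = fit_centroids(X_train, y_train, selected)
y_pred = predict(X_test, selected, centroids)
accuracy = sum(a == b for a, b in zip(y_pred, y_test)) / len(y_test)

print("selected features (0-based):", selected)
print("toy test accuracy:", accuracy)
```

In the real assignment the testing labels are not available, so accuracy would have to be estimated on held-out training samples (for example by cross-validation), and the number of features to keep would need to be chosen rather than assumed.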
Short Report:
- Describe how you conducted your analysis and the results you obtained
- State the features you selected as causal
- Make one figure or table
- Make one argument, comment, or opinion
- List three references
- <=500 words
Evaluation:
- Recall and precision of feature selection
  - Recall = #correctly selected features / #total causal features
  - Precision = #correctly selected features / #total selected features
- Prediction accuracy on the testing dataset
- Report writing quality
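The recall, precision, and accuracy measures above can be sketched in a few lines of Python. This is only an illustration of the formulas, not the grading program: the function names are hypothetical, the list of causal features is made up (the true list is not given here), and accuracy is assumed to mean the fraction of testing samples predicted correctly. The selected set reuses the example X4, X6, X8, X11, X16 from the submission instructions.

```python
def feature_selection_scores(selected, causal):
    """Return (recall, precision) of a feature selection.

    recall    = #correctly selected features / #total causal features
    precision = #correctly selected features / #total selected features
    """
    selected, causal = set(selected), set(causal)
    correct = len(selected & causal)
    recall = correct / len(causal) if causal else 0.0
    precision = correct / len(selected) if selected else 0.0
    return recall, precision


def prediction_accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


selected = [4, 6, 8, 11, 16]      # X4, X6, X8, X11, X16
causal = [4, 8, 15, 16, 23, 42]   # hypothetical ground truth, for illustration only

recall, precision = feature_selection_scores(selected, causal)
print(recall, precision)          # 0.5 0.6 (3 of 6 causal found; 3 of 5 selected are causal)

acc = prediction_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)                        # 0.75
```

Selecting more features can only raise recall but tends to lower precision, so the two numbers are best read together.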
What to submit:
- Analysis result files, including feature selection and sample prediction.
  - File 1, named YourLastName_SelectedXs.txt; see a sample file here, which means that you selected X4, X6, X8, X11, and X16.
  - File 2, named YourLastName_PredictionY.txt; see a sample file here.
  - This part is graded by a computer program. It is your responsibility to put your files in the correct format so that they are parsed and graded correctly.
- Report file (hard copy and electronic copy)