CS 675: Introduction to Machine Learning
Summer 2019

Instructor: Usman Roshan
Office: GITC 4214B
Ph: 973-596-2872
Email: usman@njit.edu

Grader: Kalp Dalal
Email: kdd32@njit.edu

Textbooks:
Introduction to Machine Learning by Ethem Alpaydin (Not required but strongly recommended)
Learning with Kernels by Schölkopf and Smola (Recommended)
Foundations of Machine Learning by Mohri, Rostamizadeh, and Talwalkar (Recommended)

Grading: 20% mid-term, 30% final exam, 10% course project, 40% programming assignments

Course Overview: This course is a hands-on introduction to machine learning and contains both theory and application. We will cover classification and regression algorithms in supervised learning such as naive Bayes, nearest neighbor, decision trees, random forests, linear regression, logistic regression, neural networks, and support vector machines. We will also cover dimensionality reduction, unsupervised learning (clustering), feature selection, kernel methods, hidden Markov models, gradient descent, big data methods, and representation learning. We will apply algorithms to solve problems on real data such as digit recognition, text document classification, and prediction of cancer and molecular activity.

Course plan:

Topic	Date	Notes
Introduction, Bayesian learning, and Python		Introduction Background Basic statistics More basic probability and statistics Applied statistics Linear algebra background More linear algebra Unix and login to NJIT machines Basic Unix command sheet Instructions for AFS login Textbook reading: All of chapter 1, 2.1, 2.4, 2.5, 2.6, 2.7
Bayesian learning		Bayesian learning Bayesian decision theory example problem Textbook reading: 4.1 to 4.5, 5.1, 5.2, 5.4, 5.5
Python		Python More on Python Python cheat sheet Python practice problems Python example 1 Python example 2 Python example 3
Nearest means and naive Bayes		Nearest mean algorithm Naive Bayes algorithm Assignment 1
Kernel nearest means		Nearest means in Python (part 1) Nearest means in Python (part 2) Datasets Balanced error Balanced error in Perl Kernels More on kernels Kernel nearest means Script to compute average test error Script to compute average test error Textbook reading: 13.5, 13.6, 13.7
Separating hyperplanes and least squares		Mean balanced cross-validation error on real data Hyperplanes as classifiers Least squares Textbook reading: 10.2, 10.3, 10.6, 11.2, 11.3, 11.5, 11.7
Multi-layer perceptrons		Multi-layer perceptrons Assignment 2: Implement gradient descent for least squares Predicted labels for least squares ionosphere trainlabels.0 training, eta=.0001, stop=.001 Least squares in Perl Approximations by superpositions of sigmoidal functions (Cybenko 1989) Approximation Capabilities of Multilayer Feedforward Networks (Hornik 1991) The expressive power of neural networks: A view from the width (Lu et al. 2017)
Support vector machines		Textbook reading: 13.1 to 13.3 Support vector machines Assignment 3: Implement hinge loss gradient descent Predicted labels for hinge loss on ionosphere trainlabels.0 training, eta=.001, stop=.001 Efficiency of coordinate descent methods on huge-scale optimization problems Hardness of separating hyperplanes Learning Linear and Kernel Predictors with the 0-1 Loss Function
More on kernels		Kernels Multiple kernel learning by Lanckriet et al. Multiple kernel learning by Gonen and Alpaydin
Logistic regression		Logistic regression Solver for regularized risk minimization Textbook reading: 10.7 Assignment 4: Implement logistic discrimination algorithm Predicted labels for logistic on climate trainlabels.0 training, eta=.001, stop=.001
Empirical and regularized risk minimization		Empirical risk minimization Regularized risk minimization Regularization and overfitting Solver for regularized risk minimization
Mid-term exam review		Midterm exam review sheet
Mid-term exam	07/09/10
Feature selection		Feature selection Feature selection (additional notes) NIPS 2003 feature selection contest Contest website Challenge results Challenge results II A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets Feature selection with SVMs and F-score Ranking genomic causal variants with chi-square and SVM
Dimensionality reduction		Unsupervised dimensionality reduction Dimensionality reduction (additional notes) Proof of JL Lemma Textbook reading: Chapter 6 sections 6.1, 6.3, and 6.6 Course project Training dataset Training labels Test dataset
Dimensionality reduction		Supervised dimensionality reduction Maximum margin criterion Laplacian linear discriminant analysis Assignment 5: Adaptive step size for least squares and hinge
Decision trees, bagging, boosting, and stacking		Decision trees, bagging, boosting, and stacking Decision trees (additional notes) Ensemble methods (additional notes) Assignment 6: Implement a decision stump in Python Univariate vs. multivariate trees Gradient boosted trees: Slides by Tianqi Chen Textbook reading: Chapters 9 and 17 sections 9.2, 17.4, 17.6, 17.7
Ensemble methods, random projections, and stacking		Stacking Random projections in dimensionality reduction Assignment 7: Implement a bagged decision stump in Python
Regression		Regression Textbook reading: Chapter 4 section 4.6, Chapter 10 section 10.8, Chapter 13 section 13.10
Unsupervised learning - clustering		Clustering Assignment 8: Implement k-means clustering in Python Tutorial on spectral clustering K-means via PCA Convergence properties of k-means Textbook reading: Chapter 7 sections 7.1, 7.3, 7.7, and 7.8
Clustering
Clustering
Feature learning		Random Bits Regression: a Strong General Predictor for Big Data Learning Feature Representations with K-means Analysis of single-layer networks in unsupervised feature learning On Random Weights and Unsupervised Feature Learning Feature learning with k-means Assignment 9 (optional extra credit) Random hyperplanes Results with random hyperplanes
Hidden Markov models		Hidden Markov models Textbook reading: Chapter 15 (all of it)
Big data		Big data Mini-batch k-means Stochastic gradient descent Mapreduce for machine learning on multi-core
Comparison of classifiers and big data, ROC, multiclass, statistical significance in comparing classifiers		Comparing classifiers ROC area under curve Multiclass Statistical significance Comparison of classifiers Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? An Empirical Comparison of Supervised Learning Algorithms Statistical Comparisons of Classifiers over Multiple Data Sets
Time series data, text document classification, and other topics		Time series methods Text encoding Weekly sales transaction dataset (Time series contest) Semi-supervised and self-supervised classification Missing data (A study on missing data methods)
Some advanced topics and papers		Classification boundaries (Code) Convolutional neural networks for image recognition Gradient-based learning applied to document recognition Representation learning Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition (Cover 1965) ImageNet classification with deep neural networks (Krizhevsky et al. 2012) Random projections preserve margin Random projections preserve margin II Python Image Library
Final review		Review of most things covered in the course Final exam review sheet
Final	TBA
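
Several entries in the course plan refer to balanced error. As a minimal sketch, assuming balanced error means the average of the per-class error rates (the function name and the toy labels below are illustrative and not taken from the course materials), it could be computed in Python roughly as follows:

```python
def balanced_error(true_labels, predicted_labels):
    """Average of per-class error rates (assumed definition of balanced error)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = {}  # class -> number of examples with that true label
    errors = {}  # class -> number of those examples that were misclassified
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth != pred:
            errors[truth] = errors.get(truth, 0) + 1
    if not totals:
        raise ValueError("no labels given")
    # Error rate within each class, then an unweighted mean over classes.
    per_class = [errors.get(c, 0) / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


# Hypothetical toy labels: class 0 has 4 examples, class 1 has 2.
truth = [0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 1, 0, 1]
ber = balanced_error(truth, preds)  # (1/4 + 1/2) / 2
```

On these toy labels the plain error is 2/6, while the balanced error weights both classes equally, which matters when one class is much larger than the other.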