CS 675

CS 675: Introduction to Machine Learning

Instructor: Usman Roshan
Email: usman@njit.edu

Textbooks:
Introduction to Machine Learning by Ethem Alpaydin (Not required but strongly recommended)
Learning with Kernels by Scholkopf and Smola (Recommended)
Foundations of Machine Learning by Mohri, Rostamizadeh, and Talwalkar (Recommended)

Grading: 20% mid-term, 30% final exam, 50% two course projects
Course Overview: This course is a hands-on introduction to machine learning and contains both theory and application. We will cover classification and regression algorithms in supervised learning such as naive Bayes, nearest neighbor, decision trees, random forests, linear regression, logistic regression, neural networks, and support vector machines. We will also cover dimensionality reduction, unsupervised learning (clustering), feature selection, kernel methods, hidden Markov models, gradient descent, big data methods, and representation learning. We will apply algorithms to solve problems on real data such as digit recognition, text document classification, and prediction of cancer and molecular activity.

Course plan:

Topic	Date	Notes
Introduction, Bayesian learning, and Python		Introduction Background Basic statistics More basic probability and statistics Applied statistics Linear algebra background More linear algebra Unix and login to NJIT machines Basic Unix command sheet Instructions for AFS login Textbook reading: All of chapter 1, 2.1, 2.4, 2.5, 2.6, 2.7
Bayesian learning		Bayesian learning Bayesian decision theory example problem Textbook reading: 4.1 to 4.5, 5.1, 5.2, 5.4, 5.5
Python		Python More on Python Python cheat sheet Python practice problems Python example 1 Python example 2 Python example 3
Nearest means and naive Bayes		Nearest mean algorithm Naive Bayes algorithm Practice problem 1 Predicted labels for naive Bayes on breast cancer trainlabels.0 mean initialized to 0.01
Kernel nearest means		Nearest means in Python (part 1) Nearest means in Python (part 2) Datasets Balanced error Balanced error in Perl Kernels More on kernels Kernel nearest means Script to compute average test error Script to compute average test error Textbook reading: 13.5, 13.6, 13.7
Separating hyperplanes and least squares		Mean balanced cross-validation error on real data Hyperplanes as classifiers Least squares Textbook reading: 10.2, 10.3, 10.6, 11.2, 11.3, 11.5, 11.7 Project 1 Project 1 template code as a starting point Datasets format for project Linear data Non-linear data Breast cancer (bc.train.0) (bc.test.0) Ionosphere (ion.train.0) (ion.test.0) Climate simulation (climate.train.0) (climate.test.0) Qsar (qsar.train.0) (qsar.test.0) Hill valley (hill_valley.train.0) (hill_valley.test.0) Micromass (micromass.train.0) (micromass.test.0)
Multi-layer perceptrons		Multi-layer perceptrons Practice problem 2: Implement gradient descent for least squares Least squares output for toy data with seed=10 Predicted labels for least squares ionosphere trainlabels.0 training, eta=.0001, stop=.001 Objective values for least squares gradient descent on ionosphere trainlabels.0 training, eta=.0001, stop=.001 Least squares in Perl Approximations by superpositions of sigmoidal functions (Cybenko 1989) Approximation Capabilities of Multilayer Feedforward Networks (Hornik 1991) The expressive power of neural networks: A view from the width (Lu et al. 2017)
Support vector machines		Textbook reading: 13.1 to 13.3 Support vector machines Practice problem 3: Implement hinge loss gradient descent Predicted labels for hinge loss on ionosphere trainlabels.0 training, eta=.001, stop=.001 Objective values for hinge loss gradient descent on ionosphere trainlabels.0 training, eta=.001, stop=.001 Efficiency of coordinate descent methods on huge-scale optimization problems Hardness of separating hyperplanes Learning Linear and Kernel Predictors with the 0-1 Loss Function
More on kernels		Kernels Multiple kernel learning by Lanckriet et al. Multiple kernel learning by Gonen and Alpaydin
Logistic regression		Regularization and overfitting Logistic regression Textbook reading: 10.7 Practice problem 4: Implement logistic discrimination algorithm Predicted labels for logistic on climate trainlabels.0 training, eta=.001, stop=.001
Empirical and regularized risk minimization		Practice problem 5: Adaptive step size for hinge loss Empirical risk minimization Regularized risk minimization Solver for regularized risk minimization Advanced topics Convexity, classification, and risk bounds Does Distributionally Robust Supervised Learning Give Robust Classifiers? Curriculum Loss: Robust Learning and Generalization against Label Corruption Adversarial Machine Learning at Scale Revisiting Adversarial Risk Distributionally Robust Optimization: A Review Workshop on Distributionally Robust Optimization Efficient Stochastic Gradient Descent for Distributionally Robust Learning
Mid-term exam review		Midterm exam review sheet
Mid-term exam
Feature selection		Feature selection Feature selection (additional notes) NIPS 2003 feature selection contest Contest website A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets Feature selection with SVMs and F-score Ranking genomic causal variants with chi-square and SVM Feature selection exercise Training dataset Training labels Test dataset Python function to cross validate linear SVM C
Dimensionality reduction		Unsupervised dimensionality reduction Dimensionality reduction (additional notes) Proof of JL Lemma Random projections in dimensionality reduction Textbook reading: Chapter 6 sections 6.1, 6.3, and 6.6
Dimensionality reduction		Supervised dimensionality reduction Maximum margin criterion Laplacian linear discriminant analysis
Decision trees, bagging, boosting, and stacking		Decision trees, bagging, boosting, and stacking Decision trees (additional notes) Ensemble methods (additional notes) Practice problem 6: Implement a decision stump in Python Neural Network Ensembles Univariate vs. multivariate trees Gradient boosted trees: Slides by Tianqi Chen Textbook reading: Chapters 9 and 17 sections 9.2, 17.4, 17.6, 17.7
Ensemble methods, random projections, and stacking		Stacking Practice problem 7: Implement a bagged decision stump in Python
Regression		Regression Textbook reading: Chapter 4 section 4.6, Chapter 10 section 10.8, Chapter 13 section 13.10
Unsupervised learning - clustering		Clustering Practice problem 8: Implement k-means clustering in Python Tutorial on spectral clustering K-means via PCA Convergence properties of k-means Textbook reading: Chapter 7 sections 7.1, 7.3, 7.7, and 7.8
Clustering
Feature learning, representation learning		Extreme learning machines Random Bits Regression: a Strong General Predictor for Big Data Exploring classification, clustering, and its limits in a compressed hidden space of a single layer neural network with random weights Learning Feature Representations with K-means Analysis of single-layer networks in unsupervised feature learning On Random Weights and Unsupervised Feature Learning A k-means based feature learning method for protein sequence classification Feature learning with k-means Project 2 Predicted labels of ionosphere on trainlabels.0 in the new feature space of 10K features (error=5.5%) Results with random hyperplanes
Time series data, text document classification, and other topics		Time series methods Time series exercise Time series exercise Weekly sales transaction dataset Text encoding Python regular expressions Perl regular expressions Word tagging with nltk Semi-supervised and self-supervised classification Missing data (A study on missing data methods)
Hidden Markov models		Hidden Markov models Textbook reading: Chapter 15 (all of it)
Big data		Big data Mini-batch k-means Stochastic gradient descent Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent Mapreduce for machine learning on multi-core
Comparison of classifiers and big data, ROC, multiclass, statistical significance in comparing classifiers		Comparing classifiers ROC area under curve Multiclass (Multiclass: one vs all) Statistical significance Comparison of classifiers Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? An Empirical Comparison of Supervised Learning Algorithms Statistical Comparisons of Classifiers over Multiple Data Sets
Robust machine learning		Classification boundaries (Code) Robust machine learning (notes in google drive folder) Extra credit assignment MNIST 0 vs 1 train (input to your program) MNIST 0 vs 1 trainlabels (input to your program) MNIST 0 vs 1 test MNIST 0 vs 1 testlabels MNIST 0 vs 1 fog corruption MNIST 0 vs 1 brightness corruption MNIST 0 vs 1 stripe corruption MNIST 0 vs 1 scale corruption MNIST 0 vs 1 translate corruption MNIST 0 vs 1 corruption labels
Final review		Review of most things covered in the course Final exam review sheet
Final	TBA
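
Several practice problems in the course plan (for example, Practice problem 2: gradient descent for least squares) label their reference outputs with an eta and a stop value. The following is a minimal sketch of what such a solution might look like, not the course's template code. The function name, the random initialization range, the toy data, and the reading of stop as a threshold on the change in the objective between iterations are all assumptions made for illustration.

```python
import numpy as np

def least_squares_gd(X, y, eta=0.0001, stop=0.001, max_iter=100000, seed=0):
    """Minimize sum((y - X w)^2) with batch gradient descent.

    X is an (n, d) array that already contains a constant column for the
    bias term; y is an (n,) array of labels in {-1, +1}.
    Assumption: iteration ends when the objective changes by at most `stop`.
    """
    rng = np.random.default_rng(seed)
    # Assumed initialization: small random weights in [-0.01, 0.01).
    w = 0.02 * rng.random(X.shape[1]) - 0.01
    prev_obj = np.sum((y - X @ w) ** 2)
    for _ in range(max_iter):
        grad = -2.0 * X.T @ (y - X @ w)   # gradient of the squared error
        w = w - eta * grad                # step of size eta downhill
        obj = np.sum((y - X @ w) ** 2)
        if abs(prev_obj - obj) <= stop:
            break
        prev_obj = obj
    return w

# Hypothetical toy data: two well-separated classes in the plane.
X_raw = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])  # append bias column

w = least_squares_gd(X, y, eta=0.01, stop=1e-6)
preds = np.sign(X @ w)   # classify by the side of the hyperplane
print(preds)
```

The step size used on the toy data is larger than the eta values listed in the table because the toy problem is tiny; on real datasets such as ionosphere a much smaller eta is typically needed to keep the updates stable.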