CS 634: Data Mining
Fall 2023

Instructor: Usman Roshan
Office: GITC 4415
Office Hours: M 3:15-3:55, T 11:30-12:10, W 3:15-3:55, Th 3:15-3:55
Ph: 973-596-2872
Email: usman@njit.edu

Textbooks:
None required
Reference: Introduction to Machine Learning by Ethem Alpaydin
Grading: 35% exam, 65% course project
Course Overview: This course will focus on methods for finding patterns in data. We will start with unsupervised methods such as k-means clustering, other clustering methods, and neural network representation learning methods. We will then look at text mining, recommender systems, self-supervised learning, and supervised methods.

Course plan:

Topic
Date
Notes
Introduction
Introduction
Projects
Unsupervised learning - k-means clustering and other clustering methods
Clustering
Clustering (pdf)

K-means via PCA
Convergence properties of k-means
Running k-means in Python scikit-learn
Scikit-learn clustering
Scikit-learn k-means
k-means in Python scikit-learn
Breast cancer training
Breast cancer test
Data visualization with principal component analysis
Data visualization with PCA
Data visualization with PCA (pdf)
Dimensionality reduction through eigenvectors
PCA example

t-SNE paper

Scikit-learn linear PCA
Scikit-learn t-SNE
PCA and t-SNE in Python scikit-learn

Supervised data visualization
Supervised learning - linear models and support vector machines
Nearest means classifier
Nearest means - effect of outliers
When does nearest means succeed and fail

Linear models
Least squares notes
Least squares gradient descent algorithm

Regularization

Scikit-learn linear models
Scikit-learn support vector machines
SVM in Python scikit-learn
Breast cancer training
Breast cancer test
Linear data
Nonlinear data

Categorical variables
One-hot encoding in scikit-learn

Multiclass classification
One-vs-all method
Cross validation and balanced accuracy
Cross validation
Training vs. validation accuracy
Balanced error
Representation learning - neural networks and autoencoders
Multilayer perceptrons
Basic single hidden layer neural network
Back propagation

Approximation by superpositions of a sigmoidal function (Cybenko 1989)
Approximation Capabilities of Multilayer Feedforward Networks (Hornik 1991)
The Power of Depth for Feedforward Neural Networks (Eldan and Shamir 2016)
The expressive power of neural networks: A view from the width (Lu et al. 2017)

Scikit-learn MLPClassifier
Scikit-learn MLP code
Keras multilayer perceptron on tabular data

Generative modeling
Autoencoder
Text mining - word and document representations, regular expressions
Text document encoding and classification
Word2Vec paper

Python regular expressions
Basic regular expressions in Python
Spam train
Spam test
Classifying spam vs non-spam documents
Exam review sheet
Exam review sheet
Self-supervised learning - learning text and image representations
Unsupervised feature learning and image retrieval with deep networks
Recommender systems - vector-space modeling, document and image similarity search, searching in representation space
Deep learning recommender systems
Netflix recommender system
Project presentations
Nandini, Anoushka, Avanthika, Dhruv
Project presentations
Amulya, Manikanta, Manideep, Balaji, Srini