DS 644: Big Data
Fall 2023

Instructor: Usman Roshan
Office: GITC 4415
Office Hours: M 4-4:40, T 12:15-12:55, W 4-4:40, Th 4-4:40
Ph: 973-596-2872
Email: usman@njit.edu

Textbooks:
None required
Reference: Introduction to Machine Learning by Ethem Alpaydin
Grading: 35% mid-term, 65% course project
Course Overview: This course is a hands-on introduction to big data methods. We will study machine learning and deep learning methods for high-dimensional data and for large numbers of samples. We will also look at parallel and distributed methods for big data.

Course plan:

Topic
Date
Notes
Introduction
Introduction
Projects
Machine learning - linear regression and support vector machines
Linear models
Least squares notes
Least squares gradient descent algorithm

Regularization
Machine learning - running linear models in Python scikit-learn
Scikit-learn linear models
Scikit-learn support vector machines
SVM in Python scikit-learn
Breast cancer training
Breast cancer test
Linear data
Non linear data
Cross validation and balanced accuracy
Cross validation
Training vs. validation accuracy
Balanced error
Deep learning - neural networks
Multilayer perceptrons
Basic single hidden layer neural network
Back propagation

Approximations by superpositions of sigmoidal functions (Cybenko 1989)
Approximation Capabilities of Multilayer Feedforward Networks (Hornik 1991)
The Power of Depth for Feedforward Neural Networks (Eldan and Shamir 2016)
The expressive power of neural networks: A view from the width (Lu et al. 2017)

Convolution and single layer neural networks objective and optimization
Softmax and cross-entropy loss
ReLU activation single layer neural networks objective and optimization
Multilayer neural network objective and optimization

Image localization and segmentation
Deep learning - running neural networks in Scikit-learn
Scikit-learn MLPClassifier
Scikit-learn MLP code
Multiclass classification - linear models and neural networks
Multiclass classification
Different multiclass methods
One-vs-all method
Tree-based multiclass
Multiclass neural network softmax objective
Deep learning - running neural networks in Keras on tabular data
Categorical variables
One hot encoding in scikit-learn
Keras multilayer perceptron on tabular data
Keras multilayer perceptron on tabular data with feature spaces
Deep learning - Convolutions and image classification
Image classification code
Convolutions
Popular convolutions in image processing

Convolutions (additional notes)
Convolutions - example 1
Convolutions - example 2
Convolutions - example 3
Convolutions - example 4

Convolutional neural network (Additional slides by Yunzhe Xue)
Convolution and single layer neural networks objective and optimization
Training and designing convolutional neural networks

Flower image classification with CNNs code
More deep learning
Optimization in neural networks
Stochastic gradient descent pseudocode
Stochastic gradient descent (original paper)

Image classification code v2

Batch normalization
Batch normalization paper
How does batch normalization help optimization

Gradient descent optimization
An overview of gradient descent optimization algorithms

On training deep networks
The Loss Surfaces of Multilayer Networks

Common architectures

Transfer learning by Yunzhe Xue
Transfer learning in Keras
Pre-trained models in Keras

Understanding data augmentation for classification
SMOTE: Synthetic Minority Over-sampling Technique
Dataset Augmentation in Feature Space
Improved Regularization of Convolutional Neural Networks with Cutout
Document classification
Text document encoding and classification
Word2Vec paper
Word2Vec follow-up paper
Word2Vec illustration

Python regular expressions
Basic regular expressions in Python
Spam train
Spam test
Classifying spam vs non-spam documents

Fake news data (WordCNN code)
Dimensionality reduction
Data visualization with PCA
Data visualization with PCA (pdf)
Dimensionality reduction through eigenvectors
PCA example

t-SNE paper

Scikit-learn linear PCA
Scikit-learn t-SNE
PCA and t-SNE in Python scikit-learn

Supervised data visualization
Exam review sheet
Exam review sheet
Generative modeling and autoencoders
Generative modeling
Autoencoder
Parallel and distributed methods - MapReduce, GPUs, multicore CPUs
MapReduce for machine learning on multi-core

GPU coding
Parallel chi-square 2-df test
Chi-square 2-df test in parallel on a GPU

CUDA to OpenCL slides
libOpenCL.so (NVIDIA library file for OpenCL code)
Chi2 OpenCL implementation
OpenCL files

CUDA to OpenMP slides
OpenMP reference
Chi2 OpenMP implementation
Project presentations
Hannah, Meet, Kunal, Harshad, Tanmay, Adit, Jhilmit, Mehnaz, Anish, Kiran, Yashwanth, Mukul
Project presentations
Agustin, Priyam, Charan, Dhruv, Amulya, Kiran, Aakash, Rahul Charan, Sai Kiran, Sai Teja, Buddula, Venkata, Subba, Mohan, Yeshwanth