Implement stochastic gradient descent (SGD) in the backpropagation program you wrote in assignment 3. In the original SGD algorithm we update the weights using the gradient of a single datapoint:

SGD algorithm:
    Initialize random weights
    for k = 0 to n_epochs - 1:
        Shuffle the rows (or row indices)
        for j = 0 to rows - 1:
            Determine gradient using just the jth datapoint
            Update weights with gradient
        Recalculate objective

We will modify this into the mini-batch version and implement it for this assignment. Note that below the batch size is written as b, to avoid reusing k, which is already the epoch counter.

I. Mini-batch SGD algorithm:
    Initialize random weights
    for k = 0 to n_epochs - 1:
        Shuffle the rows (or row indices)
        for j = 0 to rows/b - 1:
            Select the b datapoints shuffled_data[j*b:(j+1)*b], where b is the batch size
            Determine gradient using just the selected b datapoints
            Update weights with gradient
        Recalculate objective   # Optional step
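As a point of reference, the mini-batch loop above can be sketched in Python as follows. This is only an illustrative sketch, not the assignment solution: it uses a linear least-squares objective as a stand-in for the network's loss, and the function name, learning rate, and synthetic data are all assumptions for the example.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, n_epochs=100, batch_size=4, seed=0):
    """Mini-batch SGD on the least-squares objective 0.5*||Xw - y||^2 / n.

    In the actual assignment the gradient would come from backpropagation
    through the network; here a closed-form gradient keeps the sketch short.
    """
    rng = np.random.default_rng(seed)
    rows, cols = X.shape
    w = rng.normal(scale=0.1, size=cols)        # initialize random weights
    for _ in range(n_epochs):
        idx = rng.permutation(rows)             # shuffle the row indices
        for j in range(rows // batch_size):
            batch = idx[j * batch_size:(j + 1) * batch_size]
            Xb, yb = X[batch], y[batch]
            # gradient using just the selected batch_size datapoints
            grad = Xb.T @ (Xb @ w - yb) / batch_size
            w -= lr * grad                      # update weights with gradient
        # optional step: recalculate the objective here to monitor progress
    return w

# Usage: recover the true weights [2, -1] from noiseless synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0])
w = minibatch_sgd(X, y)
```

Shuffling the indices rather than the data itself avoids copying the dataset each epoch; slicing the shuffled index array then picks out each batch in turn.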