When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Important: Failure to do any of the following will result in lost points:
Submit one PDF file and one notebook per group
Enter all the group members in your PDF Gradescope submission, not just in your documents. Look at this Piazza Post if you don't know how to do this
Match each answer (not question!) with the appropriate page in Gradescope
Avoid large blank spaces in your PDF file
Except for part 4, you may only import the numpy
, matplotlib
, and, if needed math
modules for this assignment.
Let
$$ \mathbf{x} = [1, 0, 2, -1, 3]^T \;\;\;\text{and}\;\;\; \mathbf{g} = [-1, 2, 1]^T\;. $$
Give the valid correlation $\mathbf{z}_v$ and the shape-preserving correlation $\mathbf{z}_s$ of signal $\mathbf{x}$ with kernel $\mathbf{g}$.
Using the values in the previous problem, give the matrices $V_v$ and $V_s$ such that
$$ \mathbf{z}_v = V_v \mathbf{x} \;\;\;\text{and}\;\;\; \mathbf{z}_s = V_s \mathbf{x} \;. $$
Let
$$ X = \left[\begin{array}{rrrrr} 2 & 1 & 3 & -1 & 4 \\ 0 & -2 & 1 & 0 & 3 \end{array}\right] \;\;\;\text{and}\;\;\; G = \left[\begin{array}{rr} 1 & 2 \\ -3 & 0 \end{array}\right]\;. $$
Give the valid correlation $Z_v$ and the shape-preserving correlation $Z_s$ of image $X$ with kernel $G$.
Give the image $Z_v$ as defined in the previous problem, except that stride 2 is used.
The code below implements a generic Layer
class for a neural network, and a fully connected layer class FCLayer
.
This is a rather stripped down and somewhat inefficient implementation, introduced here for pedagogical purposes. This implementation is described next. Please examine the code as you read the description. It is unlikely that this description will make sense to you unless you have read and understood the class notes on the architecture of neural nets and on back-propagation.
The function tensorize
is introduced as a way to cope with the poor way in which numpy
handles matrix products, now that the matrix
data type is being phased out. Tensorizing arguments passed to the various methods of the implementation ensures that they are floating-point numpy
arrays with no singleton dimensions. Singleton dimensions are array shape components equal to 1. For instance, the second dimension of the array np.zeros((3, 1))
is a singleton dimension. The array is equal to array([[0.], [0.], [0.]])
and tensorize
converts it to np.zeros(3)
, which is equal to array([0., 0., 0.])
. With this change, the operation np.dot(a, b)
is equivalent to matrix multiplication of a
and b
whenever a
and b
are tensorized and their matrix product is well defined (after tensorization). This is sufficient for working with Jacobians.
The Layer
class implements methods that all layers either use as is or override. These methods include __init__
, which requires a minimum of methods for initialization, reset
, which does nothing in its generic version, getWeights
and setWeights
, which respectively retrieve or set all the parameters in the layer, and dfdw
, which computes the Jacobian of the output of the layer relative to its parameters.
The method dfdw
is provided as a default for layers that have no parameters. In that case, the method returns an empty array. Layers with parameters must override this method.
The attribute parms
of a layer is a list of parameter arrays. For instance, a FCLayer
stores the gain matrix V
and bias vector b
in its parms
attribute. Regardless of the number and shapes of the parameters in parms
, the method getWeights
always returns the parameters as a single vector, obtained by concatenating the flattened versions of all the parameters. For instance, if V
is a $2\times 3$ array and b
is a vector with two biases, then getWeights
returns a single vector with 8 values.
The attribute x
is initially set to None
. It is important to note that every layer remembers its input in this attribute. In this way, when the methods dfdw
and dfdx
are called to compute the Jacobian of a layer's output relative to either parameter vectors w
or input x
, the value of the input is known.
The method setWeights
takes a vector of parameters, assumed to have the correct number of elements, and reshapes it into a list of appropriately sized entries, which it binds to the parms
attribute of the layer.
The class FCLayer
implements the affine part of a fully connected layer (that is, it does not include the nonlinearity, which is a separate layer). This class overrides __init__
, dfdw
, and reset
, and provides an additional method ofShape
.
The __init__
method of FCLayer
takes arguments V
and b
(a gain array and a bias vector) and initializes the layer. It also defines a method f
, which takes a vector x
and returns the output of the layer, and a method dfdx
, which returns the Jacobian of the layer output relative to its input. If there are $e$ outputs and $d$ inputs, the Jacobian is an $e\times d$ array, as in the class notes. As mentioned above, f
stores its input in the layer's x
attribute. The __init__
method of FCLayer
calls the generic Layer.__init__
method, to ensure consistency across layers.
The ofShape
method allows initializing a fully-connected layer by specifying only the number m
of inputs and the number n
of outputs. The call FCLayer.ofShape(m, n)
initializes V
to an $n\times m$ array of random numbers, and b
to a $n$-dimensional vector of zeros.
The reset
method resets all parameters as if the ofShape
method were called.
The dfdw
method is a function that computes the Jacobian of the layer's output relative to its weights. If there are $e$ outputs and $k$ weights, the Jacobian is an $e\times k$ array, as in the class notes.
Study this code before you solve the problems in this part, as these problems ask you to implement additional layers in the same style.
import numpy as np
def tensorize(x):
return np.squeeze(np.asfarray(x))
class Layer:
def __init__(self, parms, f, dfdx):
self.parms = [tensorize(p) for p in parms]
self.f = f
self.dfdx = dfdx
self.x = None
def reset(self, r=None):
self.x = None
def getWeights(self):
if len(self.parms) == 0:
return []
else:
return np.concatenate([p.flatten() for p in self.parms])
def setWeights(self, w):
if len(w) > 0:
w = tensorize(w)
for k in range(len(self.parms)):
s = self.parms[k].shape
n = 1 if len(s) == 0 else np.prod(s)
self.parms[k] = np.reshape(w[:n], s)
w = w[n:]
def dfdw(self):
assert self.x is not None, 'dfdw called before f'
return np.empty((len(self.x), 0))
class FCLayer(Layer):
def __init__(self, V, b):
V, b = tensorize(V), tensorize(b)
def f(x):
self.x = tensorize(x)
return np.dot(self.parms[0], self.x) + self.parms[1]
def dfdx():
assert self.x is not None, 'dfdx called before f'
return self.parms[0]
Layer.__init__(self, [V, b], f, dfdx)
def dfdw(self):
assert self.x is not None, 'dfdw called before f'
m, n = self.parms[0].shape
D = np.zeros((m, m * (n + 1)))
js, je = 0, n
for i in range(m):
D[i][js:je] = self.x
js, je = js + n, je + n
D[:, (m * n):] = np.diag(np.ones(m))
return D
def __initialWeights(m, n, r=None):
if r is None:
r = np.sqrt(2/m) # Formula by He et al.
V = np.random.randn(n, m) * r
b = np.zeros(n)
return V, b
@classmethod
def ofShape(cls, m, n, r=None):
V, b = FCLayer.__initialWeights(m, n, r)
return cls(V, b)
def reset(self, r=None):
self.x = None
n, m = self.parms[0].shape
V, b = FCLayer.__initialWeights(m, n, r)
self.parms = [V, b]
The ReLU has no learnable parameters, and therefore the Jacobian of its output with respect to the parameter vector is empty. Because of this, a ReLULayer
class can inherit dfdw
from Layer
.
If the input to a ReLU is a $d$-dimensional vector, its output is a $d$-dimensional vector as well. Therefore, the Jacobian of the ReLU output with respect to its input is a $d\times d$ array.
Write an explicit formula for the Jacobian matrix $J_{\rho}$ of $$ \mathbf{z} = \rho(\mathbf{x}) $$ where $\rho$ is the ReLU and $\mathbf{x} = [a, b]^T$.
"Explicit formula" here means that you are supposed to write a matrix, with a formula for each of its entries. To help you understand the format, the formula for the first entry is
$$ \text{step}(a) $$ where $$ \text{step}(a) = \frac{1}{2}(1 + \text{sign}(a)) \;. $$
What happens for $a=0$ or $b=0$ is not important.
The code below is a partial implementation of a ReLU layer. Replace the two pass
commands with code so that the ReLULayer
class works correctly.
class ReLULayer(Layer):
def __init__(self):
def f(x):
self.x = tensorize(x)
pass
def dfdx():
assert self.x is not None, 'dfdx called before f'
pass
Layer.__init__(self, [], f, dfdx)
Remember that dfdx
will not work unless you first call f(x)
, which remembers the input x
.
Your code should at the very least result in a correct plot in the test below. Keep this code in your document, so we see the plot. Keep in mind that the fact that the plot is correct is a necessary but not sufficient condition for your code to be correct.
try:
xs = np.linspace(-1, 1, 101)
relu = ReLULayer()
out = [(relu.f(x), relu.dfdx()) for x in xs]
y = [a[0] for a in out]
dydx = [a[1] for a in out]
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.plot(xs, y, label='ReLU')
plt.plot(xs, dydx, label='ReLU derivative')
plt.legend()
plt.show()
except:
pass
The softmax
function
$$ \sigma(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_{k=0}^{d-1} e^{x_k}} $$
has no parameters, so a class SoftmaxLayer
that implements it can inherit dfdw
from Layer
.
The softmax
is a function $\mathbb{R}^d\rightarrow\mathbb{R}^d$, and its Jacobian $J_{\sigma}$ with respect to $\mathbf{x}$ is therefore $d\times d$.
Show that if $\mathbf{s} = \sigma(\mathbf{x})$ then $$ J_{\sigma} = \text{diag}(\mathbf{s}) - \mathbf{s}\, \mathbf{s}^T $$ where $\text{diag}(\mathbf{s})$ is a square matrix with the entries of $\mathbf{s}$ on its main diagonal.
Hint: Compute a formula for a generic entry on the diagonal and one for a generic entry off the diagonal.
Show that if $c$ is any real number and $\sigma(\mathbf{x})$ is the softmax
function, then
$$
\sigma(\mathbf{x} - c) = \sigma(\mathbf{x})\;.
$$
Subtracting a scalar from a vector subtracts the scalar from each entry of the vector.
The code below is a partial implementation of a softmax
layer. Replace the pass
command with code so that the SoftmaxLayer
class works correctly.
The function f
is fully implemented. The subtraction of np.max(x)
from x
in __softmax
does not change the result (from the previous problem), and ensures that the exponents do not grow too large, thereby avoiding numerical overflow.
class SoftmaxLayer(Layer):
def __softmax(x):
e = np.exp(x - np.max(x))
return e / np.sum(e)
def __init__(self, n):
def f(x):
self.x = tensorize(x)
return SoftmaxLayer.__softmax(self.x)
def dfdx():
assert self.x is not None, 'dfdx called before f'
s = SoftmaxLayer.__softmax(self.x)
pass
Layer.__init__(self, [], f, dfdx)
Similar remarks as for ReLULayer
concerning the test code below.
try:
ps = np.linspace(-2, 2, 101)
q = 0
softmax = SoftmaxLayer(2)
out = [(softmax.f((p, q)), softmax.dfdx()) for p in ps]
y0 = [a[0][0] for a in out]
y1 = [a[0][1] for a in out]
dydx00 = [a[1][0, 0] for a in out]
dydx01 = [a[1][0, 1] for a in out]
plt.figure()
plt.plot(ps, y0, label='y0')
plt.plot(ps, y1, label='y1')
plt.plot(ps, dydx00, label='dy0/dx0')
plt.plot(ps, dydx01, label='dy0/dx1')
plt.xlabel('x0')
plt.legend()
plt.show()
except:
pass
The code below is a partial implementation of a Loss
class that embodies the cross-entropy loss and its Jacobian. Replace the two pass
commands with code so that the Loss
class works correctly.
The loss function is not part of the network, so this class is not a Layer
.
For consistency with the lecture notes, the second argument to the loss function is called p
, and is the output from the network's softmax
layer. The first argument is y
, the true label.
This class still conforms to the convention that method f
remembers its input p
in self.p
, for later use by dfdx
. The true label y
is not remembered across calls.
Use the natural logarithm for the cross-entropy loss.
The value of py
is clipped to self.small
to prevent numerical overflow.
class Loss:
def __init__(self):
self.small = 1e-8
def f(self, y, p):
self.p = tensorize(p)
py = self.p[int(y)]
if py < self.small: py = self.small
pass
def dfdx(self, y):
assert self.p is not None, 'dfdx called before f'
y = int(y)
d = np.zeros(len(self.p))
py = self.p[y]
if py < self.small: py = self.small
pass
Similar remarks as for ReLULayer
concerning the test code below.
try:
loss = Loss()
l, d, y = [], [], 0
xs = np.linspace(0, 1, 301)
for x in xs:
p = (x, 1-x)
l.append(loss.f(y, p))
dfdx = loss.dfdx(y)
if dfdx is not None: d.append(dfdx[y])
plt.figure()
plt.plot(xs, l, label='loss')
plt.plot(xs, d, label='loss derivative')
plt.ylim((-20, 20))
plt.legend()
plt.show()
except:
pass
This part uses the layers developed in the previous part to build a neural network. The network is limited, in that it is made of fully connected layers, ReLUs, and a softmax
function. Your job is to implement back-propagation and stochastic gradient descent for this network.
The code below is a partial implementation of a neural network. The only missing code is part of the backprop
method that implements back-propagation. As usual, this part is replaced by a pass
statement. Write code in place of this pass
so that backprop
computes its output by back-propagation.
The method takes a single training data input x
, its corresponding true label y
, and the instance loss
of class Loss
.
The method returns a pair. The first element is the value L
of the loss, and the second is the gradient of the loss with respect to a vector containing all the network parameters. The entries in this vector are in the order in which they would be returned by the network's getWeights
method.
The one instruction already implemented in backprop
is the forward pass through the network. This will store all layer inputs in the appropriate places.
The lecture notes on training neural nets by back-propagation mention two Jacobians matrices in equation (5). These matrices are respectively layer.dfdw()
and layer.dfdx()
if layer
is the current layer (layer number $k$ in the notes). Note that $\mathbf{x}^{(k)}$ is the output from that layer in the lecture notes, and the input is $\mathbf{x}^{(k-1)}$.
Rewrite the equations for back-propagation (equations (4), (2), (3), used in this order) in terms of the code notation before attempting to write code. Do not hand in your re-writing.
Remember that some layers do not have parameters, so there is no point in recording the corresponding Jacobians. The simplest way to tell that there are no parameters is to check that the result of calling layer.dfdw()
has size zero.
This function takes some thinking to write, but is rather simple. My replacement for the pass
is eight lines of code.
class Network:
def __init__(self, sizes):
self.layers = []
for i in range(len(sizes) - 1):
self.layers.append(FCLayer.ofShape(sizes[i], sizes[i+1]))
self.layers.append(ReLULayer())
self.layers.append(SoftmaxLayer(sizes[-1]))
self.p = None
def reset(self, r=None):
for layer in self.layers: layer.reset(r)
self.p = None
def getWeights(self):
return np.concatenate([layer.getWeights() for layer in self.layers])
def setWeights(self, w):
for layer in self.layers:
n = len(layer.getWeights())
layer.setWeights(w[:n])
w = w[n:]
def f(self, x):
x = tensorize(x)
for layer in self.layers: x = layer.f(x)
self.p = x
return self.p
def backprop(self, x, y, loss):
L = loss.f(y, self.f(x))
pass
Similar remarks as for ReLULayer
concerning the test code below.
This code tests back-propagation for a tiny network and on a single input x
. The weights are set to simple values (see the definition of w
at the top), so you can also check things by hand when you debug your code.
The gradient returned by backprop
is compared against the gradient computed numerically, using the Jacobian
function from a previous assignment. This function would be prohibitively slow to use for larger networks.
The "Gradient gap" printed by the test code is the norm of the difference between the two results, and should be much smaller than $10^{-5}$.
try:
net = Network((3, 2, 3))
L, x, y, w = Loss(), [0.1, -0.2, 0.3], 0, range(len(net.getWeights()))
net.setWeights(w)
def Lw(w):
net.setWeights(w)
return L.f(y, net.f(x))
def Jacobian(f, z, delta=1e-5):
return np.transpose([(f(z + delta * e) - f(z - delta * e)) / (2 * delta)\
for e in np.eye(len(z))])
net.setWeights(w)
backprop = net.backprop(x, y, L)
numerical = Jacobian(Lw, w)
backprop = backprop[1]
print('Gradient gap for Network:', np.linalg.norm(backprop - numerical))
except:
pass
Write a function with header
def sgd(net, loss, T, batch_size=1, max_iter=1, learning_rate_init=1e-3,
tol=1e-6, n_iter_no_change=10):
that uses your Network.backprop
function to implement stochastic gradient descent without momentum:
$$ \mathbf{w}_{t+1} = \mathbf{w}_t - r \nabla\mathbf{w}_t \;. $$
where the gradient is computed on mini-batches.
Place your code in a separate notebook cell before the test code.
net
is an instance of your Network
class, loss
is an instance of your Loss
class, and T
is a training set specified as a dictionary with keywords 'x'
and 'y'
(you will use the provided training set T
)
The keyword parameters batch_size
, max_iter
, learning_rate_init
, tol
, and iter_no_change
have the same meaning as the corresponding parameters in sklearn.neural_network.MLPClassifier
. Refer to the manual page for that function to determine the meaning of these parameters, so you know how to implement their funcitonality.
The learning rate $r$ is equal to learning_rate_init
and never changes.
The function sgd
does not initialize the network parameters. Creating an instance of Network
already does that.
Your function should return a list or array of the training risk $L_T$ at the end of every epoch, which the test code will plot.
Your function should scramble the training data randomly once, at the beginning. Do not re-scramble in each epoch. You may either scramble the data itself, or create a scrambled list or array of indices into it.
The initial value of $\mathbf{w}$ can be read with the call net.getWeights()
, and the updated value can be written into net
with the call net.setWeights(w)
.
A simple way to generate an array of scrambled indices into T
is as follows:
n = len(T['y'])
idx = np.argsort(np.random.random(n))
The simplest way to access the mini-batches in sequence is to use a Python generator. If you don't know what that means, just write ordinary code with a for
loop.
A risk is an average loss, so remember to normalize properly.
What the MLPClassifier
calls an "iteration" is an epoch, that is, a single run through all the mini-batches from the training set.
The implementation of Network
is inefficient, and this is why the test code only runs 200 epochs of training. You should not expect training to achieve a small risk at the end of this small number of iterations. If you are curious, run your code longer, but do not turn in the result of that.
Since it takes a couple of minutes for the test code to run, you may want to add some progress report instructions to your code, so you see what is going on. However, this is not required.
The test code assumes that the files x
and y
provided with this assignment are in the same directory as your notebook.
The network created in the test code is very similar, but not identical, to the network you will work with in part 4.
In this part, you will use the simple MLPClassifier
neural network made available in sklearn.neural_network
. This implementation is much more efficient than the one you developed in the parts above. In addition, using pre-made code makes sure you can solve the problems below even if your earlier code does not work.
This part uses the state-of-the-art Adam optimizer to train a network. Adam is a variant of Stochastic Gradient Descent (SGD) that evaluates the step size adaptively. Convergence is typically much faster than with standard SGD, even if the Adam code is quite simple. See this ArXiV paper if you are interested in the details.
Assuming that the appropriate files are in the same directory as your notebook, the following code loads some simple training and test datasets.
The data space $X$ is two-dimensional, so we can draw the decision regions by the usual method of classifying every point on a fine grid. The true decision regions are displayed below.
def readArray(filename):
with open(filename, 'r') as file:
X = np.array([[float(a) for a in line.strip().split()] for line in file])
return X
X = readArray('x')
y = readArray('y').flatten()
XTest = readArray('xTest')
yTest = readArray('yTest').flatten()
T = {'x': X, 'y': y}
S = {'x': XTest, 'y': yTest, 'shape': (50, 50)}
SGrid = {'x0': S['x'][:, 0].reshape(S['shape']),
'x1': S['x'][:, 1].reshape(S['shape']),
'y': S['y'].reshape(S['shape'])}
def showRegions(grid, title, s = None):
if s:
plt.subplot(2, 3, s)
else:
plt.figure()
plt.contourf(grid['x0'], grid['x1'], grid['y'],
cmap=plt.cm.RdBu, alpha=0.3)
plt.axis('equal')
plt.axis('off')
plt.title(title)
plt.figure(1)
showRegions(SGrid, 'True Regions')
The code below runs exactly the same training procedure on exactly the same data six times, each time using a different random seed for initialization. The resulting decision regions are displayed. The code may take a minute or so to run on a laptop.
from sklearn.neural_network import MLPClassifier
layerSizes = (16, 16, 2)
SHatGrid = SGrid
state = (2, 3, 6, 12, 13, 9)
lossCurve = []
plt.figure(figsize=(10,10))
for k in range(len(state)):
network = MLPClassifier(hidden_layer_sizes = layerSizes, tol=1e-6,
max_iter = 10000, random_state=state[k], verbose=False)
network.fit(T['x'], T['y'])
lossCurve.append(network.loss_curve_)
SHatGrid['y'] = network.predict(S['x']).reshape(S['shape'])
showRegions(SHatGrid, 'Predicted Regions ' + str(state[k]), k+1)
plt.show()
Comment briefly on the importance of initialization when training a neural network. Why is this an issue in terms of advancing our empirical understanding of neural networks? Keep in mind that for larger networks training can take days or weeks, even on a cluster of GPUs.
The following code displays the histories of the training risk recorded in the previous experiment.
plt.figure()
for k in range(len(lossCurve)):
plt.plot(lossCurve[k], label=str(state[k]))
plt.xlabel('epoch')
plt.ylabel('training risk')
plt.legend()
plt.show()
In what way are these plots consistent with the decision regions displayed earlier? Comment in particular about the experiment with random_state=9
.