When you submit your work, follow the instructions on the submission workflow page * scrupulously* for full credit.

**Important: Failure to do any of the following will result in lost points:**

Submit

**one**PDF file and**one**notebook per groupEnter

**all the group members**for the PDF submission in Gradescope, not just in your documents. Look at this Piazza Post if you don't know how to do thisMatch each

**answer**(not question!) with the appropriate page in GradescopeAvoid large blank spaces in your PDF file

Homework 3 explored a possible implementation of $k$-Nearest-Neighbors ($k$-NN) predictors. That implementation assumed that no work at all is done at training time, other than storing the entire data set $T$. In this part of this assignment, you will be using the $k$-NN classifier in `scikit-learn`

instead. This implementation sets up a suitable data structure at training time in order to speed up inference.

We use Euclidean distance in all cases in this part.

To get you started, the following code computes the risk of a $k$-NN classifier, assuming that the labels $y\in Y$ are represented by integers. The classifier is trained on set `T`

and the risk is computed on set `S`

, both stored as dictionaries with keys `x`

and `y`

, as usual. The call to `h.fit`

sets up the data structure, so you can view that call as some sort of "training."

In [1]:

```
import numpy as np
from sklearn import neighbors
def risk(k, T, S):
h = neighbors.KNeighborsClassifier(k)
h.fit(T['x'], T['y'])
y = h.predict(S['x'])
return 1 - np.sum(S['y'] == y) / len(S['y'])
```

What loss does the `risk`

function above use?

What is the value of

```
risk(1, T, T)
```

regardless of what training set $T$ is used, assuming that all data points $\mathbf{x}$ in $T$ are distinct? Justify your answer with brief and clear text that involves this last assumption.

Would your answer to problem 1.2 change for the call

```
risk(k, T, T)
```

with $k > 1$? Explain briefly and clearly.

Let us now remove the assumption that all data points $\mathbf{x}$ in $T$ are distinct. Describe briefly and clearly what would have to happen for the value of

```
risk(1, T, T)
```

to have a different value from what you gave in your answer to problem 1.2.

The following code, adapted from the `scikit-learn`

tutorial, loads a miniature version of the MNIST handwritten digit recognition dataset used to train and evaluate classifiers. The data is stored in a dictionary `data`

with keys `'x'`

and `'y'`

, as usual.

The code below also displays four of the images as an example. The images are very small, $8\times 8$ pixels, and the code flattens them into vectors of $64$ numbers. Each vector is a row of `data['x']`

, and corresponds to a datapoint $\mathbf{x}$. The corresponding labels $y$ are integers between $0$ and $9$ and are stored in `data['y']`

.

In [2]:

```
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
digits = datasets.load_digits()
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
plt.subplot(2, 4, index + 1)
plt.axis('off')
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('Training: %i' % label)
plt.show()
data = {'x': digits.images.reshape((len(digits.images), -1)), 'y': digits.target}
```

We will look at plots of risk for different values of $k$, and we will compute risks for different random splits of the data into training and test set. Computing these plots takes some time, so it's useful to have an indicator of progress, just to know where we are with the computation. The following function is provided for your convenience. It is called once before the computation starts, with argument `p`

equal to the number of iterations in the computation. Then it's called once per iteration. It prints a message every 10% of the way.

In [3]:

```
from math import floor
def progress(p=0):
try:
progress.initialized = progress.initialized
except AttributeError:
progress.initialized = True
progress.count, progress.last, progress.decade = 0, 0, 0
if p > 0:
progress.count, progress.last, progress.decade = 0, p, 0
else:
progress.count += 1
decade = floor(10 * progress.count / progress.last)
if decade > progress.decade:
progress.decade = decade
print(10 * decade, '%', sep='', end=' ')
if decade == 10:
print()
```

A function to display the risk plots with error bars superimposed is also provided below. Use either `progress`

or `showRisk`

, or both, as you deem convenient.

In [4]:

```
def showRisk(kValues, mean, std):
plt.figure()
ax = plt.axes([0, 0, 1, 1])
plt.errorbar(kValues, mean, yerr=std, fmt='-o', lw=2,
capsize=5, ecolor='orange')
ax.set_xticks(kValues)
plt.xlabel('k')
plt.ylabel('Risk (Percent)')
plt.show()
```

Write code that repeats the following computation $30$ times: Split the `data`

into a training set `T`

that contains about 3/4 of the `data`

, *selected at random*, and a test set `S`

that contains the remaining `data`

. For each split, compute the

```
risk(k, T, S)
```

for all odd values of $k$ between $1$ and $21$ inclusive (we consider odd values only so we never have ties). Plot the risk values with error bars. The half-length of the bar for value $k$ is the standard deviation of the risk values over the 30 trials for that value of $k$.

Show your code and the resulting plot, with axes appropriately labeled (`showRisk`

does this for you), and report the mean risk for $k=1$, as a percentage (not a fraction) and with two decimal digits after the period.

It is important to loop over $k$

*inside*the loop over the trials. In this way, the $30$ splits are the same splits for every $k$, for a fair comparison.Do not count on the

`data`

being stored in a random order: They are not. To obtain a random permutation of a set of integers, you can use the function`numpy.random.permutation`

.It is simplest to first store all risk values in an array, and then use

`numpy.mean`

and`numpy.std`

to compute mean and standard deviation.Progress indicators are not required, but you may want to use

`progress`

anyway, as this is a lengthy calculation.

Your plot of the mean risk should by-and-large increases with $k$, except possibly for fluctuations that are small relative to the uncertainty encoded by the error bars. Because of this, $k=1$ is about as good as it gets. This may surprise you, given our class discussion about overfitting in the context of nearest-neighbor prediction. This problem explores why this may be the case for this data set.

For a particular data point $\mathbf{x}$ in the whole data set `data`

, let $d_s$ be the distance between $\mathbf{x}$ and the nearest data point $\mathbf{x}_s$ in `data`

, other than $\mathbf{x}$ itself, that has the *same* label as $\mathbf{x}$. Furthermore, let $d_d$ be the distance between $\mathbf{x}$ and the nearest data point $\mathbf{x}_d$ in `data`

whose label is *different* from that of $\mathbf{x}$.

Let us define the *margin* of $\mathbf{x}$ to be the difference

$$ \mu(\mathbf{x}) = d_d - d_s \;. $$

Is a $1$-NN classifier likely to do better when most margins are positive or negative? Justify your answer briefly and clearly.

Write code to compute a 100-bin histogram of the margin $\mu(\mathbf{x})$ of the data in `data`

. Show your code and the resulting plot. Your histogram should have properly labeled axes. The values in the histogram should add up to the number of data points, not to $1$. Also report the percentage (not fraction) of points with a negative margin. Display three decimal digits after the period.

Let

$$ f(\mathbf{x}) = \frac{x_1^2 + x_2^2}{10} + e^{-(\mathbf{x} - \mathbf{c})^T(\mathbf{x} - \mathbf{c})} $$

where

$$ \mathbf{x} = \left[\begin{array}{c}x_1\\x_2\end{array}\right]\;\;\;\;\;\text{and}\;\;\;\;\; \mathbf{c} = \left[\begin{array}{c}2\\1\end{array}\right]\;. $$

A labeled contour plot of $f(\mathbf{x})$ with $12$ contours is shown below. The function `f(x)`

computes samples of $f$ on a grid of $100\times 100$ equi-spaced samples in the square $-2 \leq x_1, x_2 \leq 4$.

In [5]:

```
import numpy as np
import matplotlib.pyplot as plt
def showContours(X, Y):
fig, ax = plt.subplots()
cs = ax.contour(X[:, :, 0], X[:, :, 1], Y, 12)
ax.clabel(cs, inline=1, fontsize=10)
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('image')
plt.show()
n, mn, mx = 100, -2, 4
x = np.linspace(mn, mx, n)
X = np.dstack(np.meshgrid(x, x))
def inner(x, y):
return x[:, :, 0] ** 2 + y[:, :, 1] ** 2
c, scale = [2, 1], 10
def f(x):
d = np.zeros(x.shape)
for i in range(2):
d[:, :, i] = x[:, :, i] - c[i]
y = inner(x, x) / scale + np.exp(-inner(d, d))
return y
Y = f(X)
showContours(X, Y)
```

Write mathematical formulas for the gradient $\nabla f(\mathbf{x})$ and the Hessian $H_f(\mathbf{x})$ of $f$. If you are not sure of your result, show your calculations.

A correct answer gets you full credit, but a compact one leads to simpler code later on.

Use your formulas to write code that draws a labeled contour plots of the norm of the gradient of $f$ on a grid of $100\times 100$ equi-spaced samples in the square $-2 \leq x_1, x_2 \leq 4$. Show your code and the resulting plot.

Use your formulas to write code that draws an image whose color depends on the signs of the eigenvalues of $H_f(\mathbf{x})$ on a grid of $100\times 100$ equi-spaced samples in the square $-2 \leq x_1, x_2 \leq 4$. Show your code and the resulting plot.

Suppose that your code computes an array `Lambda`

with shape `(100, 100, 2)`

so that `Lambda[i, j, :]`

contains the two eigenvalues at grid point $(i, j)$. Then, the function below displays the required image and prints out the meaning of the colors.

In [6]:

```
def showLambdas(Lambda):
L0 = Lambda[:, :, 0] > 0
L1 = Lambda[:, :, 1] > 0
sign = 0.5 * np.ones(Lambda.shape[:2])
sign[L0 & L1] = 1
sign[(~L0) & (~L1)] = 0
fig, ax = plt.subplots()
ax.imshow(sign, origin='lower')
plt.axis('off')
plt.show()
print('Mapping from colors to signs of the eigenvalues:')
print('Yellow: both positive; Green: mixed; Black: both negative')
```

How many stationary points, and of what types (maximum, minimum, saddle) does $f$ plausibly have in the region of the plot? Justify your answer.

The system of two equations

$$ \nabla f(\mathbf{x}) = \mathbf{0} $$

cannot be solved in closed form, so it would be difficult to *prove* your answer. You may be able to give a plausible answer to the question by reasoning about how the function $f$ is built (look at the code for it) and/or examining the contour plot of the gradient magnitude.

Code somehow like the following may help as well, where `G`

is the array with the values of the gradient magnitude:

```
fig, ax = plt.subplots()
ax.imshow(G < 0.1, origin='lower')
```