This homework assignment explores some situations in which the geometric intuition we developed through our sensory experience in two or three dimensions does not carry over to higher-dimensional spaces (part 1). It then reviews some terminology (part 2). Finally, in part 3, it previews, in a simple context and in simplified form, some concepts related to validation, which we will explore more deeply later in the course.

None of the answers in this assignment require mathematical proof.

When you submit your work, follow the instructions on the submission workflow page *scrupulously* for full credit.

This part explores some important differences between low-dimensional and high-dimensional worlds. These discrepancies should caution us not to rely too much on geometric intuition when we work in spaces with many dimensions.

Let $d$ be a positive integer. Given a positive real number $a$, the set
$$
C(d, a) = \{\mathbf{x}\in\mathbb{R}^d \ :\ -a/2 \leq x_i \leq a/2
\;\;\text{for}\;\; i=1,\ldots,d\}
$$
is a *cube* of side-length $a$, centered at the origin. The *unit cube* is $C(d, 1)$.

The *unit sphere* $S(d)$ in $d$ dimensions is the set of points at Euclidean distance at most 1 from the origin:
$$
S(d) = \{\mathbf{x}\in\mathbb{R}^d \ :\ \|\mathbf{x}\| \leq 1 \}\;.
$$

The unit sphere $S(d)$ is exactly inscribed in the *2-cube* $C(d, 2)$, as shown on the left in the figure below for $d=2$. Let us call the set
$$
V(d, 2) = C(d, 2) \setminus S(d)
$$
(where '$\setminus$' denotes set difference) the *corners* of the 2-cube. This is the part of the 2-cube that is not in the sphere inscribed in it, and is shown in gray on the left in the figure below for $d=2$.

Let $\epsilon$ be a small positive real number ($0<\epsilon \ll 1$). The $\epsilon$*-skin* $E(d, \epsilon)$ of the unit cube $C(d, 1)$ is a layer of thickness $\epsilon/2$ just inside the surface of $C(d, 1)$:
$$
E(d, \epsilon) = C(d, 1) \setminus C(d, 1-\epsilon) \;.
$$
The figure on the right below shows the $\epsilon$-skin $E(2, \epsilon)$ in gray for some small $\epsilon$.

The *volume* of a compact set $\mathcal{A}\subset\mathbb{R}^d$ is the integral of $1$ over $\mathcal{A}$.

It can be shown (see for instance C. M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006) that the volume of $S(d)$ is
$$s(d) = \frac{2\pi^{d/2}}{d\ \Gamma(d/2)}$$
where $\Gamma(z)$ is the gamma function.

The Python function `scipy.special.gamma` computes $\Gamma(z)$. In particular, for $d=1,2,3$,

$$ \Gamma(1/2) = \sqrt{\pi}\;\;\;\text{,}\;\;\; \Gamma(1) = 1 \;\;\;\text{,}\;\;\; \Gamma(3/2) = \sqrt{\pi}/2 \;. $$
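As a quick sanity check (not part of the required answers), the formula for $s(d)$ can be evaluated with `scipy.special.gamma` and compared against the familiar low-dimensional volumes $s(1)=2$, $s(2)=\pi$, and $s(3)=4\pi/3$:

```python
import numpy as np
from scipy.special import gamma

def s(d):
    """Volume of the unit sphere S(d) in d dimensions."""
    return 2 * np.pi ** (d / 2) / (d * gamma(d / 2))

# The familiar low-dimensional cases: the length of [-1, 1], the area
# of the unit disk, and the volume of the unit ball.
print(s(1), s(2), s(3))  # 2.0, pi, 4*pi/3
```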

What is the volume $c(d, a)$ of $C(d, a)$ and, in particular, what is $c(d, 1)$? No need to prove your answers.

Give a simple formula for the volume $e(d, \epsilon)$ of $E(d, \epsilon)$ in terms of $d$ and $\epsilon$.

Use the function `matplotlib.pyplot.loglog` to plot $e(d, 10^{-3})$ for $d$ between $1$ and $10^{6}$, with logarithmic scaling on both axes. Label the axes.

You may want to refer to Homework 1 for how to display a plot.

To make `matplotlib` code work correctly with multiple figures in the same notebook, put the line `%matplotlib inline` before you do your first plot.

Of course, there is no need to compute values for *every* integer $d$ in the requested range. The function `numpy.logspace` with default values for the optional arguments gives a reasonable sampling.
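For instance (a sketch of the sampling only, not of the full plot), the default call produces 50 samples spaced evenly on a logarithmic scale:

```python
import numpy as np

# 50 samples (the default `num`) spaced evenly on a log scale between
# 10**0 = 1 and 10**6; these can serve as the values of d to plot.
d = np.logspace(0, 6)
print(len(d), d[0], d[-1])  # 50 1.0 1000000.0
```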

Write a simple sentence that describes where most of the points in the unit cube are when the number $d$ of dimensions is large and $\epsilon$ is small. What is the approximate numerical value of the fraction of points in the $\epsilon$-skin of the unit cube when $d=2$ and $\epsilon=10^{-3}$?

Write a formula for the ratio $r(d)$ between the volume of the unit sphere and that of the 2-cube. Simplify your formula if possible and show your calculations.

Plot $r(d)$ (as a regular plot, no logarithms) for every integer $d$ between 1 and 10. Since only integer values of $d$ make sense, draw dots connected by a solid line (that is, specify options `marker='.', markersize=12` for `matplotlib.pyplot.plot`).

Approximately what percentage of the volume of a cube is contained in the sphere inscribed in the cube when $d=2$? And when $d=10$?

The input $\mathbf{x}$ to all predictors mentioned in the problems that follow is a black-and-white image with $307,\negthinspace 200 = 480\times 640$ pixels arranged in $480$ rows and $640$ columns. For simplicity, the value of a pixel is a real number, without further restriction. The feature function $\phi$ defined in class (lecture on functions and data fitting) is the identity function. As a consequence, the functions $f$ and $h$ are the same, and so are the two domains $A$ and $X$.

For a classifier, the *zero-one loss* $\ell(y, y')$ is a loss function that assigns a value $1$ to a classification mistake (in which the true output $y$ is different from the predicted output $y'$) and a value $0$ to a correct prediction. This loss can be represented by a matrix that has one row per possible true output $y$ and one column per possible predicted output $y'$. The entry at row $y$ and column $y'$ of this matrix contains the value of the loss corresponding to $y$ and $y'$.
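As an illustration of this matrix representation (for a hypothetical three-class classifier, not the one in the problem below), the zero-one loss matrix has zeros on the diagonal and ones everywhere else:

```python
import numpy as np

# Zero-one loss matrix for a hypothetical classifier with C classes:
# entry (y, y') is 0 when y == y' (correct prediction) and 1 otherwise.
C = 3  # arbitrary number of classes, for illustration only
loss_matrix = np.ones((C, C), dtype=int) - np.eye(C, dtype=int)
print(loss_matrix)
```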

The output $y$ of a classifier $h$ is the activity performed by the single person visible in the image $\mathbf{x}$. That person is playing tennis, and the possible activities are *serve*, *forehand*, *backhand*, *volley*, and *other*. The *other* activity covers any shot that is not of the first four types, and any other activity or lack thereof.

a. If the signature of $h$ is $X\rightarrow Y$, what are $X$ and $Y$, in terms of quantities given in the text for this part?

b. Write the zero-one loss matrix for this classifier.

Label your answers a and b.

The output $\mathbf{y}$ of a predictor $h$ (different from the one in the previous problem) is a vector of $25$ parameters that describe the tennis player's body configuration in the given image $\mathbf{x}$. You can think of these $25$ real numbers as angles that describe the orientation of each body joint (shoulder, elbow, knee,...) relative to the limb the joint is attached to (torso, arm, thigh,...).

a. Is $h$ a classifier or a regressor?

b. If the signature of $h$ is $X\rightarrow Y$, what are $X$ and $Y$, in terms of quantities given in the text for this part?

c. Why can the loss function not be represented by a matrix for this problem?

The problems in this part explore a simplified version of concepts we will cover later in this class.

The key difference between data fitting and machine learning is that the former just needs to do well on the training data, while the latter is expected to do well also on previously unseen data. One way to check that it does so is to have two data sets instead of one: The training set $T$, on which the predictor $h$ is trained, and a separate set $V$ called the *validation* set, disjoint from $T$. (Depending on context, the second set could also be called the *test* set. More on this later in the course.)

The average loss of $h$ on $T$ is called the *training risk*, and the average loss on $V$ is called the *validation risk*. Both risks are *empirical*, in the sense that they are estimates computed on data. The validation set plays the role of "previously unseen data" and the validation risk is a (possibly crude) estimate of how $h$ performs on "previously unseen data."
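To make these definitions concrete, here is a toy sketch (all numbers are made up, and the predictor is deliberately trivial: a constant equal to the mean of the training outputs) of computing the two empirical risks under the quadratic loss:

```python
import numpy as np

# Made-up training and validation outputs, for illustration only.
T_y = np.array([1.0, 2.0, 3.0])
V_y = np.array([2.0, 4.0])

# A deliberately trivial predictor: always output the training mean.
h = T_y.mean()

# Empirical risks: the average quadratic loss on each set.
training_risk = np.mean((T_y - h) ** 2)
validation_risk = np.mean((V_y - h) ** 2)
print(training_risk, validation_risk)
```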

We explore these concepts in the context of univariate polynomial regression, which we already saw in homework 1. Rather than writing our own fitting function, we use `numpy.polyfit`. Please look at its manual page online. Note in particular that this function returns an array of coefficients in order of *decreasing* power. The function `numpy.polyval` takes a polynomial from `numpy.polyfit` and evaluates it at a given set of points.
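A small example of the coefficient convention (the specific points are made up): fitting a degree-2 polynomial to three points that lie on $y = x^2 + 1$ recovers the coefficients in decreasing order of power, and `numpy.polyval` evaluates the result:

```python
import numpy as np

# Three points that lie exactly on y = x**2 + 1 (made up for illustration).
c = np.polyfit([0.0, 1.0, 2.0], [1.0, 2.0, 5.0], 2)
print(np.round(c, 3))  # coefficients in decreasing power, approximately [1, 0, 1]

# numpy.polyval takes the coefficient array and a set of points.
print(np.polyval(c, 3.0))  # approximately 3**2 + 1 = 10
```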

If the file `data.pickle` that comes with this assignment is in the same directory as your notebook, the following code reads a dictionary `data` from that file. The dictionary contains a training set `data['T']` and a validation set `data['V']`. Each data set is in turn a dictionary. For instance, if you define

```
T = data['T']
```

for training set

$$ T = \{(x_1, y_1),\ldots, (x_N, y_N) \} $$

then `T['x']` is a `numpy` one-dimensional array with all inputs $x_n$ and `T['y']` is a similar array with all corresponding outputs $y_n$ in $T$. The format of the validation set $V$ is the same.


```
import pickle
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)
```

Write a function `train` with two arguments. The first argument `T` is a training set represented like `data['T']` above. The second argument `k` is a nonnegative integer. The function `train` fits a univariate polynomial of degree `k` to the data in `T`.

Show your code and the result of fitting a polynomial of degree 3 to `data['T']`. Specifically, show the coefficients of the polynomial in decreasing order of degree and with three decimals after the period, and then use the function `show` below (similar to that in homework 1, but using `numpy.polyval`) to plot both training set and polynomial.


```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def show(x, y, hList=[]):
    plt.plot(x, y, marker='.', markersize=12, ls='')
    npt = 100
    xrange = [min(x), max(x)]
    xFine = np.linspace(xrange[0], xrange[1], npt)
    for h in hList:
        hxFine = np.polyval(h, xFine)
        plt.plot(xFine, hxFine, label='degree ' + str(len(h) - 1))
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.show()
```

Hints:

- The body of the function `train` is one line of code, or perhaps two.
- You should not expect a great fit, as the data was not generated with a third-degree polynomial.

Write a function `loss(y, yp)` that takes the true value `y` and the predicted value `yp` for a regression problem and returns the quadratic loss $\ell(y, y') = (y - y')^2$. The arguments are `numpy` arrays of equal size, and the output is an array of the same size with all of the losses.

Then write a function `L(h, T, l)` that takes a `numpy` vector `h` of polynomial coefficients, a training set `T` represented as a dictionary with entries `x` and `y` as shown earlier, and a loss function `l` with the same signature as `loss`, and computes the empirical risk of `h` on `T` with loss `l`:

$$ L_T(h) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, h(x_n)) $$

where $N$ is the number of samples in $T$.

Show your code and the numerical value of the empirical risk for the fit in problem 3.1. Use three decimal digits after the period.

(Hint: `numpy.mean` comes in handy.)

Make a figure that superimposes two plots in the same diagram, each plot represented as a sequence of dots connected by lines (as in the function `show` above).

The first plot is the *training risk*, that is, the empirical risk on `data['T']` for the fit in problem 3.1 but for polynomial degrees $k = 0,\ldots, 12$, using the quadratic loss. The second plot is the *validation risk*, that is, the analogous quantity estimated on `data['V']` (of course, for both plots the polynomials are trained on `data['T']`).

Label the axes of the figure as "k" and "risk" and place a legend that specifies which plot is which.

Show both code and plot.

For what value $k^{\star}$ of $k$ does the fit generalize best to the validation data? Also use the function `show` to display the validation points and the fit to the training data for $k^{\star}$.

Why is the plot for the training risk in problem 3.3 monotonic non-increasing? (If your plot doesn't look that way, you may want to review your solution to 3.3.) Explain briefly and clearly.

There are two technical terms that denote the fact that the validation risk is (i) higher for $k < k^{\star}$ and (ii) higher for $k > k^{\star}$, when compared to the validation risk at $k^{\star}$. What are the two terms? Specify which term is for which case ((i) or (ii)).