When you submit your work, follow the instructions on the submission workflow page *scrupulously* for full credit.

**Important: Failure to do any of the following will result in lost points:**

- Submit **one** PDF file and **one** notebook per group.
- Enter **all the group members in Gradescope**, not just in your documents. Look at this Piazza Post if you don't know how to do this.
- Match each **answer** (not question!) with the appropriate page in Gradescope.
- Avoid large blank spaces in the middle of an answer in your PDF file.

The statement I made in class that the function $f(x) = a x + b$ is generally nonlinear (even if its plot is a line) seems to have surprised some, so let us prove it. This will also rekindle our proof-writing skills.

We use a definition of linearity that, although not as pithy as the one used in class, is less confusing to work with. A function $g(x)\ :\ \mathbb{R}\rightarrow\mathbb{R}$ is *linear* if for every $c, d, x, y\in\mathbb{R}$ the following equality holds:

$$ g(cx + dy) = c g(x) + d g(y)\;. $$
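As a quick illustration of this definition (this example is not part of the problem below), the function $g(x) = 3x$ is linear: for every $c, d, x, y\in\mathbb{R}$,

$$ g(cx + dy) \;=\; 3(cx + dy) \;=\; 3cx + 3dy \;=\; c\,(3x) + d\,(3y) \;=\; c\, g(x) + d\, g(y)\;, $$

where the second equality uses the distributive property and the third uses commutativity and associativity of multiplication.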

Prove that the function $\,f\ :\ \mathbb{R}\rightarrow\mathbb{R}$ defined by

$$ f(x) = a x + b $$

where $a, b$ are real numbers is linear if and only if $b = 0$.

A proof consists of a clear and concise explanation, so it will have simple, correct math and full sentences in good English. The reason why each equality in the math holds has to be explained. A proof of an equivalence ("if and only if") either proceeds by equivalences or has two separate parts, one for the "if" and one for the "only if." These two styles can be mixed in the same proof. The choice is yours. A proof ends with a conclusion that restates what has been proven.

Proofs need not be pedantic. For proofs of complex statements, this means that one typically assumes knowledge of common facts of mathematics, without stating those in detail in the proof. For instance, if the proof involves the equality

$$ \int_0^{\infty} e^{-x}\, dx \;=\; 1\;, $$

the proof need not explain why this equality holds.

For proofs of elementary facts, on the other hand, one can afford the luxury of explaining almost everything. The statement to prove in this problem is very elementary, so make sure you mention all the main properties you invoke, such as commutativity or the distributive property of one operation over another. If an equality follows from a definition, say so. If in doubt, err on the side of pedantry. However, precision and verbosity rarely go hand in hand.

This part helps you absorb some details of the argument that was used in the class notes on functions and data fitting to count the number of coefficients in a multivariate polynomial. Please review those notes before proceeding.

The general polynomial in $d=2$ variables and of degree up to $k=3$ is

$$ P_{2,3}(\mathbf{x}) = c_{0} + c_{1} x_{1} + c_{2} x_{2} + c_{3} x_{1}^{2} + c_{4} x_{1}x_{2} + c_{5} x_{2}^{2} + c_{6} x_{1}^{3} + c_{7} x_{1}^{2}x_{2} + c_{8} x_{1}x_{2}^{2} + c_{9} x_{2}^{3} $$

Examples of monomials are $1$, $x_{1}^{3}$, $x_{1}^{2}x_{2}$. These have degree $0,3, 3$, respectively.

Let $\mathcal{H}_{d,k} = \{P_{d, k}\}$ be the hypothesis space of all polynomials in $d$ variables and of degree up to $k$.

Please stick to the notation above in this part, paying attention to the following conventions:

Variables have subscripts from $1$ to $d$. In each monomial they are ordered by increasing subscript.

The degree-zero monomial $1$ is not written out explicitly in a polynomial, nor are any variables that are raised to the power $0$. A power of $1$ is left unwritten.

Coefficients have subscripts starting at $0$, and are ordered by increasing subscript in the expression for the polynomial.

Monomials are ordered by increasing degree. Within each degree, the ordering of monomials does not matter. For instance, the sub-expression

$$ c_{3} x_{1}^{2} + c_{4} x_{1}x_{2} + c_{5} x_{2}^{2} $$

could also be written as follows:

$$ c_{3} x_{2}^{2} + c_{4} x_{1}x_{2} + c_{5} x_{1}^{2} $$

(which of course leads to a different interpretation of the coefficients).

However, please try to keep some reasonable ordering to save us time when we check your answers.

The class notes mentioned earlier establish a bijection between all *monomials* in $\mathcal{H}_{d,k}$ and the binary strings with exactly $d$ ones and $k$ zeros. This bijection helps count the monomials, because the strings have a very constrained format, and counting them is therefore easier. As an example (the definition is in the notes), when $d = 4$ and $k=10$, the monomial

$$ \mu \;=\; x_1^5 x_3 x_4^2 $$

corresponds to the string

$$ s \;=\; 00100000110100 \;. $$

Recall that the number of zeros before the first $1$ in $s$ is the difference between the maximum degree $k$ and the degree $k'$ of the monomial. In the example, the degree of the monomial is $k' = 5 + 1 + 2 = 8$, so there are $q = k - k' = 2$ zeros before the first $1$ in $s$. Including these zeros in $s$ ensures that this string always has length $k+d$. In other words, $\mu$ can be padded with $1^{k-k'}$ so that the powers add up exactly to $k$. In the example:

$$ \mu' \;=\; 1^2 x_1^5 x_3 x_4^2 $$

represents the same monomial.

A useful representation that is intermediate between $\mu$ and $s$ is the vector that collects the $d+1$ powers that appear in $\mu'$, counting also the power of $1$ and any zero powers. In the example:

$$ \mathbf{p} \;=\;[2, 5, 0, 1, 2]\;. $$

What makes this representation useful is that it connects simply to both $\mu$ and $s$. On one hand $\mathbf{p}$ lists the powers of $\mu'$ directly and therefore, if the first element of $\mathbf{p}$ is ignored, those of $\mu$. On the other hand, the vector $\mathbf{p}$ contains the lengths of the runs of zeros in $s$, and is therefore a rather direct description of $s$ as well.
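The run-length connection can be checked on the worked example with a short Python snippet (this is only an illustration of the correspondence on one example, not a full solution to the conversion problem below):

```python
# String from the worked example (d = 4, k = 10), for x1^5 x3 x4^2.
s = '00100000110100'

# Splitting s at each '1' yields the runs of zeros; their lengths form p.
p = [len(run) for run in s.split('1')]

print(p)  # [2, 5, 0, 1, 2]
```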

The string $s$ and the vector $\mathbf{p}$ represent a monomial uniquely *only within* $\mathcal{H}_{d,k}$. For instance, the empty string $s$ represents the constant monomial $1$, which has degree $k' = 0$, when this monomial is thought of as a member of $\mathcal{H}_{0, 0}$. The corresponding vector is

$$ \mathbf{p} \;=\; [0]\;. $$

More generally, in $\mathcal{H}_{d,k}$, the constant monomial $1$ can be written as

$$ \mu \;=\; 1^k x_1^0\ldots x_d^0 $$

and is therefore represented by the string

$$ s \;=\; \underbrace{0\ldots 0}_{k} \underbrace{1\ldots 1}_{d} $$

and by the vector

$$ \mathbf{p} \;=\; [k, \underbrace{0,\ldots, 0}_{d}] \;. $$

As a sanity check, note that the string $s$ always has length $d+k$ and has $d$ ones, and the vector $\mathbf{p}$ always has $d+1$ entries (and is therefore never empty) and its entries add up to $k$.
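These invariants are easy to verify mechanically for the worked example; the snippet below is a small verification sketch, not part of the assignment. The final line uses the bijection with binary strings from the notes: strings of length $d+k$ with exactly $d$ ones number $\binom{d+k}{d}$.

```python
from math import comb

d, k = 4, 10
s = '00100000110100'   # string for the monomial x1^5 x3 x4^2
p = [2, 5, 0, 1, 2]    # corresponding vector of powers

assert len(s) == d + k      # the string always has length d + k
assert s.count('1') == d    # ...and contains exactly d ones
assert len(p) == d + 1      # the vector always has d + 1 entries
assert sum(p) == k          # ...and its entries add up to k

# Counting binary strings with d ones and k zeros counts the monomials:
print(comb(2 + 3, 2))       # 10, matching the ten coefficients of P_{2,3}
```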

Write an expression for the general polynomial of degree up to $k=2$ in $d=3$ variables.

How many coefficients does $P_{3,4}(\mathbf{x})$ have?

Make a table with the monomial $\mu'$ (not $\mu$), string $s$, and vector $\mathbf{p}$ for each of the monomials in $P_{3, 2}$. Use one table row for each monomial, and list the monomials in the order in which they appear in your expression for $P_{3, 2}$.

As an exception to the rules above, write every power of $1$ in $\mu'$ explicitly, even when it is $0$ or $1$.

Write two Python 3 functions with headers

`def p2s(p):`

and

`def s2p(s):`

that convert between $\mathbf{p}$ and $s$. Show your function definitions and the outputs produced by the code below.

In your code, `p` is represented as a list of integers and `s` is represented as a string of characters, with only `'0'` and `'1'` allowed. The function `p2s` returns `s` and the function `s2p` returns `p`.

You need not include any checks in your code (such as nonnegative entries in `p`, spurious characters in `s`, and so forth).

In some cases, the output from `p2s` is the empty string. The corresponding `print` statement below will just produce a new line (which is the standard terminator for `print`). There is nothing wrong with that.

In [1]:

```
try:
    print(p2s([1]))
    print(p2s([0]))
    print(p2s([0, 0]))
    print(p2s([2]))
    print(p2s([0, 0, 0]))
    print(p2s([4, 2, 1, 0, 3, 0]))
    print(s2p('0'))
    print(s2p(''))
    print(s2p('1'))
    print(s2p('00'))
    print(s2p('11'))
    print(s2p('000010010110001'))
except NameError:
    pass
```

This part helps you understand the basics of $k$-nearest-neighbors ($k$-NN) prediction. Because of this, you are *not* allowed to use or access any programming library function that implements any version of nearest-neighbor prediction. On the other hand, any function in the standard Python library or in `numpy` is fair game.

The term "$k$-NN predictor" really refers to a *family* of predictors, since defining a specific $k$-NN predictor that computes $y = h(\mathbf{x})$ involves two choices:

- How to measure the distance $d(\mathbf{x}, \mathbf{x}')$ between two data points in the data space $X$ in order to identify the *nearest* neighbors.
- How to compute the output summary $y = s(y_1, \ldots, y_k)$ of the $k$ values $y_1,\ldots, y_k$ found in the training set $T$ for the $k$ nearest neighbors $\mathbf{x}_1,\ldots, \mathbf{x}_k$ of a given input data point $\mathbf{x}$. For instance, a classifier typically outputs the majority value, while a regressor typically outputs either the mean or the median of the values.

The function `knn` you will write below implements the whole family, and takes the distance and summary functions as input arguments.

Write a Python 3 function with header

`def knn(x, k, T, distance, summary):`

that takes a data point `x` as a `numpy` array, a positive integer `k`, a training set `T`, a distance function, and a summary function. The function `knn` returns the value $y = h(\mathbf{x})$ of the $k$-NN prediction for data point $\mathbf{x}$, using `distance` to determine the nearest neighbors of $\mathbf{x}$ and `summary` to summarize the corresponding $k$ values.

Show your code.

- If the data space $X$ is $d$-dimensional, then the `numpy` array `x` is a vector with $d$ entries.
- `k` is any positive integer less than $N$, the number of samples in the training set.
- The training set `T` is a dictionary `{'x': Tx, 'y': Ty}` where `Tx` is an $N\times d$ `numpy` array and `Ty` is an $N\times 1$ `numpy` array.
- The function `distance` has header `def distance(x, xp):`. It takes two data points (each in the same format as `x`) and returns a nonnegative real number.
- The function `summary` has header `def summary(Y):`. It takes a $k\times 1$ `numpy` array `Y` of real values and returns a single real value. The requirement that the values be real is unnecessarily restrictive in general, but it simplifies programming for this assignment.
- Once the list `D` of all the distances between input point $\mathbf{x}$ and training data points $\mathbf{x}_n$ has been computed, one way to find the indices of the $k$ smallest distances would be to use `numpy.argsort` to find the indices that sort `D`, and then use the first $k$ indices. However, the complexity of sorting is $O(N\log N)$. In contrast, the function `numpy.argpartition` computes what is called a *partial sort* and runs in time $O(N + k\log k)$. Since $k \ll N$, this is significantly faster. Use `numpy.argpartition` in your answer. Please read the documentation for this function.
- The function `knn` is simple. For reference, the body of my implementation is three lines long, and this is just because I wanted the function to be readable.
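To illustrate how `numpy.argpartition` finds the $k$ nearest neighbors (a usage sketch on made-up distances, not the `knn` function itself):

```python
import numpy as np

# Made-up distances from an input point to N = 6 training points.
D = np.array([5.0, 1.0, 4.0, 2.0, 6.0, 3.0])
k = 2

# After partitioning at position k, the first k entries of the result are
# the indices of the k smallest distances, in no guaranteed order.
nearest = np.argpartition(D, k)[:k]

print(sorted(nearest.tolist()))  # [1, 3]: the points at distances 1.0 and 2.0
```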

Write functions `Euclidean`, `mean`, `median`, and `majority`. The first is a distance function; the other three are summary functions. They implement Euclidean distance in data space and mean, median, and majority in value space. To write `majority` you may assume that all the values in the input vector are nonnegative integers. Break ties any way you like.

The bodies of these functions are one-liners. The norm of a vector can be computed with `numpy.linalg.norm`, and `numpy` has a `mean` and a `median` function. The functions `numpy.bincount` and `numpy.argmax` are useful when writing `majority`.
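As a hint on how `numpy.bincount` and `numpy.argmax` combine, here is a sketch on a toy array of labels (hypothetical values, under the stated assumption that they are nonnegative integers):

```python
import numpy as np

# Toy labels from k = 5 hypothetical nearest neighbors.
Y = np.array([2, 1, 2, 3, 2])

counts = np.bincount(Y)   # counts[v] = number of occurrences of value v
print(counts)             # [0 1 3 1]
print(np.argmax(counts))  # 2, the majority value
```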

Assuming that the file `data.pickle` that comes with this assignment is in the same directory as your notebook, the following code will read the training set used to make Figure 4 in the class notes on nearest-neighbor predictors. The result `T` is a dictionary, as in problem 3.1, and the dimensionality of the data space $X$ is $d=1$.

In [2]:

```
import pickle

with open('data.pickle', 'rb') as f:
    T = pickle.load(f)
```

Write code to reproduce Figure 4(b) in the notes.

Show your code and the figure it produces.

The plot is obtained by predicting $h(x)$ for each integer $x$ between 0 and 6000 inclusive.

The distance function is the absolute difference (same as the Euclidean distance in one dimension!) and the summary function is the mean.

Do not plot the original data points (blue dots in the original Figure).

Line styles (colors, thicknesses) need not match the original, and you need not label the axes.

Include a legend as in Figure 4(b), placed anywhere reasonable (the default behavior is OK).

Remember to include the magic `%matplotlib inline` as in homework 2.

**Be patient, this computation may take a while.** You may want to first try your code with a sample of the $x$ values to make sure everything works, and then make the final plot.

Redraw the plot you obtained in the previous question for $k=10$, and superimpose on it the plot obtained with the same value of $k$ but using `median` as the summary function. Add a legend that says `mean` or `median` for the corresponding plot.

Show code and output figure.

(This note is relevant only if you use a `for` loop to make your two plots.)

Read problem 3.5 before you run your code for problem 3.4. This will show you that you may need the results you obtain in 3.4 for 3.5 as well. Since it takes some time to compute these results, you may want to store them for later reference.

You will notice that the plots computed with mean and median differ for some values of $x$, but not much for other values of $x$. What would very different values of mean and median tell you about the values $y_1,\ldots, y_k$ in a neighborhood of some data point $x$? Explain, and provide an example from your plots.

In your explanation, refer to the real meaning of the data points. That is, use terms like "square feet" and "price" and "dollars."

Saying things like "when the mean and median are not very different, the distribution of values around data point $x$ has about equal mean and median" will obviously earn you no credit. Pasting things you don't understand from some web site won't, either.

As you learned the hard way, it takes quite a bit of time to run a $k$-NN classifier when the training set is large (and when you run it on many data points $x$, of course). Explain very briefly why this does not necessarily doom $k$-NN prediction to practical irrelevance.