When you submit your work, follow the instructions on the submission workflow page *scrupulously* for full credit.

**Important: Failure to do any of the following will result in lost points:**

- Submit **one** PDF file and **one** notebook per group
- Enter **all the group members** in your PDF Gradescope submission, not just in your documents. Look at this Piazza post if you don't know how to do this
- Match each **answer** (not question!) with the appropriate page in Gradescope
- Avoid large blank spaces in your PDF file

The file `hw6.py` that comes with this assignment defines the function `lineSearch` we developed in homework 5, as well as three functions that compute the value, gradient, and Hessian of the Rosenbrock function we encountered in that assignment. These three functions are called `Rosenbrock`, `RosenGrad`, and `RosenHess`, and take a `numpy` array $\mathbf{z}\in\mathbb{R}^2$ as argument. They return respectively a scalar, a `numpy` array of shape $2$, and a `numpy` array of shape $(2, 2)$. The file `hw6.py` also defines a floating-point value `small` to be used as a default value for numbers to be considered "small."

Write a function with header

```
def Newton(f, g, H, z0, maxK=1000, delta=hw6.small, normalize=False):
```

that takes three functions `f`, `g`, `H` that compute the value, gradient, and Hessian of some function $f$, and a starting point `z0` (a one-dimensional `numpy` array), and finds a minimum of $f$ by Newton's method. The optional parameter `maxK` is the maximum number of iterations, and `delta` is a threshold on the Euclidean norm of $\mathbf{z}_{k+1} - \mathbf{z}_{k}$. The algorithm stops when that norm falls below `delta`. See the programming notes for the meaning of `normalize`.

Show your code and the value `zNewton[-1]` found by running `Newton` on the Rosenbrock function, starting at $\mathbf{z}_0 = (-1.2, 1)$. Also print out the number of points in `zNewton`.

For this part, you may not use any library or other code that implements Newton's method, either exactly or substantively.

- Say `import hw6` to import the code from `hw6.py`.
- Whenever the Hessian fails to be positive semidefinite, your code should fall back to a single call to `lineSearch`, rather than computing the Newton step. Remember to pass `delta` to `lineSearch`.
- Rather than returning the value $\mathbf{z}^{\ast}$ found for $\arg\min_{\mathbf{z}} f(\mathbf{z})$, the function `Newton` should return a list of `numpy` arrays containing the points $\mathbf{z}_0,\ldots, \mathbf{z}_m$ encountered at each step of running `Newton`. Here, $\mathbf{z}_0$ is the initial point, and $m$ is the number of iterations, so if `zNewton` is the name for the value returned by `Newton`, then the minimum it found is `zNewton[-1]`. `Newton` returns the list of values encountered even when it fails to find a minimum within `maxK` iterations.
- Even if the Hessian is positive semidefinite, it can be degenerate (that is, it can have zero determinant). In that case, the function `numpy.linalg.solve` would fail. To address this problem, solve the system with the pseudo-inverse. Specifically, to solve

  $$A\mathbf{z} = \mathbf{b}$$

  you would write

  ```
  z = np.dot(np.linalg.pinv(A), b)
  ```

- The keyword argument `normalize` will be set to `True` in a later problem, for reasons we will examine. For now, right after finding the new value (let us call that, say, `z1`) of $\mathbf{z}$ in the current iteration of Newton's method, add code that checks if `normalize` is `True` and, if so, normalizes the value to have norm $1$:

  ```
  if normalize:
      if not np.all(z1 == 0):
          z1 /= np.linalg.norm(z1)
  ```
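Putting these notes together, the overall loop might look like the following minimal sketch. This is not the reference implementation: `hw6.small` is replaced by a local constant, and the `lineSearch` fallback is replaced by a plain gradient step, since `lineSearch` itself is not reproduced here.

```python
import numpy as np

small = 1e-6  # stand-in for hw6.small


def Newton(f, g, H, z0, maxK=1000, delta=small, normalize=False):
    # Returns the list [z_0, ..., z_m] of points visited, as the notes require.
    z = np.array(z0, dtype=float)
    history = [z.copy()]
    for _ in range(maxK):
        Hk = H(z)
        if np.all(np.linalg.eigvalsh(Hk) >= 0):
            # Newton step; the pseudo-inverse also handles degenerate Hessians
            z1 = z - np.dot(np.linalg.pinv(Hk), g(z))
        else:
            # the assignment calls hw6.lineSearch here; a plain gradient
            # step stands in for it in this sketch
            z1 = z - 0.01 * g(z)
        if normalize:
            if not np.all(z1 == 0):
                z1 /= np.linalg.norm(z1)
        history.append(z1.copy())
        if np.linalg.norm(z1 - z) < delta:
            break
        z = z1
    return history


# sanity check on a quadratic with minimum at (1, 2): one Newton step suffices
fq = lambda z: (z[0] - 1) ** 2 + (z[1] - 2) ** 2
gq = lambda z: np.array([2 * (z[0] - 1), 2 * (z[1] - 2)])
Hq = lambda z: np.array([[2.0, 0.0], [0.0, 2.0]])
path = Newton(fq, gq, Hq, np.array([0.0, 0.0]))
print(path[-1])  # ≈ [1. 2.]
```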

How does the number of iterations from `Newton` compare with the number you found for steepest descent in homework 5 on the same optimization problem? You need not give exact numbers.

In this part, you will plot the history of steps taken by `Newton` in a way that clarifies the method's geometry. This problem is also a good exercise in geometry and linear algebra. A theoretical prelude is needed first.

Let $f(\mathbf{z})$ be a function from $\mathbb{R}^2$ to $\mathbb{R}$, let $\mathbf{z}_0$ be the initial point in Newton's method, and let

$$ f_0 = f(\mathbf{z}_0) \;\;\;\;,\;\;\;\; \mathbf{g}_0 = \nabla f(\mathbf{z}_0) \;\;\;\;,\;\;\;\; H_0 = H_f(\mathbf{z}_0) $$

be the value, gradient, and Hessian of $f$ at $\mathbf{z}_0$. Throughout this part, assume that $H_0$ is positive definite (stronger than positive semidefinite).

Further, let $\mathbf{z}_1$ be the value of $\mathbf{z}$ obtained with one Newton step from $\mathbf{z}_0$, and let

$$ \Delta\mathbf{z} = \mathbf{z} - \mathbf{z}_0\;. $$

Then the method relies on the following approximation of $f$:

$$ f(\mathbf{z}) \approx g_2(\mathbf{z}) = f_0 + \mathbf{g}_0^{T} \Delta\mathbf{z} + \frac{1}{2} \Delta\mathbf{z}^T H_0 \Delta\mathbf{z} \;. $$

The *isocontour* of $g_2$ through $\mathbf{z}_0$ is the ellipse $E$ with equation

$$ g_2(\mathbf{z}) = f_0 \;, $$

that is,

$$ \frac{1}{2} \Delta\mathbf{z}^T H_0 \Delta\mathbf{z} + \mathbf{g}_0^{T} \Delta\mathbf{z} = 0\;. \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(1) $$

By construction, the minimum of $g_2$ is at $\mathbf{z}_1$.

Let $\mathbf{i}$ and $\mathbf{j}$ be two unit vectors on the plane that point along the axes of the ellipse $E$. Also, let $\ell_1, \ell_2$ be the half-lengths of the axes (a half-length is half the length). If we define the two $2\times 2$ matrices

$$ R = \left[\begin{array}{cc}\mathbf{i}^T\\\mathbf{j}^T \end{array}\right] \;\;\;\;\text{and}\;\;\;\; L = \left[\begin{array}{cc}\ell_1 & 0\\0 & \ell_2 \end{array}\right] $$

then the ellipse has equation

$$ (\mathbf{z} - \mathbf{z}_1)^T R^T L^{-2} R (\mathbf{z} - \mathbf{z}_1) = 1\;. \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(2) $$

To compare equations (1) and (2), which represent the same ellipse, we find the eigenvalues and eigenvectors of $H_0$, perhaps by using the function `numpy.linalg.eigh`, since $H_0$ is symmetric. Let $d$ and $V$ be the arrays returned by this function (read the manual). The columns of $V$ are the unit eigenvectors of $H_0$, and the vector $d$ contains the corresponding eigenvalues of $H_0$. Because of this, and by the definition of eigenvalues and eigenvectors, we can write

$$ H_0 = V D V^T $$

where $D$ is a diagonal matrix with the entries of $d$ on the diagonal.

Since equations (1) and (2) represent the same ellipse, the rows of $R$ and the columns of $V$ must point along the same axes, and therefore be the same vectors up to signs. Because these matrices always show up in pairs in the equations above, the signs cancel, so we can ignore them and write

$$ R = V^T\;. $$

Scaling, on the other hand, is different in equations (1) and (2), so the two diagonal matrices $L$ and $D$ differ by a multiplicative factor $\alpha$:

$$ \alpha L^{-2} = D\;. $$

We could find $\alpha$ by algebraic manipulation, but there is an easier way: Rewrite equation (2) in terms of $V$ and $D$, and then require $\mathbf{z}_0$ to satisfy the equation, since $\mathbf{z}_0\in E$ by construction. This yields

$$ \frac{1}{\alpha}\ \Delta_1^T\, V\, D\, V^T \Delta_1 = 1 $$

that is,

$$ \alpha = \Delta_1^T\, H_0 \Delta_1 $$

where $\Delta_1 = \mathbf{z}_1 - \mathbf{z}_0$.
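As a quick numerical sanity check of this result (with a hypothetical quadratic's gradient and positive definite Hessian, not the Rosenbrock function):

```python
import numpy as np

# hypothetical value, gradient, and positive definite Hessian at z0
H0 = np.array([[2.0, 0.0], [0.0, 8.0]])
g0 = np.array([2.0, 4.0])
z0 = np.array([1.0, 1.0])

z1 = z0 - np.linalg.solve(H0, g0)   # one Newton step
D1 = z1 - z0                        # Delta_1
alpha = D1 @ H0 @ D1                # alpha = Delta_1^T H_0 Delta_1

# equation (2) with R = V^T and alpha L^{-2} = D: z0 must lie on the ellipse
d, V = np.linalg.eigh(H0)
lhs = (z0 - z1) @ V @ np.diag(d / alpha) @ V.T @ (z0 - z1)
print(lhs)  # 1.0, confirming that z0 lies on the isocontour E
```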

The idea is now to display each Newton step by drawing the isocontour $E_k$ through $\mathbf{z}_k$ at every iteration. For notational simplicity, redefine

$$ \mathbf{z}_0 = \mathbf{z}_k \;\;\;\;\text{and}\;\;\;\; \mathbf{z}_1 = \mathbf{z}_{k+1} \;\;\;\;\text{and}\;\;\;\; E = E_k\;. $$

To draw $E$, we use the function `matplotlib.patches.Ellipse` (read the manual), which requires the center of $E$, the angle $\theta$ formed by the first axis of $E$ with the horizontal axis of the reference system, and the *full* lengths `width` and `height` of the axes (not the half-lengths!).

The point $\mathbf{z}_0$ is on the ellipse $E$, and the point $\mathbf{z}_1$ is its center.

From the definitions of $R$ and $V$, we see that (using Python subscripts to avoid confusion)

$$ V_{00} = \cos\theta\;\;\;\;\text{and}\;\;\;\; V_{10} = \sin\theta $$

so that we can find $\theta$ by using the function `numpy.arctan2`:

```
theta = numpy.arctan2(V[1][0], V[0][0])
```

However, this function returns $\theta$ in radians and in the interval $(-\pi, \pi]$, while `matplotlib.patches.Ellipse` requires an angle in degrees and in the interval $[0, 360)$, so a conversion is needed.

Write a function with header

```
def ellipse(z0, z1, Hessian, color='red'):
```

that takes two consecutive points `z0` and `z1` on a Newton optimization path and a function `Hessian` that computes the Hessian of some function from $\mathbb{R}^2$ to $\mathbb{R}$, and returns the ellipse $E$ associated with that step.
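One way the derivation above might translate into code, as a hedged sketch rather than a reference solution: the half-lengths come from $\alpha L^{-2} = D$, so $\ell_i = \sqrt{\alpha/d_i}$, and the angle conversion follows the note on `numpy.arctan2`.

```python
import numpy as np
from matplotlib.patches import Ellipse


def ellipse(z0, z1, Hessian, color='red'):
    H0 = Hessian(z0)
    d, V = np.linalg.eigh(H0)       # eigenvalues d, unit eigenvectors in columns of V
    D1 = z1 - z0
    alpha = D1 @ H0 @ D1            # alpha = Delta_1^T H_0 Delta_1
    half = np.sqrt(alpha / d)       # half-lengths, since alpha L^{-2} = D
    # radians in (-pi, pi] -> degrees in [0, 360)
    theta = np.degrees(np.arctan2(V[1, 0], V[0, 0])) % 360.0
    return Ellipse(z1, width=2 * half[0], height=2 * half[1],
                   angle=theta, fill=False, color=color)


# quick check on a hypothetical Hessian: z0 = (1, 0) should lie on the ellipse
E = ellipse(np.array([1.0, 0.0]), np.array([0.0, 0.0]),
            lambda z: np.array([[2.0, 0.0], [0.0, 8.0]]))
print(E.width, E.height, E.angle)  # 2.0 1.0 0.0
```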

Show your code and display the ellipse for the first Newton step of your optimization in the previous part.

Use the default color for the ellipse.

- Display your ellipse on top of the contours of the Rosenbrock function for values of the two components of $\mathbf{z}$ between $-4$ and $3$. This set of contours is in the same style as the one you drew in homework 5, and is computed by the function `RosenContours` in `hw6.py`. Look at the code to see what this function returns.
- Also draw a red dot at $\mathbf{z}_0$ and a red cross (`x`) at $\mathbf{z}_1$.

Make a new plot that displays the ellipses for the first four steps of your Newton optimization path you found in the previous part of this assignment.

Show your code and the resulting plot.

- The ellipses are in a single diagram, on top of the contour diagram produced by `RosenContours`.
- The four ellipses should be in red, green, blue, and orange, in this order.
- For each ellipse, superimpose a dot at $\mathbf{z}_k$ and a cross (`x`) at $\mathbf{z}_{k+1}$, both in the same color as the ellipse. This will cause some dots and crosses for different ellipses to be drawn in the same position. This is intended.

The file `hw6.py` also defines a function `makeData` that returns a training set `T` and a test set `S` if you call

```
T, S = hw6.makeData()
```

The data points are images of handwritten digits, just as in a previous assignment. However, only the digits 4 and 9 are used, so the classification problem is binary. Digit 4 is given label 0, and digit 9 is given label 1.

Section 2 of the class notes on binary linear predictors defines the cross-entropy loss, the score function, and the risk function for a binary logistic-regression classifier with parameter vector $\mathbf{v}\in\mathbb{R}^m$, where $m = d+1$ and $\mathbf{x}\in\mathbb{R}^d$. The notes in that Section also define the gradient and Hessian of the risk. Read and understand that Section carefully.

Define functions with headers

```
def lossXE(y, p):
def score(x, v):
```

that implement the cross-entropy loss and score function for a logistic-regression classifier.

Show your code, and a single diagram that shows the two plots of `lossXE(0, p)` and `lossXE(1, p)` for $p$ between $10^{-3}$ and $1 - 10^{-3}$.

Use natural logarithms.

If the score is too close to either 0 or 1, then the cross-entropy loss may become undefined. Also, if the argument to the score function is too small or too large, the `exp` function can cause numerical overflow or underflow. To prevent both problems, if

$$ a = b + \mathbf{w}^T\mathbf{x} $$

in the `score` function, then clip the value of $a$ between $-10$ and $10$:

`a = max(-10, min(10, a))`
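A minimal sketch of `lossXE` and `score` consistent with the clipping rule above. The $(b, \mathbf{w})$ ordering inside `v` is an assumption made here for illustration; the authoritative definitions are in the class notes.

```python
import numpy as np


def lossXE(y, p):
    # cross-entropy loss for true label y in {0, 1} and score p in (0, 1)
    return -np.log(p) if y == 1 else -np.log(1.0 - p)


def score(x, v):
    # assumed layout: v[0] is the bias b, v[1:] is the weight vector w
    a = v[0] + np.dot(v[1:], x)
    a = max(-10.0, min(10.0, a))   # clip to avoid overflow and undefined loss
    return 1.0 / (1.0 + np.exp(-a))


p = score(np.array([1.0, 2.0]), np.zeros(3))
print(p)  # 0.5 for the all-zeros parameter vector
```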

Define functions with headers

```
def riskXE(v):
def riskXEGrad(v):
def riskXEHess(v):
```

that implement the cross-entropy risk on set `T` (loaded by `hw6.makeData()`), its gradient, and Hessian, following the formulas in the notes.
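Since the required formulas live in the class notes, the following is only an unofficial sketch of the general shape these three functions take. It uses the standard logistic-regression expressions for the gradient and Hessian (which should match the notes, but verify against them) and a tiny hypothetical `T` in place of the one from `hw6.makeData()`.

```python
import numpy as np


def score(x, v):
    a = max(-10.0, min(10.0, v[0] + np.dot(v[1:], x)))
    return 1.0 / (1.0 + np.exp(-a))


def lossXE(y, p):
    return -np.log(p) if y == 1 else -np.log(1.0 - p)


# hypothetical tiny training set standing in for hw6.makeData()'s T
T = [(np.array([0.5]), 1), (np.array([-0.5]), 0)]


def riskXE(v):
    # average cross-entropy loss over the training set
    return np.mean([lossXE(y, score(x, v)) for x, y in T])


def riskXEGrad(v):
    # standard expression: mean of (p - y) * (1, x)
    gs = [(score(x, v) - y) * np.concatenate(([1.0], x)) for x, y in T]
    return np.mean(gs, axis=0)


def riskXEHess(v):
    # standard expression: mean of p (1 - p) * outer((1, x), (1, x))
    Hs = []
    for x, y in T:
        p = score(x, v)
        xt = np.concatenate(([1.0], x))
        Hs.append(p * (1 - p) * np.outer(xt, xt))
    return np.mean(Hs, axis=0)


print(riskXE(np.zeros(2)))  # log(2) ≈ 0.693 at v = 0
```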

Then run your own `Newton` to train the optimal parameter vector $\mathbf{v}$ on `T`, starting with a vector $\mathbf{v}_0$ of all zeros.

Show your code and give the number of iterations to convergence.

- For compatibility with `Newton`, these functions use `T` as defined in the environment, rather than passing `T` as a parameter. This is not best coding practice, because now `T` is hard-wired in the functions and cannot be changed, but it simplifies your task.
- Use the default values for the keyword arguments of `Newton`, except that `normalize` is set to `True` (more on this point in a later problem).
- Remember that `Newton` outputs the whole history, which makes it easy to tell the number of iterations.
- Better numerical methods exist. This problem has purely pedagogical value.

The solution is obtained with a variant of Newton's method that normalizes the value of $\mathbf{z}$ at every iteration, because the parameter `normalize` is set to `True`. In this application of the method, the unknown vector $\mathbf{z}$ really represents $\mathbf{v}$, the vector of parameters of the logistic regressor.

Normalization changes the method quite substantially, and some of the guarantees of Newton's method no longer apply. We ignore this issue in this assignment.

Why do we normalize the value of $\mathbf{v}$ at every iteration? Answer this question by explaining why the norm of $\mathbf{v}$ is meaningless for this problem.

Write functions with headers

```
def h(x, v):
def risk01(v, S):
```

that implement the logistic-regression classifier $h$ with parameter $\mathbf{v}$ and the zero-one risk on set $S$.
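As a hedged sketch (assuming `score` from the earlier problem, the usual threshold-at-$1/2$ decision rule, and `S` as a list of `(x, y)` pairs — the actual layout returned by `hw6.makeData` may differ):

```python
import numpy as np


def score(x, v):
    # logistic score, clipped as in the earlier problem
    a = max(-10.0, min(10.0, v[0] + np.dot(v[1:], x)))
    return 1.0 / (1.0 + np.exp(-a))


def h(x, v):
    # classifier: predict label 1 when the score is at least 1/2
    return 1 if score(x, v) >= 0.5 else 0


def risk01(v, S):
    # zero-one risk: fraction of misclassified samples in S
    return sum(h(x, v) != y for x, y in S) / len(S)


# tiny hypothetical data set: label 1 iff the single feature is positive
S = [(np.array([1.0]), 1), (np.array([-1.0]), 0), (np.array([2.0]), 1)]
v = np.array([0.0, 5.0])   # bias 0, weight 5
print(risk01(v, S))  # 0.0
```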

Show your code and the values of *zero-one* training and test risk for the optimal parameter vector $\mathbf{v}^{\ast}$ you found in the previous problem. Express these risks as *percentage* values.