When you submit your work, follow the instructions on the submission workflow page * scrupulously* for full credit.

**Important: Failure to do any of the following will result in lost points:**

Submit

**one**PDF file and**one**notebook per groupEnter

**all the group members**in your PDF Gradescope submission, not just in your documents. Look at this Piazza Post if you don't know how to do thisMatch each

**answer**(not question!) with the appropriate page in GradescopeAvoid large blank spaces in your PDF file

The gradient of a function $f(\mathbf{z})\ :\ \mathbb{R}^d \rightarrow \mathbb{R}$ is typically viewed as a column vector in the literature:

$$ \nabla f(\mathbf{z}) = \left[\begin{array}{c} \frac{\partial f}{\partial z_1} \\ \vdots \\ \frac{\partial f}{\partial z_d} \end{array}\right] \;. $$

A more natural (and not uncommon) view of the gradient is as a row vector, because then the gradient is merely a special case (for $e=1$) of the *Jacobian matrix* of a function $\mathbf{f}(\mathbf{z})\ :\ \mathbb{R}^d \rightarrow \mathbb{R}^e$, defined as the following $e\times d$ matrix:

$$ J_{\mathbf{f}}(\mathbf{z}) = \left[\begin{array}{ccc} \frac{\partial f_1}{\partial z_1} & \ldots & \frac{\partial f_1}{\partial z_d} \\ \vdots & & \vdots \\ \frac{\partial f_e}{\partial z_1} & \ldots & \frac{\partial f_e}{\partial z_d} \end{array}\right]\;. $$

The Jacobian matrix can therefore be viewed as the column vector of gradients of the components $f_1,\ldots, f_e$ of the function $\mathbf{f}$, one gradient per row.

The distinction between gradient as a row vector and gradient as a column vector is relevant in mathematics, but is of course moot if vectors are represented as `numpy`

arrays in Python, because these arrays do not distinguish between row vectors and column vectors.

The transformation from polar coordinates $\mathbf{z} = (r, \phi)$ to cartesian coordinates $(x, y)$ on the plane can be written as follows:

$$ \begin{eqnarray*} x &=& r \cos\phi\\ y &=& r \sin\phi \end{eqnarray*} $$

Write a formula for the Jacobian $J_{P2C}$ of this transformation.

Write functions with headers

```
def P2C(z):
```

and

```
def JacobianP2C(z):
```

that compute the transformation above and its Jacobian matrix. Inputs and outputs should be `numpy`

arrays.

Show your code. You will test it in a later problem.

In homework 4, we explored the concept of gradient and Hessian in a case in which the function $f(\mathbf{z})\ :\ \mathbb{R}^2 \rightarrow \mathbb{R}$ was known analytically. This is the situation students are most familiar with, because that's what calculus courses emphasize.

In practice, however, $f$ is often known either (i) through a piece of code in some programming language, or, even more opaquely, (ii) as a black box: A program that can be called at will with an input $\mathbf{z}$ and produces some output $y$. We will explore options for scenario (i) in a later assignment. In this part of this assignment, we take the black box route: We know nothing about $f$ except that it is differentiable (or else we cannot differentiate it!). Of course, this situation is general and subsumes scenario (i) by just forgoing any analysis of the code. However, we will see later that having access to the code opens other, interesting avenues.

By definition of (partial) derivative, if $f(\mathbf{z})\ :\ \mathbb{R}^d \rightarrow \mathbb{R}$, we can write

$$ \frac{\partial f}{\partial z_i} = \lim_{\delta\rightarrow 0} \frac{f(\mathbf{z} + \delta \mathbf{e}_i) - f(\mathbf{z})}{\delta} $$ where $\mathbf{e}_i$ is the $i$-th column of the $d\times d$ identity matrix. That is, $\mathbf{e}_i$ is a vector of $d$ zeros except for a one in position $i$.

We can approximate the limit above by picking a small value of $\delta$ and disposing of the limit:

$$ \frac{\partial f}{\partial z_i} \approx \frac{f(\mathbf{z} + \delta \mathbf{e}_i) - f(\mathbf{z})}{\delta} \;. $$

For reasons that would take too long to get into, a much better numerical approximation is provided by the so-called *central difference*

$$ \frac{\partial f}{\partial z_i} \approx \frac{f(\mathbf{z} + \delta \mathbf{e}_i) - f(\mathbf{z} - \delta \mathbf{e}_i)}{2\delta} \;. $$

There is a whole literature on how to choose $\delta$ appropriately: Too large, and we are too far from the limit. Too small, and numerical errors in the computation prevail. We take a pragmatic approach in this introductory exploration and use $\delta=10^{-5}$ most of the time. More on this aspect later.

Write a Python function with header

```
def Jacobian(f, z, delta=1e-5):
```

that takes a function `f`

from $\mathbb{R}^d$ to $\mathbb{R}^e$, a `numpy`

vector `z`

with $d$ entries, and an optional value for $\delta$ and returns a `numpy`

array with the Jacobian of `f`

at `z`

, using the central difference formula given above.

Show your code. You will test it in the next problem.

Your function `Jacobian`

must work for any function `f`

from $\mathbb{R}^{d}$ to $\mathbb{R}^e$ for any $d>0$ and $e>0$. The function `f`

takes a `numpy`

vector with $d$ entries as input and produces a `numpy`

vector with $e$ entries. The function `Jacobian`

outputs a `numpy`

array of shape $(e, d)$ (not $(d, e)$!).

For later use, it is important that `Jacobian`

return a `numpy`

array in all cases (even when the result is a scalar).

Show the result of running the tests below. This will happen automatically once your functions `Jacobian`

, `P2C`

, and `JacobianP2C`

, are defined correctly (and you run the cell below).

These tests use the default value for $\delta$.

In [1]:

```
def compare(a, f, b, delta=1e-5):
def a2s(a):
def n2s(x):
return '{:g}'.format(round(x, 4))
try:
return '[' + '; '.join([', '.join([n2s(y) for y in row]) for row in a]) + ']'
except TypeError:
try:
return '[' + ', '.join([n2s(y) for y in a]) + ']'
except TypeError:
return '[]' if a.size == 0 else n2s(a)
aName, fName, bName = a.__name__, f.__name__, b.__name__
msgBase = '{:s}({:s}, {{:s}}) = {{:s}}\n{:s}({{:s}}) = {{:s}}'
msg = msgBase.format(aName, fName, bName)
zs = np.array([[0, 0], [1, 0], [2, 1], [2, 2]])
for z in zs:
print(msg.format(a2s(z), a2s(a(f, z, delta)), a2s(z), a2s(b(z))), end='\n\n')
try:
compare(Jacobian, P2C, JacobianP2C)
except NameError:
pass
```

The Hessian is the Jacobian of the gradient, in the following sense. If the gradient of $f$ is a row vector,

$$ \mathbf{g}^T = \nabla f(\mathbf{z}) = \left[\begin{array}{ccc} \frac{\partial f}{\partial z_1} & \ldots & \frac{\partial f}{\partial z_d} \end{array}\right] \;, $$

the Hessian

$$ H_f(\mathbf{z}) = \left[\begin{array}{ccc} \frac{\partial f^2}{\partial z_1^2} & \ldots & \frac{\partial f^2}{\partial z_1 z_d} \\ \vdots & & \vdots \\ \frac{\partial f^2}{\partial z_d z_1} & \ldots & \frac{\partial f^2}{\partial z_d^2} \end{array}\right] $$

can be written as follows:

$$ H_f(\mathbf{z}) = \left[\begin{array}{c} \frac{\partial \mathbf{g}^T}{\partial z_1} \\ \vdots \\ \frac{\partial \mathbf{g}^T}{\partial z_d} \end{array}\right] \;. $$

Use the fact that the Hessian is the Jacobian of the gradient to write a Python function with header

```
def Hessian(f, x, delta=1e-5):
```

that uses your `gradient`

function to compute the Hessian of `f`

at `x`

. Show your code.

Your function must work for a function $f\ :\ \mathbb{R}^{d}\rightarrow \mathbb{R}$ for any $d$, not just for $d=2$. However, the codomain of $f$ has dimension $e=1$ now.

You will get full credit if your code is correct and uses `Jacobian`

to fullest advantage.

Make sure that the value of `delta`

is propagated to all calls to `Jacobian`

.

Show the result of running the tests below. This will happen automatically once your function `Hessian`

is defined correctly (and you run the cell below).

The tests use a function `f`

which is the same as in homework 4, and is given below. They also use a function `gradientF`

, given below, which computes the gradient of `f`

through the analytic formula (as you did in homework 4). These tests use the default value for $\delta$.

In [2]:

```
shift, scale = [2, 1], 10
def f(z):
d = z - shift
return np.array(np.inner(z, z) / scale + np.exp(-np.inner(d, d)))
def gradientF(z):
d = z - shift
return 2 * (z / scale - d * np.exp(-np.inner(d, d)))
def HessianF(z):
I = np.eye(2)
d = z - shift
return 2 * (I / scale + (2 * np.outer(d, d) - I) * np.exp(-np.inner(d, d)))
try:
compare(Jacobian, f, gradientF)
compare(Hessian, f, HessianF)
except NameError:
pass
```

The default value for $\delta$ happens to work for the examples above. This is because all functions involved are "tame," in the sense that their values have comparable magnitudes. Even then, however, the choice of $\delta$ is not inconsequential.

Write one clear and concise sentence to describe which results are good and which are not in the tests below.

There are of course better methods for choosing $\delta$ than just trying some value. In particular, the choice needs to adapt to the range of values $f(\mathbf{x})$ encountered during the computations.

However, the experiment in this problem should at least serve as a warning that numerical differentiation can be tricky to get right. A later assignment will explore a different technique for computing derivatives, called *algorithmic differentiation* or *automatic differentiation*. Variants of this techniques are used in many packages that implement deep learning methods. Stay tuned.

In [3]:

```
try:
delta = 1e-9
compare(Jacobian, f, gradientF, delta)
compare(Hessian, f, HessianF, delta)
except NameError:
pass
```

In the steepest descent algorithm for the minimization of a function $f(\mathbf{z})\ :\ \mathbb{R}^d\rightarrow \mathbb{R}$, the new value $\mathbf{z}_{k+1}$ of $\mathbf{z}$ is found from the current value $\mathbf{z}_k$ by a technique called *line search*.

Specifically, given the search direction

$$ \mathbf{p}_k = - \nabla f(\mathbf{z}_k)\;, $$

line search defines the function $h\ :\ \mathbb{R} \rightarrow \mathbb{R}$ as

$$ h(\alpha) = f(\mathbf{z}_k + \alpha \mathbf{p}_k) $$

and finds a local minimum of $h(\alpha)$ for $\alpha > 0$. The line search function returns the corresponding point $\mathbf{z}_{k+1}$.

The class notes show an iterative bracketing technique to perform line search. A first stage finds an inital bracketing triple $(a, b, c)$, and a second stage shrinks the triple.

In this part, you will implement and test the steepest descent algorithm. This implies that *you are not allowed to use any existing function that implements steepest descent*, either exactly or substantively.

However, you *are* allowed to use the function `scipy.optimize.minimize_scalar`

, which implements the shrinkage part of line search, as explained below.

Using the imports and definition in the cell below, write a function with header

```
def lineSearch(f, g, z0):
```

that performs line search on the function `f`

, whose gradient is computed by the function `g`

, starting at point `z0`

. If the starting point $\mathbf{z}_0$ is in $\mathbb{R}^d$, then functions $f$ and $g$ have the following signatures:

$$ f\ :\ \mathbb{R}^d \rightarrow \mathbb{R} \;\;\;\;\;\text{and}\;\;\;\;\; g\ :\ \mathbb{R}^d \rightarrow \mathbb{R}^d $$

Show your code, and the result of running the function with the function $f$ and value `z0`

defined below. Defining the corresponding gradient $g$ is your task.

In [4]:

```
from scipy import optimize as opt
import numpy as np
import math
import matplotlib.pyplot as plt
small = math.sqrt(np.finfo(float).eps)
f, z0 = lambda z: -np.sin(z), 0
```

If the magnitude of the gradient of $f$ at $\mathbf{z}_0$ is smaller than `small`

(defined above), then `lineSearch`

should just return `z0`

without further work. Otherwise, `lineSearch`

should return the new value of $\mathbf{z}$ it found (not $\alpha$).

The class notes state that the third element $c$ of the initial bracketing triple can be found by starting at some small value and then increasing $c$ until $h(c)$ is greater than $h$ evaluated at the previous value of $c$. In doing so, start at $c =$ `small`

and increase $c$ every time by a multiplicative factor 1.2:

```
c *= 1.2
```

A factor of 2 would work as well, but 1.2 is a bit safer.

Allow for a maximum value $c_{\max} = 100$ for $c$, and report failure if that value is exceeded. This should not happen in this assignment.

It will be convenient to use a two-item tuple for the initialization of the keyword parameter `bracket`

for the function `scipy.optimize.minimize_scalar`

. If `bracket`

is the pair $(a, c)$, then the function will call another function (`scipy.optimize.bracket`

) that finds a bracketing triple $(a, b, c)$, as defined in the class notes. So your function `lineSearch`

first finds $(a, c)$ (hint: $a = 0$), and then calls `scipy.optimize.minimize_scalar`

to compute the result of line search. Read the `scipy`

manual to understand how to use the result from `scipy.optimize.minimize_scalar`

.

Write a function with header

```
def steepest(f, g, z0, maxK=10000, delta=small, remember=False):
```

that uses your `lineSearch`

function to implement steepest descent with line search.

Show your code and the value of $\mathbf{z}$ that results from minimizing the provided function `Rosenbrock`

with `steepest`

. Start the minimization at $\mathbf{z}_0 = (-1.2, 1)$ (defined below), and use the default values for the keyword arguments.

The gradient of the function `Rosenbrock`

is also provided as function `RosenGrad`

below. The function `Rosenbrock`

has a true minimum at $\mathbf{z}^{\ast} = (1, 1)$ (defined as `zStar`

below).

Minimization stops when either the norm of the difference between consecutive values of $\mathbf{z}$ differ by less than `delta`

or when the number of iterations equals or exceeds `maxK`

.

If the argument `remember`

is `False`

, the function `steepest`

returns the value of $\mathbf{z}$ it found. If `remember`

is `True`

, the function returns a pair `(z, history)`

. The first item of the pair is still $\mathbf{z}$, and the second item is a $(m+1)\times d$ `numpy.array`

that records the value traversed by every step of steepest descent, including $\mathbf{z}_0$. Here, $m$ is the number of steps and $\mathbf{z}\in\mathbb{R}^d$.

One step is defined as one call to `lineSearch`

. This implies that `history`

is *not* to record the intermediate steps taken within the call to `lineSearch`

.

Be patient, as the search may take a while.

In [5]:

```
def Rosenbrock(z):
return 100 * (z[1] - z[0] ** 2) ** 2 + (1 - z[0]) ** 2
def RosenGrad(z):
return np.array([400 * (z[0] ** 3 - z[0] * z[1]) + 2 * z[0] - 2,
200 * (z[1] - z[0] ** 2)])
z0, zStar = np.array([-1.2, 1]), np.array([1, 1])
```

Now run `steepest`

as follows (the `try`

/`except`

is there so that the notebook will still run if `steepest`

is undefined).

In [6]:

```
try:
[zHat, history] = steepest(Rosenbrock, RosenGrad, z0, maxK=1000, remember=True)
except NameError:
pass
```

Make a contour plot of the Rosenbrock function using 10 levels drawn as thin gray lines (use `colors='grey', linewidths=1`

in your call to `matplotlib.pyplot.contour`

to this effect) for $-1.5 \leq z_1 \leq 1.5$ and $-0.5 \leq z_2 \leq 1.5$. To make the contour plot, sample this rectangle of values with 100 samples in each dimension.

Superimpose a plot of the path recorded in `history`

on the contour plot. Also draw a blue dot at $\mathbf{z}_0$ and a red dot at $\mathbf{z}^{\ast}$, the true minimum.

Show code and plot.

Convergence slows down as $\mathbf{z}^{\ast}$ is approached, and even after 1000 iterations there is still a way to go. To see this slowdown more clearly, plot a vector `distance`

with the Euclidean distances between each of the points in `history`

(obtained in the previous problem) and $\mathbf{z}^{\ast}$. Label the plot axes meaningfully.

Show code and plot.