# Optimization by Gradient Descent

This post is part of a series of notes on machine learning.

This post is literate Haskell; you can load the source into GHCi and play along.

Boilerplate.

```
module GradientDescent where
import Control.Applicative
import Test.QuickCheck hiding (Function)
import Test.QuickCheck.Test
import Indices
import IndexIsos
import Tensors
import TensorFunctions
import Gradients
import GradientChecking
```

Now that we know how to find the gradient of a function, we can use it to optimize the function using *gradient descent*. The intuition here is that the gradient of a function at a point gives the direction of steepest ascent there, so the negative of the gradient gives the direction of steepest descent – the direction an \(n\)-dimensional marble would roll if released from rest.

The basic method goes something like this: we start with an initial guess \(v_0\), then update the guess using the recurrence \[v_{n+1} = v_n - \alpha \nabla(f)(v_n),\] where \(f\) is the function to be minimized and \(\alpha\) is some number (not necessarily constant) called the *learning rate*. At some point we decide to stop, and the last update \(v_n\) should be at least a local minimum of \(f\).

This paragraph is simple enough, but a lot of questions remain. How is \(\alpha\) determined, and should it change between iterations? How do we decide when to stop? These are questions without unique answers. But we can fill in the structure of the gradient descent algorithm without knowing them.

Say we have some type `st`

representing the “state” of a gradient descent process at a given time. Say further that we have some function `st -> r`

that gives the learning rate, and a function `Tensor r -> st -> Maybe st`

that decides to stop the iteration (by returning `Nothing`

) or to keep going (by returning a new state). Then gradient descent looks something like this.

```
gradDesc
:: (Num r)
=> Function r -- gradient
-> st -- initial state
-> Tensor r -- initial guess
-> (Tensor r -> st -> Maybe st) -- update state or stop
-> (st -> r) -- learning rate
-> Tensor r
gradDesc g st0 x0 update rate =
let
next st v = v .- ((rate st) .@ (g $@ v))
step st x =
let y = next st x in
case update y st of
Nothing -> y
Just w -> step w y
in
step st0 x0
```

To turn this into an actual gradient descent algorithm, we need some specific functions we can plug in for the update and learning rate. For instance, it’s reasonable to use a fixed learning rate.

And one particularly simple (if not super useful) state might be to keep track of the number of steps taken so far, and stop when that number reaches a limit.

```
fixedNumSteps
:: Int -> Tensor r -> Int -> Maybe Int
fixedNumSteps m _ k = if k >= m
then Nothing
else Just (k+1)
```

Now if `f`

is a tensor function and `v`

a tensor,

Carries out gradient descent for 100 steps with learning rate 0.5. A more practical update function keeps track of the last iteration and stops when the difference between it and the current value is less than some threshold.

```
maxAbsDiffLessThan
:: (Eq r, Ord r, Num r, Fractional r, RealFloat r)
=> r -> Tensor r -> Tensor r -> Maybe (Tensor r)
maxAbsDiffLessThan eps x y =
let
del = distanceBy MaxAbsDiff x y
in
if del < eps
then Nothing
else Just x
```

As a really simple example, fix a vector \(v \in \mathbb{R}^s\) and consider the function \(f : \mathbb{R}^s \rightarrow \mathbb{R}^s\) given by \[f(x)_i = (x_i - v_i)^2.\] This is quadratic in each coordinate with minimum at \(x_i = v_i\). Moreover this function is convex, so that minimum is unique, and gradient descent should find it regardless of the initial guess.

```
_test_grad_desc_line
:: (Eq r, Ord r, Num r, Fractional r, Floating r, RealFloat r, Show r, Arbitrary r)
=> r -> Test (Size -> Property)
_test_grad_desc_line r =
testName "grad desc line" $
\u -> u ~/= 0 ==>
forAll2 (arbTensorOf r u) (arbTensorOf r u) $
\v x0 ->
let
g = (idF u) .- (constF u v)
w = gradDesc g x0 x0
(maxAbsDiffLessThan (10**(-8)))
(fixedLearningRate 0.7)
in
(distanceBy MaxAbsDiff v w) < (10**(-6))
```

## Tests

```
_test_gradient_descent
:: (Eq r, Ord r, Num r, Fractional r, Floating r, RealFloat r, Show r, Arbitrary r)
=> r -> Int -> Int -> IO ()
_test_gradient_descent r num size = do
testLabel "Gradient Descent"
let
args = stdArgs
{ maxSuccess = num
, maxSize = size
}
runTest args (_test_grad_desc_line r)
main_gradient_descent :: IO ()
main_gradient_descent = _test_gradient_descent (0::Double) 100 5
```