# Gradient descent for a linear regression task

## Data preprocessing

- Load data from the CSV file into a list

```python
trainDataList = list()
```

Concatenate all training data into one list so that we can iterate over it hour by hour. With a 10-hour sliding window (9 hours of features plus 1 target hour), this gives

`(24 * days - 10 + 1)`

training samples, stored in `trainDataIteratedPerHour`.

```python
# n lines to be one day
```

- Build `x_data` and `y_data`

```python
"""
```
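The window construction itself isn’t shown above, so here is a minimal sketch of how the pairs could be built. It assumes a `trainDataIteratedPerHour`-style input: a list with one 18-value feature row per hour, with PM2.5 at a hypothetical index `PM25` (adjust to the real column):

```python
PM25 = 9  # hypothetical index of PM2.5 within the 18 features

def build_xy(hours):
    """Slide a 10-hour window: 9 hours of all 18 features as input,
    the 10th hour's PM2.5 as the target."""
    x_data, y_data = [], []
    for i in range(len(hours) - 9):
        window = hours[i:i + 9]
        # flatten 9 hours x 18 features into one 162-dim vector
        x_data.append([v for hour in window for v in hour])
        y_data.append(hours[i + 9][PM25])
    return x_data, y_data
```

Each sample is 9 × 18 = 162 values, which matches the `w = [0.01] * 162` initialisation used later.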

- Draw a plot (to get a feel for the range of the data)

```python
# draw plot of pm2.5 range?
```

## Loss function

```python
def lossFunction(w, x_data, y_data):
```
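Only the signature survives above; a minimal sum-of-squared-errors version consistent with the 162-dimensional `w` used later (and assuming no separate bias term) might look like:

```python
def lossFunction(w, x_data, y_data):
    # sum of squared errors over the whole training set
    L = 0.0
    for x, y in zip(x_data, y_data):
        y_hat = sum(wi * xi for wi, xi in zip(w, x))  # linear prediction
        L += (y_hat - y) ** 2
    return L
```

This is the L whose per-epoch values are printed and compared throughout the tuning below.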

## Iterations for gradient descent

```python
# Iterations
```

Run it for 10 iterations:

```python
lr = 0.0000000000001 # learning rate
```
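The loop itself isn’t shown; a minimal batch-gradient-descent sketch under the same assumptions (pure-Python lists, squared-error loss, no bias term) — not the author’s exact code — could be:

```python
def iterationRun(lr, iteration, x_data, y_data, w):
    for epoch in range(iteration):
        grad = [0.0] * len(w)
        L = 0.0
        for x, y in zip(x_data, y_data):
            y_hat = sum(wi * xi for wi, xi in zip(w, x))
            err = y_hat - y
            L += err ** 2
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj  # derivative of (y_hat - y)**2
        w = [wi - lr * g for wi, g in zip(w, grad)]
        print(epoch, L)
    return w
```

Each epoch sums the gradient over the full training set before taking one step, which is why L moves by a roughly constant amount per epoch early on.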

## Tuning

### Find good initial learning rate

Start with `lr = 0.0000000000001`

```python
In [19]: lr = 0.0000000000001 # learning rate
```

Hmmm… try a greater lr:

`lr = 0.00000000001`

The loss is now descending faster, by roughly 60000 per epoch.

```python
In [21]: lr = 0.000000000001 # learning rate
```

Our training set has `5750` samples:

```python
In [20]: len(y_data)
```

If the predicted PM2.5 value (i.e. y) is allowed an error of 10, each sample contributes a squared error of 100, so the target L value should be `5750*100=575000`

At the current descent speed, we will need about 88 epochs:

```python
In [24]: (5840184-575000)/60000
```

Let’s make it 10 times faster to see if the target L can be reached in 10 epochs:

`lr = 0.0000000001`

```python
In [27]: lr = 0.0000000001 # learning rate
```

```python
# plot it
```

As we printed L only at the first step before, we can now see that the learning is way faster in the initial step; the first L printed is already in a smaller range!

We can also see that it descends more slowly after 10 epochs, so it is still short of our target `575000`, creeping toward it in small steps.

I know (for sure) that I will need an adaptive learning rate, and a customised learning rate per feature (Adagrad / Adam). Before that, though, I’d like to see how it goes with a larger learning rate — and it goes to a mess!

```python
In [46]: lr = 0.000000001 # learning rate
```

### lr = `1e-10`

Then we can go back to the last good initial learning rate and get the final `w` by adding `print(w)` in `iterationRun()`. We also add the initial `w` as an argument so that it can be modified:

```python
def iterationRun(lr,iteration,x_data,y_data, w = [0.01] * 162):
```
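A side note on the `w = [0.01] * 162` default: in Python, a mutable default argument is created once at function-definition time and shared across calls, so if the function ever updates that list in place, the “initial” value silently carries over between runs. Passing the initial `w` explicitly (as done below when continuing training) avoids the trap. A tiny illustration, unrelated to the PM2.5 code:

```python
def demo(values=[]):
    # the default list is created ONCE and reused on every call
    values.append(1)
    return len(values)

def demo_safe(values=None):
    if values is None:
        values = []  # a fresh list on every call
    values.append(1)
    return len(values)
```

Calling `demo()` repeatedly returns 1, 2, 3, … because the same list keeps growing, while `demo_safe()` always returns 1.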

Then after re-running, we got the `w`:

```python
In [49]: lr = 0.0000000001 # learning rate
```

### Run 10~20 epoch, lr=`1e-10`

Another 10 epochs (continued from the first 10):

```python
# w = [] output value in initial 10 epoch
```

The result is not bad! Let’s continue.

### Run 20-30 epoch, lr=`1e-10`

Another 10 epochs:

```python
lr = 1e-10 # learning rate
```

Let’s zoom in on the last 25 epochs.

We still have 1060000 - 575000 = 485000 to go, and the speed is now about 3000 of descent per epoch, which means we could reach the target in about 160 epochs if it kept the same speed :-) (for sure it won’t).

### 30~200 epoch, lr=`1e-10`

After running to epoch 200, L reached `809150.2256672443`.

### 207~388 epoch, lr=`2e-10`

This time let’s double the lr. The initial speed is faster, though it still ends up very slow; after 180 epochs, L is `703824.3536660115`.

```python
In [105]: def iterationRun(lr,iteration,x_data,y_data, w = [0.01] * 162):
```

### 388~391 epoch, lr=`1e-8`

By tuning lr as we did in the beginning, we find that `1e-8` descends L faster, while `1e-7` makes the values go to a mess:

```python
In [135]: lr = 1e-8 # learning rate
```
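Why a too-large lr makes the values “go to a mess” can be seen on a toy quadratic: once the step overshoots the minimum by more than the distance to it, each update grows the parameter instead of shrinking it. A tiny demo, unrelated to the PM2.5 code:

```python
def run(lr, steps=10, w=1.0):
    # gradient descent on f(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = run(0.4)  # each step multiplies w by (1 - 0.8): converges toward 0
large = run(1.1)  # each step multiplies w by (1 - 2.2): |w| blows up
```

With this loss, the boundary is lr = 1.0; beyond it the iterates oscillate with growing magnitude, which is exactly the “crazy” L seen with `1e-7` here.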

Let’s do it with lr=`1e-8` for another 20 epochs:

Among the 0th~7th epochs, L was normal at first, while from the 4th epoch on, L goes crazy…

```python
In [144]: lr = 1e-8 # learning rate
```

Below is a record for w in epoch 391:

```python
In [171]: lr = 1e-8 # learning rate
```

### 392~395 epoch, lr=`1e-9`

Let’s set w back to the 3rd epoch’s value, and try tuning the lr down to `1e-9`:

```python
In [174]: w = W[2]
```

This is really strange: we got a w with L = `5429166.075156927`, which looks like overfitting (I guess).

```python
In [190]: lossFunction(w391,x_data,y_data)
```

We could verify both w391 and w395 with public test data.

## Verify w391 and w395 with test data

### Processing test data

#### test.csv

```python
# verify with test data test.csv
```

#### ans.csv

```python
# ans.csv
```

### Verify data

It looks like w391 is our best output :-) for now, and w395 is overfitting!

```python
In [86]: lossFunction(w391,x_data_test,y_answer)
```

Draw the plot

```python
y_w391 = np.array(y_data_w391)
```

The plot is as below

# Start studying other gradient descent algorithms

ref: http://ruder.io/optimizing-gradient-descent

ref: https://www.slideshare.net/SebastianRuder/optimization-for-deep-learning

ref: https://zhuanlan.zhihu.com/p/22252270

## Adagrad

### Descent speed

```python
import numpy as np
```

It’s basically the same; the only differences are:

- in the very beginning, Adagrad wandered into crazy territory, but it soon corrected itself back onto a normal path
- Adagrad converges more slowly than BGD …
- our BGD was actually a human-tuned one, which means Adagrad finds the correct path by itself more easily (no human intervention needed)
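The per-feature learning rate comes from the Adagrad update itself: each parameter’s step is divided by the root of its own accumulated squared gradients, so frequently-large gradients get damped automatically. A minimal sketch (not the exact code used here):

```python
import math

def adagrad_step(w, grad, G, lr, eps=1e-8):
    # G accumulates the squared gradient per parameter across all steps
    for j, g in enumerate(grad):
        G[j] += g * g
        w[j] -= lr * g / (math.sqrt(G[j]) + eps)
    return w, G
```

Because G only grows, the effective step size shrinks over time — which matches the observation above that Adagrad converges more slowly than the hand-tuned BGD.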

We can also look at the initial 30 epochs:

```python
In [8]: initialN = 30
```

### Verify the result

```python
# loss function compare
```

Check the prediction result figure

```python
y_data_w391 = [y(w391,x) for x in x_data_test ]
```