Commit f335e6d8 authored by Simon Clarke's avatar Simon Clarke
Browse files

Finalised plotting notebook

parent 64f69a62
%% Cell type:markdown id: tags:
## Multiple Linear Regression - Underfitting and Overfitting
* Add formatting
* Show coefficients for initial examples
* Show accuracy vs order of polynomial
* Plot against testing set and show accuracy
* Use PolynonialFeatures to create Vandermonde matrix (don't need to calculate intercept)
* Explain CV in more detail
* Emphasise bias and variance
In this exercise we will investigate how polynomial regression can be implemented using `sklearn`, the dangers of underfitting and overfitting data, and how overfitting can be addressed using the `Ridge` and `Lasso` regularization techniques.
Initially, we import some libraries that we are going to need.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import warnings
warnings.filterwarnings('ignore')
```
%% Cell type:markdown id: tags:
We define a function that we will use to fit the polynomials to.
%% Cell type:code id: tags:
``` python
def true_fun(X):
return np.cos(.15*np.pi*X)
```
%% Cell type:markdown id: tags:
Now define a training set `x` and `y`, and a testing set `x_test` and `y_test`. Both of these have a small amount of noise added to the function.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
nsamp = 30
x = 10*rng.rand(nsamp)
y = true_fun(x)+.1*rng.randn(nsamp)
x_test = 10*rng.rand(nsamp)
y_test = true_fun(x_test)+.1*rng.randn(nsamp)
```
%% Cell type:markdown id: tags:
The easiest way to fit a polynomial our data set is using the numpy routine `polyfit`. This minimizes the least squares error, the same as for linear regression. `polyfit` calculates the coefficients of the polynomial of degree n, starting from the highest coefficient and descending to the constant term. The routine `poly1d` takes the coefficients of a polynomial, as given by `polyfit`, and generates a one-dimensional polynomial function. Here we calculate these polynomials for degrees 0, 1, 4 and 15.
%% Cell type:code id: tags:
``` python
p0 = np.poly1d(np.polyfit(x, y, 0))
p1 = np.poly1d(np.polyfit(x, y, 1))
p4 = np.poly1d(np.polyfit(x, y, 4))
p15 = np.poly1d(np.polyfit(x, y, 15))
```
%% Cell type:markdown id: tags:
We now plot these polynomials against the training set and the true function.
The zeroth and first order polynomials (constant and linear functions) can be seen to be underfitted, due to the fact that there is insufficient complexity in the model. Such models are said to have high *bias* and low *variance*. For any randomly selected training set on this interval they will give very similar results.
The fifteenth order polynomial suffers from overfitting. The polynomial passes through most of the points in the training set, but in the intervals between these points there are very large variations. For another randomly selected training set the results would be noticeably different. Consequently the model is said to have high *variance*. This can be addressed by reducing the complexity of the model, or increasing the size of the training set.
The quartic can be seen to perform very well, and balances *bias* and *variance*. It is apparent that this model has optimal complexity for the data set. However, how do we pick the optimal complexity for our model? This is where machine learning comes in. Here we will show how this can be achieved using regularization.
%% Cell type:code id: tags:
``` python
xfit = np.linspace(0,10,1000)
plt.scatter(x,y,label="samples")
plt.plot(xfit,true_fun(xfit),label="Model")
plt.plot(xfit,p0(xfit),label="Zeroth order")
plt.plot(xfit,p1(xfit),label="Line")
plt.plot(xfit,p4(xfit),label="Quartic")
plt.plot(xfit,p15(xfit),label="Fifteenth order")
plt.legend(loc="best")
plt.xlim((0,10))
plt.ylim((-2,2))
```
%%%% Output: execute_result
(-2, 2)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Now we perform polynomial regression using `sklearn`. In ADS1001 we saw how to do this with the COVID-19 dataset. Here we will again show how to undertake multiple linear regression, but in this case, applied to fitting polynomials to our data.
We introduce three linear models from `sklearn`. `RidgeCV` and `LassoCV` are just cross-validation forms forms of `Ridge` and `Lasso`. We also introduce a function to calculate $R^2$. Recall that this lies between 0 and 1, and for the best models, this will be close to 1.
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, LassoCV
from sklearn.metrics import r2_score
```
%% Cell type:markdown id: tags:
To implement linear regression, we instatiate the model and fit our training data set `x` and `y`. This will give a slope and intercept for the line of best fit, and we can then find the predicted values for this model, `y_pred`, for our values `x_test`. The success of the model can be found be comparing the predicted values with the actual values for the test set, `y_test`. `np.newaxis` allows us to convert a vector to a matrix with one column.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
model = LinearRegression(fit_intercept=True)
model.fit(x[:,np.newaxis],y)
yfit = model.predict(xfit[:,np.newaxis])
y_pred = model.predict(x_test[:,np.newaxis])
rsquared_linear = r2_score(y_pred,y_test)
plt.scatter(x,y)
plt.plot(xfit,yfit)
plt.plot(xfit,true_fun(xfit))
plt.title("Linear Regression Rsq=%f" % (rsquared_linear));
```
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
print("Model slope: ",model.coef_[0])
print("Model intercept:",model.intercept_)
```
%%%% Output: stream
Model slope: -0.2079715661158376
Model intercept: 0.8608441005248983
%% Cell type:markdown id: tags:
To undertake polynomial regression is a bit more complex.
First we need to define a Vandermonde matrix. Assume we have a vector of x values given by `[x1,x2,...,xn]`, then the entry in row `i` and column `j` of the Vandermonde matrix will be `(xi)**(m-(j-1))`, where the two asterisks denote exponentiation. Therefore the first row will be `[(x1)**m,(x1)**(m-1), ...,x1,1]`. Hence the Vandermonde matrix has `n` data points and `m+1` features for each data point. Since the last column only has ones, it actually only has `m` independent features. Note that the argument we pass to `vander` is the number of columns of the matrix, which is one more than the order of the polynomial. A simple example of a Vandermonde matrix of order 3 is shown below.
%% Cell type:code id: tags:
``` python
z = [0,1,2,3]
Z = np.vander(z, 4)
print(Z)
```
%%%% Output: stream
[[ 0 0 0 1]
[ 1 1 1 1]
[ 8 4 2 1]
[27 9 3 1]]
%% Cell type:markdown id: tags:
Here our order of the polynomial is `reg_ord=4`, i.e., a quartic.
The Vandermonde matrix can now be passed to our linear regression model to perform multiple linear regression. Last semester we did this to perform linear regression in multiple dimensions. Here we are doing a similar thing, but performing in multiple independent functions, where the functions are `x**m`, `x**(m-1)`, ..., `x`.
Once we have the coefficients of the model, we can use the model in the same way to give predicted values for our testing set, and compare them against the actual target values for the training set.
Note that we could also perform polynomial regression using the sklearn routines `PolynomialFeatures` and `make_pipeline`. This is a bit more compact, but the process used here provides a better understanding of polynomial regression.
Here we see that $R^2$ is very close to 1, which shows that a quartic is a very good model for this data.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
reg_ord = 4;
model = LinearRegression(fit_intercept=True)
model.fit(np.vander(x,reg_ord+1),y)
yfit = model.predict(np.vander(xfit,reg_ord+1))
y_pred = model.predict(np.vander(x_test,reg_ord+1))
rsquared_quartic = r2_score(y_test,y_pred)
plt.scatter(x,y)
plt.plot(xfit,yfit)
plt.plot(xfit,true_fun(xfit))
plt.title("Quartic Regression Rsq=%f" % (rsquared_quartic));
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
We can view the coefficients and intercept of the model. We don't show the last coefficient as this is 0.
%% Cell type:code id: tags:
``` python
print("Model coefficients: ",model.coef_[0:reg_ord])
print("Model intercept: ",model.intercept_)
quadcoefs = np.log10(abs(model.coef_[0:reg_ord]))
```
%%%% Output: stream
Model coefficients: [-0.00072192 0.02470839 -0.1935915 0.10302281]
Model intercept: 0.9851448822164288
%% Cell type:markdown id: tags:
We can repeat the process for a fifteenth order polynomial by only changing `reg_ord`. In this case the correlation coefficient decreases significantly, due the overfitting by the model.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
reg_ord = 15;
model = LinearRegression(fit_intercept=True)
model.fit(np.vander(x,reg_ord+1),y)
yfit = model.predict(np.vander(xfit,reg_ord+1))
y_pred = model.predict(np.vander(x_test,reg_ord+1))
rsquared = r2_score(y_test,y_pred)
plt.scatter(x,y)
plt.plot(xfit,yfit)
plt.plot(xfit,true_fun(xfit))
plt.title("Fifteenth Order Regression Rsq=%f" % (rsquared));
```
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
print("Model coefficients: ",model.coef_[0:reg_ord])
print("Model intercept: ",model.intercept_)
```
%%%% Output: stream
Model coefficients: [-4.14069383e-09 2.67243929e-07 -7.59025023e-06 1.24484879e-04
-1.29856424e-03 8.92203106e-03 -4.01965591e-02 1.12765767e-01
-1.66680480e-01 3.98261521e-02 1.73205727e-01 -7.16871971e-03
-2.30528064e-01 -2.13933576e-01 2.26906291e-01]
Model intercept: 0.9905320987616388
%% Cell type:code id: tags:
``` python
lincoefs = np.log10(abs(model.coef_[0:reg_ord]));
```
%% Cell type:markdown id: tags:
One manifestation of overfitting for polynomial regression is that the coefficients of the polynomial do not decrease in magnitude sufficiently rapidly. For the fifteenth order polynomial above recall that the largest value of `x` is 10, so the magnitude of the largest term in the polynomial is $10^{-9}\times 10^{15} = 10^6$. Regularization aims to reduce the magnitude of these coefficents by modifying the function that is minimized.
Ridge regularization adds the term $\alpha(\theta_1^2+\theta_2^2+...+\theta_m^2)$, where $\theta_1, \theta_2, \dots, \theta_m$ are the coefficients of the polynomial, to the sum of the square of the errors. This penalty can be solved for in a similar time as for linear regression.
As can be seen this improves the fit of the polynomial.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
reg_ord = 15;
model = Ridge(alpha=2.,fit_intercept=True)
model.fit(np.vander(x,reg_ord+1),y)
yfit = model.predict(np.vander(xfit,reg_ord+1))
y_pred = model.predict(np.vander(x_test,reg_ord+1))
rsquared = r2_score(y_test,y_pred)
plt.scatter(x,y)
plt.plot(xfit,yfit)
plt.plot(xfit,true_fun(xfit))
plt.title("Fifteenth Order Ridge Regression Rsq=%f" % (rsquared));
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Lasso regularization adds the term $\alpha(\left|\theta_1\right|+\left|\theta_2\right|+...+\left|\theta_m\right|)$, to the sum of the square of the errors. This penalty has a more marked effect on the coefficients than Ridge regularization, but takes longer to solve for.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
reg_ord = 15;
model = Lasso(alpha=.005,fit_intercept=True)
model.fit(np.vander(x,reg_ord+1),y)
yfit = model.predict(np.vander(xfit,reg_ord+1))
y_pred = model.predict(np.vander(x_test,reg_ord+1))
rsquared = r2_score(y_test,y_pred)
plt.scatter(x,y)
plt.plot(xfit,yfit)
plt.plot(xfit,true_fun(xfit))
plt.title("Fifteenth Order Lasso Regression Rsq=%f" % (rsquared));
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
To find the optimal value of alpha we can use Cross Validation, where a range of test alpha values are passed to the regularization routine.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
model = RidgeCV(alphas=np.linspace(.5,2.5,9),fit_intercept=True)
reg_ord = 15;
model.fit(np.vander(x,reg_ord+1),y)
print('Optimal alpha for Ridge regression is:',model.alpha_)
model = Ridge(alpha=model.alpha_,fit_intercept=True)
model.fit(np.vander(x,reg_ord+1),y)
y_pred = model.predict(np.vander(x_test,reg_ord+1))
rsquared_quartic = round(r2_score(y_test,y_pred),3)
print('Rsq is:',rsquared_quartic)
```
%%%% Output: stream
Optimal alpha for Ridge regression is: 1.75
Rsq is: 0.949
%% Cell type:markdown id: tags:
As expected, the magnitude of the higher order coefficients is reduced for Ridge regularization.
%% Cell type:code id: tags:
``` python
print("Model coefficients: ",model.coef_[0:reg_ord])
print("Model intercept: ",model.intercept_)
```
%%%% Output: stream
Model coefficients: [-4.71079149e-11 1.55095947e-09 -7.49672969e-09 -2.77564369e-07
3.56367578e-06 8.16007427e-06 -4.35554548e-04 3.45642495e-03
-1.18739354e-02 1.43194832e-02 1.13435881e-02 -1.79591125e-02
-4.13556658e-02 -4.48295802e-02 -2.85268236e-02]
Model intercept: 1.0118880446769136
%% Cell type:code id: tags:
``` python
ridgecoefs = np.log10(abs(model.coef_[0:reg_ord]));
```
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)
model = LassoCV(alphas=np.linspace(.005,.025,9),fit_intercept=True)
reg_ord = 15;
model.fit(np.vander(x,reg_ord+1),y)
print('Optimal alpha for Lasso regression is:',model.alpha_)
model = Lasso(alpha=model.alpha_,fit_intercept=True)
model.fit(np.vander(x,reg_ord+1),y)
y_pred = model.predict(np.vander(x_test,reg_ord+1))
rsquared_quartic = round(r2_score(y_test,y_pred),3)
print('Rsq is:',rsquared_quartic)
```
%%%% Output: stream
Optimal alpha for Lasso regression is: 0.0175
Rsq is: 0.971
%% Cell type:markdown id: tags:
And for Lasso regularization, but notice the higher order coefficients are now significantly smaller.
%% Cell type:code id: tags:
``` python
print("Model coefficients: ",model.coef_[0:reg_ord])
print("Model intercept: ",model.intercept_)
```
%%%% Output: stream
Model coefficients: [ 7.99121170e-15 -4.01741495e-14 -3.52693457e-13 -2.88757259e-12
-2.09003790e-11 -1.13669877e-10 -5.03997684e-11 1.11104975e-08
2.23063594e-07 3.09190722e-06 3.31485403e-05 2.25859167e-04
-6.98648239e-04 -5.83604448e-02 -8.49169967e-02]
Model intercept: 0.9964539826394814
%% Cell type:code id: tags:
``` python
lassocoefs = np.log10(abs(model.coef_[0:reg_ord]));
plt.scatter(np.arange(1,5),quadcoefs[::-1])
plt.scatter(np.arange(1,16),lassocoefs[::-1])
plt.scatter(np.arange(1,16),ridgecoefs[::-1])
plt.scatter(np.arange(1,16),lincoefs[::-1])
```
%%%% Output: execute_result
<matplotlib.collections.PathCollection at 0x1a1e06d690>
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
```
......
This diff is collapsed.
Date,Total doses,Doses per 100 people
21/02/2021,20,0
22/02/2021,1836,0
23/02/2021,2787,0.01
24/02/2021,6926,0.03
25/02/2021,16622,0.06
26/02/2021,23489,0.09
27/02/2021,27627,0.11
28/02/2021,29571,0.12
01/03/2021,33702,0.13
02/03/2021,41922,0.16
03/03/2021,48565,0.19
04/03/2021,61008,0.24
05/03/2021,71000,0.28
06/03/2021,76073,0.3
07/03/2021,81000,0.32
08/03/2021,86369,0.34
09/03/2021,94908,0.37
10/03/2021,106200,0.41
11/03/2021,125000,0.49
12/03/2021,135109,0.53
13/03/2021,159638,0.62
14/03/2021,162551,0.63
15/03/2021,164781,0.64
16/03/2021,178546,0.7
17/03/2021,203557,0.79
18/03/2021,226057,0.88
19/03/2021,240754,0.94
20/03/2021,253831,0.99
21/03/2021,256782,1
22/03/2021,281950,1.1
23/03/2021,312225,1.22
24/03/2021,358500,1.4
25/03/2021,408410,1.59
26/03/2021,460155,1.79
27/03/2021,474128,1.85
28/03/2021,507000,1.97
29/03/2021,541761,2.11
30/03/2021,597525,2.33
31/03/2021,670351,2.61
01/04/2021,744330,2.9
02/04/2021,823613,3.21
03/04/2021,828059,3.22
04/04/2021,843182,3.28
05/04/2021,844379,3.29
06/04/2021,855294,3.33
07/04/2021,920645,3.58
08/04/2021,996456,3.88
09/04/2021,1078004,4.2
10/04/2021,1138866,4.43
11/04/2021,1166583,4.54
12/04/2021,1178302,4.59
13/04/2021,1234681,4.81
14/04/2021,1295672,5.04
15/04/2021,1359305,5.29
16/04/2021,1420577,5.53
17/04/2021,1474537,5.74
18/04/2021,1496608,5.83
19/04/2021,1586252,6.18
20/04/2021,1653286,6.44
21/04/2021,1718107,6.69
22/04/2021,1785698,6.95
23/04/2021,1855601,7.22
24/04/2021,1914047,7.45
25/04/2021,1934087,7.53
26/04/2021,1937485,7.54
27/04/2021,1969337,7.67
28/04/2021,2029544,7.9
29/04/2021,2112285,8.22
30/04/2021,2179544,8.49
01/05/2021,2234844,8.7
02/05/2021,2254074,8.78
03/05/2021,2260615,8.8
04/05/2021,2316969,9.02
05/05/2021,2396314,9.33
06/05/2021,2473529,9.63
07/05/2021,2554531,9.94
08/05/2021,2627725,10.23
09/05/2021,2654338,10.33
10/05/2021,2663221,10.37
11/05/2021,2732249,10.64
\ No newline at end of file
This diff is collapsed.