Commit cfbda0e9 authored by Simon Bowly

update regularisation

parent 3e49c1fb
%% Cell type:markdown id: tags:
# Regularization in Linear Regression
# Regularisation in Linear Regression
In this lesson we will investigate the effect of various features on the tendency of patients to develop diabetes, using the `sklearn` estimators `LinearRegression`, `Ridge` and `Lasso`. We will look at how coefficients can vary significantly due to overfitting, and how that can be alleviated using regularization.
In this notebook we will explore the technique of **regularisation** in linear regression models. We have seen how coefficients can vary significantly due to overfitting; regularisation is a method that can alleviate this problem.
First import the Diabetes dataset. The same dataset can be imported from the `sklearn` example datasets, but is already normalized. We will use the unnormalized dataset initially, to show why data needs to be normalised before modelling.
We will use the Diabetes dataset to explore these methods, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb). Please review the content there first.
%% Cell type:code id: tags:
``` python
import pandas as pd
......@@ -18,13 +18,13 @@
import seaborn as sns
```
%% Cell type:markdown id: tags:
## Recall
## Recall: Multivariate Linear Regression
Read data, check correlations, construct normalised regression model, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb).
Following on from previous work on multilinear regression, here we read in the data, check correlations, construct normalised features, and fit a regression model.
%% Cell type:code id: tags:
``` python
df = pd.read_csv('Diabetes_Data.csv') # read the Diabetes dataset in to a pandas dataframe
......@@ -71,34 +71,11 @@
print("Testing score is",rsquared_linear)
```
%% Cell type:markdown id: tags:
Again we can plot the linear regression coefficients, but this time we compare them against our original linear regression coefficients to investigate the variability. The coefficients of the blood serum measurements show considerable variability.
%% Cell type:code id: tags:
``` python
# create a new dataframe with the regression coefficients from the normalised data
ncoefs = pd.DataFrame(linear.coef_.transpose(),columns=['Normalised'],index=feature_names)
# add our original coefficient importance to this dataframe
# ncoefs = pd.concat([ncoefs,coefs],axis=1)
# ncoefs.columns =['Normalised','Original'] # change the column names to show the new and original coefficients
# do a similar horizontal plot as before
ax = ncoefs.plot(kind='bar',figsize=(10,7))
plt.title('Linear Regression')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
To investigate the variability we can use the `sklearn` utilities `cross_validate` and `RepeatedKFold`. `cross_validate` fits and scores a model once for each train/test split it is given. `RepeatedKFold` splits the data into n folds and repeats that split m times with different randomisation. Together this gives n × m runs from which to investigate the variability of the coefficients, which can then be plotted using a boxplot.
With this initial model fitted, we investigate coefficient variability using the `sklearn` utilities `cross_validate` and `RepeatedKFold`. `cross_validate` fits and scores a model once for each train/test split it is given. `RepeatedKFold` splits the data into n folds and repeats that split m times with different randomisation. Together this gives n × m runs from which to investigate the variability of the coefficients, which can then be plotted using a boxplot.
We can see that `AGE`, `SEX`, `BP`, `BMI` and `S6` have very low variance, whereas `S1`-`S5` have high variance due to overfitting. Of the low-variance features, `AGE` and `S6` seem to have little effect on the predictions.
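The repeated cross-validation described above can be sketched as follows. This is a minimal illustration rather than the notebook's own cell: it assumes synthetic data from `make_regression` in place of the Diabetes dataset, and the names `coef_runs`, `n_splits` and `n_repeats` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in for the Diabetes data (an assumption for illustration only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

n_splits, n_repeats = 5, 3  # n folds, repeated m times -> n * m fitted models
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

# return_estimator=True keeps each fitted model so its coefficients can be inspected
results = cross_validate(LinearRegression(), X, y, cv=cv, return_estimator=True)

# One row per run, one column per feature
coef_runs = pd.DataFrame(
    [est.coef_ for est in results["estimator"]],
    columns=[f"x{i}" for i in range(X.shape[1])],
)
print(coef_runs.shape)  # number of runs by number of features
print(coef_runs.std())  # spread of each coefficient across the runs
```

Calling `coef_runs.plot(kind="box")` on the resulting dataframe would give one box per feature, which is one way to produce the boxplot mentioned above.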
%% Cell type:code id: tags:
......@@ -121,19 +98,26 @@
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
### Regularization
## Regularization
We now investigate regularization techniques for Linear Regression, to reduce the variance of the model. How does this work?
* Adds a penalty term to the cost function, which penalises large coefficient values for all features (todo add math model).
* Hence, if the model fitting algorithm can reduce the model coefficient sizes while achieving similar error rates, it will. This results in a more stable model.
* It's important to normalise the features first, so that the penalty acts uniformly across all coefficients.
* There are two main types: Ridge and Lasso.
* To achieve a good result, we need to experiment with the model parameter *alpha* which controls the balance between the error measure and the penalty term in the cost function.
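As a sketch of the penalised cost functions described in the list above, and assuming the objectives given in the scikit-learn documentation for `Ridge` and `Lasso` (the symbols $X$, $y$, $w$, $n$ and $\alpha$ are notation introduced here, not taken from the notebook), the two models minimise:

$$
J_{\text{ridge}}(w) = \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
J_{\text{lasso}}(w) = \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Here $X$ is the matrix of (normalised) features, $y$ the target vector, $w$ the vector of coefficients and $n$ the number of samples. The first term in each expression is the error measure and the second is the penalty. With $\alpha = 0$ both reduce to ordinary least squares, and larger values of $\alpha$ put more weight on keeping the coefficients small.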
We now investigate regularization techniques for Linear Regression, to reduce the variance of the model.
#### Ridge regularization
### Ridge regularization
To use Ridge regularization (which adds a penalty term proportional to the sum of the squares of the coefficients), we need to find the optimal value of the tuning parameter alpha. Generally this would be done using `RidgeCV`; here, however, we will compare the training and testing scores graphically. What we want to determine is the value of alpha for which the testing score is at its maximum. To generate the figure we create an array of alpha values, which in this case are logarithmically spaced, fit a Ridge model for each, and store the testing and training scores ($R^2$). Then we plot these against alpha. From the figure we see that the optimal value occurs at alpha approximately 20.
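The alpha sweep described above might look roughly like the sketch below. It is not the notebook's own cell: it assumes synthetic data from `make_regression` instead of the Diabetes dataset, the alpha range and the names `alphas`, `train_scores`, `test_scores` and `best_alpha` are illustrative choices, and the plotting step is omitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Diabetes data (an assumption for illustration only)
X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Normalise using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Logarithmically spaced candidate values of alpha
alphas = np.logspace(-2, 3, 30)
train_scores, test_scores = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))  # R^2 on training data
    test_scores.append(model.score(X_test, y_test))     # R^2 on testing data

# The alpha with the highest testing score is the one the figure is read for
best_alpha = alphas[int(np.argmax(test_scores))]
print(f"alpha with highest test R^2: {best_alpha:.3g}")
```

Plotting `train_scores` and `test_scores` against `alphas` on a logarithmic x-axis would reproduce the kind of figure described above; the best alpha found here applies only to the synthetic data and says nothing about the value of approximately 20 reported for the Diabetes dataset.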
%% Cell type:code id: tags:
......@@ -163,11 +147,11 @@
plt.legend(loc='best');
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
We can now investigate Ridge regularization using the optimal value of alpha, approximately 20. We see significantly reduced variance in the coefficients, and the most important variables are `BMI`, `BP` and `S5`.
......@@ -190,15 +174,15 @@
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
#### Lasso regularization
### Lasso regularization
We can repeat the same process for Lasso regularization, which adds a penalty term proportional to the sum of the absolute values of the coefficients. Here the optimal value of alpha is approximately 2.
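One consequence of the absolute-value penalty is that Lasso can set some coefficients exactly to zero. The sketch below illustrates this under stated assumptions: it uses synthetic data from `make_regression` rather than the Diabetes dataset, and the alpha values and the name `zero_counts` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: only 3 of the 10 features actually influence the target
X, y = make_regression(n_samples=150, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # normalise so the penalty acts uniformly

# Count how many coefficients are exactly zero for a few values of alpha
zero_counts = {}
for alpha in (0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0.0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of {X.shape[1]} coefficients are zero")
```

Ridge, by contrast, shrinks coefficients towards zero without generally making them exactly zero, which is why Lasso is often used when some features are suspected to be irrelevant.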
%% Cell type:code id: tags:
......@@ -226,11 +210,11 @@
plt.legend(loc='best');
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
Again, it can be seen that the most significant variables are `BMI`, `BP` and `S5`. In this case the coefficients for `AGE`, `S2` and `S4` are zero or close to zero.
......@@ -253,11 +237,11 @@
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
### Multicollinearity
......