Unverified Commit 81028126 authored by Simon Bowly

regularisation final

parent cfbda0e9
%% Cell type:markdown id: tags:
# Regularisation in Linear Regression
In this notebook we will explore the technique of **regularisation** in linear regression models. We have seen how coefficients can vary significantly due to overfitting; regularisation is a method which can alleviate this problem.
In this notebook we will explore the technique of **regularisation** in linear regression models. We have seen how coefficients can vary significantly across different subsets of data due to overfitting; regularisation is a method which can alleviate this problem, leading to more consistent results.
We will use the Diabetes dataset to explore these methods, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb). Please review the content there first.
We will use the Diabetes dataset to explore these methods, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb). Please review the content there if this is not familiar.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split # for splitting the data into training and testing sets
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV # models we are going to use
from sklearn.linear_model import LinearRegression, Lasso, Ridge # models we are going to use
from sklearn.model_selection import cross_validate, RepeatedKFold
from sklearn.metrics import r2_score # for comparing the predicted and test values
import seaborn as sns
```
%% Cell type:markdown id: tags:
## Recall: Multivariate Linear Regression
Following on from previous work on multilinear regression, here we read in the data, check correlations, construct normalised features, and fit a regression model.
Following on from previous work on multilinear regression, we read in the data, check correlations, construct normalised features, and fit a regression model.
%% Cell type:code id: tags:
``` python
df = pd.read_csv('Diabetes_Data.csv') # read the Diabetes dataset in to a pandas dataframe
......@@ -78,12 +79,10 @@
We can see `AGE`, `SEX`, `BP`, `BMI` and `S6` have very low variance, whereas `S1`-`S5` have high variance due to overfitting. Of the low variance features, `AGE` and `S6` seem to have little effect on the predictions.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_validate, RepeatedKFold # import sklearn methods
rng = np.random.RandomState(1) # make sure the results are repeatable
# cross_validate takes the particular model, in this case the linear regression which we instantiated earlier,
# and undertakes a number of runs according to the method specified by cv=
# RepeatedKFold splits the data into n sections and repeats the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitting data for each run
......@@ -98,24 +97,38 @@
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
## Regularization
## Regularisation
What we observe in the results above is significant *variance* in the fitted model. The term *variance* has dual meanings here:
1. The model coefficients vary significantly when fitted to different subsets of the data (this is what we are testing using RepeatedKFold), and
2. The error rates vary significantly between training and testing (note the $R^2$ values above are 0.56 for training and 0.45 for testing).
We'll now investigate regularisation techniques for Linear Regression, to reduce the variance of the model. In general, a regularisation method adds a term to the cost function used by a model-fitting algorithm which penalises large model coefficient values. Recall that a linear regression model attempts to minimise the cost function:
$$
\sum_{i \in N} \left( y^{\text{predicted}}_i - y^{\text{actual}}_i \right)^2
$$
The model can be fitted by gradient descent or, as scikit-learn's `LinearRegression` does, by solving the least-squares problem directly. Either way, minimising this sum of squared errors also minimises the root mean squared error (RMSE) on the training data. A regularised model adjusts this cost function to:
$$
\sum_{i \in N} \left( y^{\text{predicted}}_i - y^{\text{actual}}_i \right)^2 + \alpha \sum_{k \in K} \theta_k^2
$$
We now investigate regularization techniques for Linear Regression, to reduce the variance of the model. How does this work?
where $\theta_k$ are the values of the model coefficients.
* Adds a penalty term to the cost function, which penalises large coefficient values for all features (todo add math model).
* Hence, if the model fitting algorithm can reduce the model coefficient sizes while achieving similar error rates, it will. This results in a more stable model.
* It's important to regularise coefficients so that this effect is uniform across all coefficients.
* There are two main types: Ridge and Lasso.
* To achieve a good result, we need to experiment with the model parameter *alpha* which controls the balance between the error measure and the penalty term in the cost function.
This new cost function represents two objectives: minimising the error on the training data, and keeping the model coefficients small, which in turn reduces the variance of the model. The weighting between these two objectives is controlled by the parameter alpha ($\alpha$). We have to choose this parameter ourselves; if it is chosen well, the fitting algorithm should reduce the size of the model coefficients while maintaining a similar error rate.
There are two main types of regularisation: Ridge and Lasso. We'll compare both below. To achieve a good result, we'll need to experiment with the model parameter *alpha* which controls the balance between the error measure and the penalty term in the cost function. Also, note that it is important to normalise the feature data so that the effect of regularisation is applied uniformly across all coefficients in the model.
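The penalised cost function can be sketched directly in NumPy. This is a minimal illustration on synthetic data rather than the notebook's Diabetes data: the helper names `ridge_cost` and `fit_ridge` are hypothetical, and the closed-form solution assumes centred data with no intercept term.

```python
import numpy as np

def ridge_cost(theta, X, y, alpha):
    """Sum of squared errors plus alpha times the sum of squared coefficients."""
    residuals = X @ theta - y
    return np.sum(residuals ** 2) + alpha * np.sum(theta ** 2)

def fit_ridge(X, y, alpha):
    """Closed-form minimiser of ridge_cost (hypothetical helper, no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic, normalised data (an assumption; not the Diabetes dataset)
rng = np.random.RandomState(1)
X = rng.normal(size=(50, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)
y = y - y.mean()

# Larger alpha puts more weight on the penalty, so the coefficients shrink
alphas = [0.0, 1.0, 10.0, 100.0]
coef_norms = [np.linalg.norm(fit_ridge(X, y, a)) for a in alphas]
for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:6.1f}  coefficient norm={norm:.3f}")
```

With `alpha=0` the fit reduces to ordinary least squares; as alpha grows, the size of the coefficient vector decreases.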
### Ridge regularization
To use Ridge regularization (which adds a penalty term proportional to the sum of the squares of the coefficients), we need to find the optimal value of the tuning parameter alpha. Generally this would be done using `RidgeCV`; here, however, we compare the training and testing scores graphically. We want the value of alpha that gives the maximum testing score. To generate the figure, we create an array of logarithmically spaced alpha values, fit a Ridge model for each, and store the training and testing scores ($R^2$). We then plot these scores against alpha. From the figure, the optimum occurs at an alpha of approximately 20.
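For comparison, the automated route mentioned here might look like the following sketch. It assumes scikit-learn's bundled `load_diabetes` data as a stand-in for `Diabetes_Data.csv`, and the alpha grid, split size and random seed are illustrative choices, so the selected alpha need not match the value read from the figure.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled diabetes dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Normalise the features using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Logarithmically spaced candidate values of alpha (illustrative grid)
alphas = np.logspace(-2, 3, 50)

# RidgeCV picks the alpha with the best 5-fold cross-validated score
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train_s, y_train)

print("selected alpha:", ridge_cv.alpha_)
print("test R^2:", ridge_cv.score(X_test_s, y_test))
```

Note that `RidgeCV` selects alpha by cross-validation within the training data, whereas the graphical approach compares scores on the held-out test set, so the two can disagree.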
......@@ -147,11 +160,11 @@
plt.legend(loc='best');
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
We can now investigate Ridge regularization using the optimal value of alpha, approximately 20. We see significantly reduced variance in the coefficients, and that the most important variables are `BMI`, `BP` and `S5`.
......@@ -174,17 +187,17 @@
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
### Lasso regularization
We can repeat the same process for Lasso regularization, which adds a penalty term which is proportional to the sum of the absolute values of the coefficients. Here the optimal value is alpha approximately 2.
We can repeat the same process for Lasso regularization, which adds a penalty term which is proportional to the sum of the absolute values of the coefficients. Here the optimal value for alpha is approximately 2.
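The practical difference between the two penalties can be seen in a small sketch on synthetic data (an assumption, not the Diabetes data), in which only the first two of six features influence the target. Note that scikit-learn's `Lasso` divides the squared-error term by twice the number of samples, so its alpha is not on the same scale as Ridge's; the values of alpha here are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: six normalised features, only the first two matter
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Absolute-value penalty (Lasso) versus squared penalty (Ridge)
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("exact zeros:", n_zero_lasso, "(Lasso) vs", n_zero_ridge, "(Ridge)")
```

Lasso can set coefficients exactly to zero, which is why it is often used as a feature selection aid, whereas Ridge only shrinks them towards zero.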
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1) # make sure the results are repeatable
......@@ -210,11 +223,11 @@
plt.legend(loc='best');
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
Again, it can be seen that the most significant variables are `BMI`, `BP` and `S5`. In this case the coefficients for `AGE`, `S2` and `S4` are zero or close to zero.
......@@ -237,27 +250,15 @@
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
![]()
%% Cell type:markdown id: tags:
### Multicollinearity
You will notice that several of the regression coefficient distributions differ significantly between the original linear model and the models regularised using Lasso or Ridge regression. The key examples here are `S1` to `S4`; in particular, observe that `S3` has gone from a positive effect to a negative effect, while `S2` and `S4` have little to no effect.
We can look back at the correlation table to explain why this change occurs: it is due to multicollinearity. The correlation table shows that some of our input variables (`S1` to `S4`) have relatively strong correlations with one another. When we fit a linear model to input data containing two correlated variables A and B, we can produce multiple models with equivalent error by increasing the coefficient of A, while decreasing the coefficient of B to counteract the effect. Introducing regularisation tends to reduce this effect by introducing a penalty for the training algorithm's tendency to increase the coefficient values. The results of these regularised models indicate that we could potentially remove `S2` and `S4` from the model; they are strongly correlated with `S1` and `S3` respectively so do not contribute useful information.
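The effect of two correlated inputs can be reproduced in a small sketch. The data are synthetic (an assumption): `b` is nearly a copy of `a`, and the hypothetical helper `coefficient_spread` refits a model on repeated K-fold subsets, in the same spirit as the `RepeatedKFold` experiment earlier in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate, RepeatedKFold

# Two strongly correlated synthetic inputs A and B
rng = np.random.RandomState(1)
n = 100
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)  # b is almost a copy of a
X = np.column_stack([a, b])
y = a + b + rng.normal(scale=1.0, size=n)

def coefficient_spread(model):
    """Standard deviation of each coefficient across 25 refits (hypothetical helper)."""
    cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)
    result = cross_validate(model, X, y, cv=cv, return_estimator=True)
    coefs = np.array([est.coef_ for est in result["estimator"]])
    return coefs.std(axis=0)

ols_spread = coefficient_spread(LinearRegression())
ridge_spread = coefficient_spread(Ridge(alpha=1.0))

print("OLS coefficient spread:  ", np.round(ols_spread, 3))
print("Ridge coefficient spread:", np.round(ridge_spread, 3))
```

Without regularisation, the two coefficients can trade off against each other from one subset to the next; the penalty discourages this, so the Ridge coefficients vary far less.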
%% Cell type:markdown id: tags:
### Feature Selection
These results indicate that we could remove some unimportant/confounding features from the model in order to reduce variance. We have identified here that BMI, BP and S5 are the features with the highest importance. It may be worth investigating whether we can achieve good predictions on the test set by using a model which only uses these features.
Interestingly, this does align with what we saw in the correlation coefficients - BMI, BP and S5 do have the strongest correlations with Y.
Including 'confounding' features in a model can be problematic. They may result in overfitting to the training data, and lead us to observe an effect that is actually not there. They can also inflate the importance of other variables in the case of multicollinearity.
Thus, it is important to conduct some trials of models with different subsets of features. Correlation tables work well as a guide here, but ultimately we need to assess the effect of model changes on the error measure for the **test** set.
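One such trial might be sketched as follows. It assumes scikit-learn's bundled `load_diabetes` data as a stand-in for `Diabetes_Data.csv` (its column names are lower case), and the helper `holdout_r2`, the split size and the random seed are illustrative choices, so the scores will not necessarily match those reported in this notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): scikit-learn's bundled diabetes dataset
data = load_diabetes()
feature_names = list(data.feature_names)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

def holdout_r2(columns):
    """Fit on the chosen columns only and score on the test set (hypothetical helper)."""
    idx = [feature_names.index(name) for name in columns]
    model = LinearRegression().fit(X_train[:, idx], y_train)
    return r2_score(y_test, model.predict(X_test[:, idx]))

r2_all = holdout_r2(feature_names)
r2_subset = holdout_r2(["bmi", "bp", "s5"])

print(f"all features:  test R^2 = {r2_all:.3f}")
print(f"bmi, bp, s5:   test R^2 = {r2_subset:.3f}")
```

A single train/test split gives a noisy comparison; repeating the trial over several splits, as with `RepeatedKFold` earlier, would give a more reliable picture.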
......