Unverified Commit 3e49c1fb authored by Simon Bowly's avatar Simon Bowly
Browse files

Regularization draft

parent b11b975b
%% Cell type:markdown id: tags:
# Regularization in Linear Regression
In this lesson we will investigate the effect of various features on the tendency of patients to develop diabetes, using the `sklearn` methods `LinearRegression` and `Ridge` and `Lasso` regularization. We will look at how coefficients can vary signficantly due to overfitting, and how that can be alleviated using regularization.
First import the Diabetes dataset. The same dataset can be imported from the `sklearn` example datasets, but is already normalized. We will use the unnormalized dataset initially, to show why data needs to be normalised before modelling.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split # for splitting the data into training and testing sets
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV # models we are going to use
from sklearn.metrics import r2_score # for comparing the predicted and test values
import seaborn as sns
```
%% Cell type:markdown id: tags:
## Recall
Read data, check correlations, construct normalised regression model, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb).
%% Cell type:code id: tags:
``` python
df = pd.read_csv('Diabetes_Data.csv') # read the Diabetes dataset in to a pandas dataframe
corrs = df.corr() # calculate the correlation table
# as this is a symmetric table, set up a mask so that we only plot values below the main diagonal
mask = np.triu(np.ones_like(corrs, dtype=bool))
f, ax = plt.subplots(figsize=(10, 8)) # initialise the plots and axes
# plot the correlations as a seaborn heatmap, with a colourbar
sns.heatmap(corrs, mask=mask, center=0, annot=True, square=True, linewidths=.5)
# do some fiddling so that the top and bottom are not obscured
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5);
```
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
X = df.drop(['Y'],axis=1) # drop Y from our dataframe
Y = df['Y'] # create a dataframe with just the Y values
nX =(X-X.mean())/X.std() # create nX, a normalised version of X
feature_names = X.columns
```
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1) # make sure the results are repeatable
# split into a training set with 80% of the data, and a testing set as the remainder
X_train, X_test, Y_train, Y_test = train_test_split(nX,Y,test_size=0.8)
linear = LinearRegression() # instantatiate the linear regression model
linear.fit(X_train,Y_train) # fit the data to the model
training_score = linear.score(X_train,Y_train) # calculate rsq for the training set
# use the independent variables for the testing set to predict the target variable
preds_linear = linear.predict(X_test)
# calculate the correlation of the predicted and actual target variables
rsquared_linear = r2_score(Y_test,preds_linear)
# print the training and testing scores
print("Training score is",training_score)
print("Testing score is",rsquared_linear)
```
%% Cell type:markdown id: tags:
Again we can plot the linear regression coefficients, but this time we compare them against our original linear regression coefficients to investigate the variability. It can now be seen that the effect of the blood serum measurements can have considerable variability.
%% Cell type:code id: tags:
``` python
# create a new dataframe with the regression coefficients from the normalised data
ncoefs = pd.DataFrame(linear.coef_.transpose(),columns=['Normalised'],index=feature_names)
# add our original coefficient importance to this dataframe
# ncoefs = pd.concat([ncoefs,coefs],axis=1)
# ncoefs.columns =['Normalised','Original'] # change the column names to show the new and original coefficients
# do a similar horizontal plot as before
ax = ncoefs.plot(kind='bar',figsize=(10,7))
plt.title('Linear Regression')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
To investigate the variability we can use the `sklearn` methods `cross_validate` and `RepeatedKFold`. The first of these performs a number of runs of a model. The second splits the data in n sections and repeats the calculations m times. This gives n.m runs to investigate the variability of the coefficients. The variability of these can then be plotted using a boxplot.
We can see `AGE`, `SEX`, `BP`, `BMI` and `S6` have very low variance, whereas `S1`-`S5` have high variance due to overfitting. Of the low variance features, `AGE` and `S6` seem to have little effect on the predictions.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_validate, RepeatedKFold # import sklearn methods
rng = np.random.RandomState(1) # make sure the results are repeatable
# cross_validate takes the particular model, in this case linear regression which we instantatiated earlier,
# and undertakes a number of runs according the method specified by cv=
# RepeatedKFold splits the data into n sections and repeat the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitting data for each run
scores = cross_validate(linear, nX, Y, cv=RepeatedKFold(n_splits=5, n_repeats=5), return_estimator=True)
# take the results for each simulation (estimator), extract the coefficients for each run
# and add them to a dataframe with columns being the feature names
coefs = pd.DataFrame([est.coef_ for est in scores['estimator']],columns=feature_names)
# plot the descriptive statics of the coefficients in a box and whisker plot to show variability
ax = coefs.plot(kind='box',figsize=(10,7))
plt.title('Linear Regression')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
### Regularization
We now investigate regularization techniques for Linear Regression, to reduce the variance of the model.
#### Ridge regularization
To use Ridge regularization (which adds a penalty term which is proportional to the sum of the squares of the coefficients), we need to find the optimal value of tuning parameter alpha. Generally this would be done using `RidgeCV`, however here we will graphically compare the training and testing scores. What we want to determine is the value of alpha for which we obtain the maximum value of the testing score. To generate the figure we create an array of alpha values, which in this case are logarithmically distributed, and perform a Ridge regularization for each, and store the testing and training scores ( $R^2$ ). Then we plot these against alpha. From the figure we see the optimal value occurs at alpha approximately 20.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1) # make sure the results are repeatable
# create an array of 21 alpha values logarithmically distributed between 10**(-2) and 10**2
alfas = np.logspace(-2,2,num=21)
# create two arrays for storage of the same size as alfas, but filled with zeros
ridge_training_score = np.zeros_like(alfas)
ridge_rsquared = np.zeros_like(alfas)
# loop over the values in the alfas array, at each loop the current value is alfa
# and idx is incremented by 1, starting at 0
for idx, alfa in enumerate(alfas):
ridge = Ridge(alpha=alfa) # instantatiate Ridge regularization with the current alfa
ridge.fit(X_train,Y_train) # train the model to our data set
# calculate the training score and store in the array ridge_training_score
ridge_training_score[idx] = ridge.score(X_train,Y_train)
preds_linear = ridge.predict(X_test) # calculate the model prediction for the test data
# calculate the correlation between the predicted and actual test data and store in the array ridge_rsquared
ridge_rsquared[idx] = r2_score(Y_test,preds_linear)
plt.scatter(alfas,ridge_training_score,label='Training score') # plot the training score against alpha
plt.scatter(alfas,ridge_rsquared,label='Testing score') # plot the testing score against alpha
plt.xscale('log') # make the x-axis a logarithmic scale
plt.gca().set_xlim(left=.01, right=100); # fix the x-axis limits
plt.title('Ridge regularization')
plt.xlabel('alpha');
plt.legend(loc='best');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Now can investigate Ridge regularization using the optimal value of alpha approximately 20. Now we see significantly reduced variance in the coefficient and that the most important variables are `BMI`, `BP` and `S5`.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1) # make sure the results are repeatable
# cross_validate takes the particular model, in this case ridge regularization
# and undertakes a number of runs according the method specified by cv=
# RepeatedKFold splits the data into n sections and repeat the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitting data for each run
scores = cross_validate(Ridge(alpha=20.), nX, Y, cv=RepeatedKFold(n_splits=5, n_repeats=5), return_estimator=True)
# take the results for each simulation (estimator), extract the coefficients for each run
# and add them to a dataframe with columns being the feature names
coefs = pd.DataFrame([est.coef_ for est in scores['estimator']],columns=feature_names)
# plot the descriptive statics of the coefficients in a box and whisker plot to show variability
ax = coefs.plot(kind='box',figsize=(10,7))
plt.title('Ridge regression, alpha = 20.')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
#### Lasso regularization
We can repeat the same process for Lasso regularization, which adds a penalty term which is proportional to the sum of the absolute values of the coefficients. Here the optimal value is alpha approximately 2.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1) # make sure the results are repeatable
# create two arrays for storage of the same size as alfas, but filled with zeros
lasso_training_score = np.zeros_like(alfas)
lasso_rsquared = np.zeros_like(alfas)
# loop over the values in the alfas array, at each loop the current value is alfa
# and idx is incremented by 1, starting at 0
for idx, alfa in enumerate(alfas):
lasso = Lasso(alpha=alfa) # instantatiate Lasso regularization with the current alfa
lasso.fit(X_train,Y_train) # train the model to our data set
# calculate the training score and store in the array lasso_training_score
lasso_training_score[idx] = lasso.score(X_train,Y_train)
preds_linear = lasso.predict(X_test) # calculate the model prediction for the test data
# calculate the correlation between the predicted and actual test data and store in the array lasso_rsquared
lasso_rsquared[idx] = r2_score(Y_test,preds_linear)
plt.scatter(alfas,lasso_training_score,label='Training score') # plot the training score against alpha
plt.scatter(alfas,lasso_rsquared,label='Testing score') # plot the testing score against alpha
plt.xscale('log') # make the x-axis a logarithmic scale
plt.gca().set_xlim(left=.01, right=100); # fix the x-axis limits
plt.title('Lasso regularization')
plt.xlabel('alpha');
plt.legend(loc='best');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Again, it can be seen that the most significant variables are `BMI`, `BP` and `S5`. In this case the coefficients for `AGE`, `S2` and `S4` are zero or close to zero.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1) # make sure the results are repeatable
# cross_validate takes the particular model, in this case lasso regularization
# and undertakes a number of runs according the method specified by cv=
# RepeatedKFold splits the data into n sections and repeat the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitting data for each run
scores = cross_validate(Lasso(alpha=2.), nX, Y, cv=RepeatedKFold(n_splits=5, n_repeats=5), return_estimator=True)
# take the results for each simulation (estimator), extract the coefficients for each run
# and add them to a dataframe with columns being the feature names
coefs = pd.DataFrame([est.coef_ for est in scores['estimator']],columns=feature_names)
# plot the descriptive statics of the coefficients in a box and whisker plot to show variability
ax = coefs.plot(kind='box',figsize=(10,7))
plt.title('Lasso regression, alpha = 2.')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
### Multicollinearity
You will notice that several of the regression coefficient distributions differ significantly between the original linear model and the models regularised using Lasso or Ridge regression. The key examples here are `S1` to `S4`; in particular observe that S3 has gone from a positive effect to a negative effect which S2 and S4 have little to no effect.
We can look back at the correlation table to explain why this change occurs: it is due to multicollinearity. The correlation table shows that some of our input variables (`S1` to `S4`) have relatively strong correlations with one another. When we fit a linear model to input data containing two correlated variables A and B, we can produce multiple models with equivalent error by increasing the coefficient of A, while decreasing the coefficient of B to counteract the effect. Introducing regularisation tends to reduce this effect by introducing a penalty for the training algorithm's tendency to increase the coefficient values. The results of these regularised models indicate that we could potentially remove `S2` and `S4` from the model; they are strongly correlated with `S1` and `S3` respectively so do not contribute useful information.
%% Cell type:markdown id: tags:
### Feature Selection
These results indicate that we could remove some unimportant/confounding features from the model in order to reduce variance. We have identified here that BMI, BP and S5 are the features with the highest importance. It may be worth investigating whether we can achieve good predictions on the test set by using a model which only uses these features.
Interestingly, this does align with what we saw in the correlation coefficients - BMI, BP and S5 do have the strongest correlations with Y.
Including 'confounding' features in a model can be problematic. They may result in overfitting to the training data, and lead us to observe an effect that is actually not there. They can also inflate the importance of other variables in the case of multi-collinearty.
This, it is important to conduct some trials of models with different subsets of features. Correlation tables work well as a guide here, but ultimately we need to assess the effect of model changes on the error measure for the **test** set.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment