In this notebook we will explore the technique of **regularisation** in linear regression models, investigating the effect of various features on the tendency of patients to develop diabetes using the `sklearn` methods `LinearRegression`, `Ridge` and `Lasso`. We have seen how coefficients can vary significantly due to overfitting; regularisation is a method which can alleviate this problem.
First, import the Diabetes dataset. The same dataset can be imported from the `sklearn` example datasets, but that version is already normalised. We will start from the unnormalised dataset to show why data needs to be normalised before modelling.
We will use the Diabetes dataset to explore these methods, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb). Please review the content there first.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate, RepeatedKFold  # for splitting the data and cross-validation
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV  # models we are going to use
from sklearn.metrics import r2_score  # for comparing the predicted and test values
import seaborn as sns
```
%% Cell type:markdown id: tags:
## Recall: Multivariate Linear Regression
Following on from our previous work on multivariate linear regression, here we read in the data, check the correlations, construct normalised features, and fit a regression model, as we did in the [multi-linear regression notebook](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/blob/main/Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb). Please review the content there first.
%% Cell type:code id: tags:
``` python
df = pd.read_csv('Diabetes_Data.csv')  # read the Diabetes dataset into a pandas dataframe
corrs = df.corr()  # calculate the correlation table
# as this is a symmetric table, set up a mask so that we only plot values below the main diagonal
mask = np.triu(np.ones_like(corrs, dtype=bool))
f, ax = plt.subplots(figsize=(10, 8))  # initialise the plots and axes
# plot the correlations as a seaborn heatmap, with a colourbar
sns.heatmap(corrs, mask=mask, annot=True, cmap='coolwarm', ax=ax)
plt.show()
# normalise the features (z-score) and separate the target column 'Y', then split into training and testing sets
X = (df.drop(columns='Y') - df.drop(columns='Y').mean()) / df.drop(columns='Y').std()
Y = df['Y']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1)
linear = LinearRegression()  # instantiate the linear regression model
linear.fit(X_train, Y_train)  # fit the model to the training data
training_score = linear.score(X_train, Y_train)  # calculate rsq for the training set
# use the independent variables for the testing set to predict the target variable
preds_linear = linear.predict(X_test)
# calculate the correlation of the predicted and actual target variables
rsquared_linear = r2_score(Y_test, preds_linear)
# print the training and testing scores
print("Training score is", training_score)
print("Testing score is", rsquared_linear)
```
%%%% Output: stream
Training score is 0.4905649691705233
Testing score is 0.4012836055759741
Training score is 0.5639053207650624
Testing score is 0.4390426658027451
%% Cell type:markdown id: tags:
Again we can plot the linear regression coefficients, this time comparing them against our original linear regression coefficients to investigate the variability. It can now be seen that the coefficients for the blood serum measurements show considerable variability.
%% Cell type:code id: tags:
``` python
# create a new dataframe with the regression coefficients from the normalised data
coefs = pd.DataFrame({'Coefficient': linear.coef_}, index=X_train.columns)
coefs.plot.bar(figsize=(10, 6))  # plot the coefficients
```

%% Cell type:markdown id: tags:
With this initial model fitted, we can investigate the variability of the coefficients using the `sklearn` methods `cross_validate` and `RepeatedKFold`. The first of these performs a number of runs of a model; the second splits the data into n sections and repeats the calculation m times, giving n × m runs over which to investigate the variability of the coefficients. The variability can then be plotted as a boxplot, for example as sketched below.
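A minimal sketch of this variability check for the unregularised linear model (using the normalised features `X` and target `Y` defined above):
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# 5 splits repeated 5 times gives 25 fitted models whose coefficients we can compare
cv_linear = cross_validate(LinearRegression(), X, Y, cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=rng), return_estimator=True)
# gather the coefficients from each fitted model and plot their spread as a boxplot
pd.DataFrame([est.coef_ for est in cv_linear['estimator']], columns=X.columns).boxplot(figsize=(10, 6))
plt.show()
```

%% Cell type:markdown id: tags: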
We can see `AGE`, `SEX`, `BP`, `BMI` and `S6` have very low variance, whereas `S1`-`S5` have high variance due to overfitting. Of the low variance features, `AGE` and `S6` seem to have little effect on the predictions.
We now investigate regularization techniques for Linear Regression, to reduce the variance of the model. How does this work?
* Regularisation adds a penalty term to the cost function, which penalises large coefficient values for all features (the cost functions are sketched after this list).
* Hence, if the model fitting algorithm can reduce the coefficient sizes while achieving a similar error, it will do so, resulting in a more stable model.
* It is important to normalise the features so that this penalty acts uniformly across all coefficients.
* There are two main types: Ridge and Lasso.
* To achieve a good result, we need to experiment with the model parameter *alpha*, which controls the balance between the error measure and the penalty term in the cost function.
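For illustration (ignoring the exact normalisation constants `sklearn` uses internally), the two regularised cost functions have the form

$$J_{\text{ridge}}(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2, \qquad J_{\text{lasso}}(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} \left|\beta_j\right|,$$

where $\hat{y}_i$ are the model predictions on the training data, $\beta_j$ are the coefficients, and $\alpha \ge 0$ sets the strength of the penalty ($\alpha = 0$ recovers ordinary least squares).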
### Ridge regularization
To use Ridge regularization (which adds a penalty term proportional to the sum of the squares of the coefficients), we need to find the optimal value of the tuning parameter alpha. Generally this would be done using `RidgeCV` (see the sketch below); here, however, we will graphically compare the training and testing scores, looking for the value of alpha which gives the maximum testing score. To generate the figure we create an array of alpha values, logarithmically distributed in this case, perform a Ridge regularization for each, and store the training and testing scores ($R^2$). We then plot these against alpha. From the figure we see the optimal value occurs at alpha approximately 20.
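As an aside, a minimal sketch of the `RidgeCV` approach, using the same logarithmic grid of alpha values as the cell below (`LassoCV` works analogously for Lasso):
%% Cell type:code id: tags:
``` python
# let RidgeCV select alpha by cross-validation over a logarithmic grid of candidate values
ridge_cv = RidgeCV(alphas=np.logspace(-2, 2, num=21))
ridge_cv.fit(X_train, Y_train)
print("alpha selected by RidgeCV:", ridge_cv.alpha_)
print("Testing score is", r2_score(Y_test, ridge_cv.predict(X_test)))
```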
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# create an array of 21 alpha values logarithmically distributed between 10**(-2) and 10**2
alfas = np.logspace(-2, 2, num=21)
# create two arrays for storage of the same size as alfas, but filled with zeros
ridge_training_score = np.zeros_like(alfas)
ridge_rsquared = np.zeros_like(alfas)
# loop over the values in the alfas array, at each loop the current value is alfa
# and idx is incremented by 1, starting at 0
for idx, alfa in enumerate(alfas):
    ridge = Ridge(alpha=alfa)  # instantiate Ridge regularization with the current alfa
    ridge.fit(X_train, Y_train)  # train the model on our data set
    # calculate the training score and store in the array ridge_training_score
    ridge_training_score[idx] = ridge.score(X_train, Y_train)
    # calculate the testing score and store in the array ridge_rsquared
    ridge_rsquared[idx] = r2_score(Y_test, ridge.predict(X_test))
# plot the training and testing scores against alpha on a logarithmic axis
plt.semilogx(alfas, ridge_training_score, label='Training')
plt.semilogx(alfas, ridge_rsquared, label='Testing')
plt.xlabel('alpha')
plt.ylabel('$R^2$')
plt.legend()
plt.show()
```

%% Cell type:markdown id: tags:
Now we can investigate Ridge regularization using the optimal value of alpha, approximately 20. We see significantly reduced variance in the coefficients, and that the most important variables are `BMI`, `BP` and `S5`.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# cross_validate takes the particular model, in this case ridge regularization,
# and undertakes a number of runs according to the method specified by cv=
# RepeatedKFold splits the data into 5 sections and repeats the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitted model for each run
cv_ridge = cross_validate(Ridge(alpha=20), X, Y, cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=rng), return_estimator=True)
# collect the coefficients from each run and plot their spread as a boxplot
pd.DataFrame([est.coef_ for est in cv_ridge['estimator']], columns=X.columns).boxplot(figsize=(10, 6))
plt.show()
```

%% Cell type:markdown id: tags:
### Lasso regularization
We can repeat the same process for Lasso regularization, which adds a penalty term proportional to the sum of the absolute values of the coefficients. Here the optimal value is alpha approximately 2.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# create two arrays for storage of the same size as alfas, but filled with zeros
lasso_training_score = np.zeros_like(alfas)
lasso_rsquared = np.zeros_like(alfas)
# loop over the values in the alfas array, at each loop the current value is alfa
# and idx is incremented by 1, starting at 0
for idx, alfa in enumerate(alfas):
    lasso = Lasso(alpha=alfa)  # instantiate Lasso regularization with the current alfa
    lasso.fit(X_train, Y_train)  # train the model on our data set
    # calculate the training score and store in the array lasso_training_score
    lasso_training_score[idx] = lasso.score(X_train, Y_train)
    # calculate the testing score and store in the array lasso_rsquared
    lasso_rsquared[idx] = r2_score(Y_test, lasso.predict(X_test))
# plot the training and testing scores against alpha on a logarithmic axis
plt.semilogx(alfas, lasso_training_score, label='Training')
plt.semilogx(alfas, lasso_rsquared, label='Testing')
plt.xlabel('alpha')
plt.ylabel('$R^2$')
plt.legend()
plt.show()
```

%% Cell type:markdown id: tags:
Again, it can be seen that the most significant variables are `BMI`, `BP` and `S5`. In this case the coefficients for `AGE`, `S2` and `S4` are zero or close to zero.
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# cross_validate takes the particular model, in this case lasso regularization,
# and undertakes a number of runs according to the method specified by cv=
# RepeatedKFold splits the data into 5 sections and repeats the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitted model for each run
cv_lasso = cross_validate(Lasso(alpha=2), X, Y, cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=rng), return_estimator=True)
# collect the coefficients from each run and plot their spread as a boxplot
pd.DataFrame([est.coef_ for est in cv_lasso['estimator']], columns=X.columns).boxplot(figsize=(10, 6))
plt.show()
```

%% Cell type:markdown id: tags:
You will notice that several of the regression coefficient distributions differ significantly between the original linear model and the models regularised using Lasso or Ridge regression. The key examples here are `S1` to `S4`; in particular, observe that `S3` has gone from a positive effect to a negative effect, while `S2` and `S4` have little to no effect.
We can look back at the correlation table to explain why this change occurs: it is due to multicollinearity. The correlation table shows that some of our input variables (`S1` to `S4`) have relatively strong correlations with one another. When we fit a linear model to input data containing two correlated variables A and B, we can produce multiple models with equivalent error by increasing the coefficient of A, while decreasing the coefficient of B to counteract the effect. Introducing regularisation tends to reduce this effect by introducing a penalty for the training algorithm's tendency to increase the coefficient values. The results of these regularised models indicate that we could potentially remove `S2` and `S4` from the model; they are strongly correlated with `S1` and `S3` respectively so do not contribute useful information.
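As a quick check (assuming the correlation table `corrs` from the earlier cell is still available), we can display the relevant block of correlations:
%% Cell type:code id: tags:
``` python
# display the correlations between the serum measurements discussed above
print(corrs.loc[['S1', 'S2', 'S3', 'S4'], ['S1', 'S2', 'S3', 'S4']])
```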
%% Cell type:markdown id: tags:
### Feature Selection
These results indicate that we could remove some unimportant/confounding features from the model in order to reduce variance. We have identified here that BMI, BP and S5 are the features with the highest importance. It may be worth investigating whether we can achieve good predictions on the test set by using a model which only uses these features.
Interestingly, this does align with what we saw in the correlation coefficients - BMI, BP and S5 do have the strongest correlations with Y.
Including 'confounding' features in a model can be problematic. They may result in overfitting to the training data, and lead us to observe an effect that is actually not there. They can also inflate the importance of other variables in the case of multicollinearity.
Thus, it is important to conduct some trials of models with different subsets of features. Correlation tables work well as a guide here, but ultimately we need to assess the effect of model changes on the error measure for the **test** set, as sketched below.
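A minimal sketch of such a trial, refitting on the reduced feature set identified above and comparing testing scores (reusing the train/test split from earlier):
%% Cell type:code id: tags:
``` python
# refit using only the three most important features and compare testing scores
subset = ['BMI', 'BP', 'S5']
linear_subset = LinearRegression().fit(X_train[subset], Y_train)
print("Testing score (all features):", rsquared_linear)
print("Testing score (BMI, BP, S5): ", r2_score(Y_test, linear_subset.predict(X_test[subset])))
```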