linear = LinearRegression()  # instantiate the linear regression model
linear.fit(X_train, Y_train)  # fit the model to the training data
training_score = linear.score(X_train, Y_train)  # calculate R^2 for the training set
# use the independent variables of the testing set to predict the target variable
preds_linear = linear.predict(X_test)
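# NOTE: the next line is an addition for illustration, not part of the original cell;
# assuming the same X_test/Y_test split, the testing-set R^2 can be computed for comparison
testing_score = linear.score(X_test, Y_test)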
...
...
We can see that the coefficients for `AGE`, `SEX`, `BP`, `BMI` and `S6` have very low variance across the runs, whereas those for `S1`-`S5` have high variance due to overfitting. Of the low-variance features, `AGE` and `S6` seem to have little effect on the predictions.
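The coefficient spread discussed above comes from repeated cross-validation. As a minimal sketch of how such a run can be set up (assuming the full feature matrix `X` and target `Y`; the notebook's own cell follows), the per-feature variability of the coefficients could be examined like this:

``` python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

rng = np.random.RandomState(1)
# 5 folds repeated 5 times gives the 25 runs referred to above
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=rng)
cv_results = cross_validate(LinearRegression(), X, Y, cv=cv, return_estimator=True)
# stack the fitted coefficients from every run and look at their spread per feature
coefs = np.array([est.coef_ for est in cv_results["estimator"]])
print(coefs.std(axis=0))  # large values flag unstable (overfitted) coefficients
```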
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# cross_validate takes the particular model, in this case the linear regression we instantiated earlier,
# and undertakes a number of runs according to the method specified by cv=
# RepeatedKFold splits the data into 5 sections (folds) and repeats the modelling 5 times, giving 25 runs
# return_estimator=True returns the fitted estimator from each run
To use Ridge regularization (which adds a penalty term proportional to the sum of the squares of the coefficients), we need to find the optimal value of the tuning parameter alpha. Generally this would be done using `RidgeCV`, but here we will graphically compare the training and testing scores: we want the value of alpha at which the testing score is maximised. To generate the figure we create an array of alpha values (in this case logarithmically distributed), perform a Ridge regression for each, and store the testing and training scores ( $R^2$ ). We then plot these against alpha. From the figure we see the optimal value occurs at alpha of approximately 20.
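For reference, this is the objective minimised by scikit-learn's `Ridge` for coefficients $\beta_j$; a larger alpha shrinks the coefficients more strongly, at the cost of some fit to the training data:

$$\min_{\beta}\ \sum_i \left(y_i - \mathbf{x}_i^{\top}\beta\right)^2 + \alpha \sum_j \beta_j^2$$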
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# create an array of 21 alpha values logarithmically distributed between 10**(-2) and 10**2
alfas = np.logspace(-2, 2, num=21)
# create two arrays for storage of the same size as alfas, but filled with zeros
ridge_training_score = np.zeros_like(alfas)
ridge_rsquared = np.zeros_like(alfas)
# loop over the values in the alfas array; at each iteration the current value is alfa
# and idx is the corresponding index, starting at 0
for idx, alfa in enumerate(alfas):
    ridge = Ridge(alpha=alfa, random_state=1235)  # instantiate Ridge regularization with the current alfa
    ridge.fit(X_train, Y_train)  # train the model on our training set
    # calculate the training score and store it in the array ridge_training_score
Now we can investigate Ridge regularization using the optimal value of alpha of approximately 20. We see significantly reduced variance in the coefficients, and that the most important variables are `BMI`, `BP` and `S5`.
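As a minimal sketch of this step (assuming the `X_train`/`Y_train` split from earlier and that `feature_names` holds the column labels, e.g. from `load_diabetes().feature_names`), refitting at the chosen alpha and inspecting the coefficients could look like:

``` python
from sklearn.linear_model import Ridge

ridge_opt = Ridge(alpha=20)      # the alpha value read off the figure above
ridge_opt.fit(X_train, Y_train)  # refit on the training data only
# pair each coefficient with its feature name to see which variables dominate
for name, coef in zip(feature_names, ridge_opt.coef_):
    print(f"{name:>4s}: {coef:8.2f}")
```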
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# cross_validate takes the particular model, in this case the Ridge regularization model,
# and undertakes a number of runs according to the method specified by cv=
# RepeatedKFold splits the data into 5 sections (folds) and repeats the modelling 5 times, giving 25 runs
# return_estimator=True returns the fitted estimator from each run
We can repeat the same process for Lasso regularization, which adds a penalty term proportional to the sum of the absolute values of the coefficients. Here the optimal value for alpha is approximately 2.
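For comparison with the Ridge objective shown earlier, scikit-learn's `Lasso` minimises a per-sample-scaled squared error plus a penalty on the absolute values of the coefficients; it is this $L_1$ penalty that can drive individual coefficients exactly to zero:

$$\min_{\beta}\ \frac{1}{2n}\sum_i \left(y_i - \mathbf{x}_i^{\top}\beta\right)^2 + \alpha \sum_j |\beta_j|$$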
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# create two arrays for storage of the same size as alfas, but filled with zeros
lasso_training_score = np.zeros_like(alfas)
lasso_rsquared = np.zeros_like(alfas)
# loop over the values in the alfas array; at each iteration the current value is alfa
Again, it can be seen that the most significant variables are `BMI`, `BP` and `S5`. In this case the coefficients for `AGE`, `S2` and `S4` are zero or close to zero.
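A minimal sketch of how this sparsity can be checked directly (assuming the same `X_train`/`Y_train` split and `feature_names` as above; this refit is for illustration, not one of the notebook's cells):

``` python
import numpy as np
from sklearn.linear_model import Lasso

lasso_opt = Lasso(alpha=2)  # the alpha value chosen above
lasso_opt.fit(X_train, Y_train)
# features whose coefficients the L1 penalty has driven (essentially) to zero
zeroed = [name for name, coef in zip(feature_names, lasso_opt.coef_) if np.isclose(coef, 0)]
print("Dropped by Lasso:", zeroed)
```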
%% Cell type:code id: tags:
``` python
rng = np.random.RandomState(1)  # make sure the results are repeatable
# cross_validate takes the particular model, in this case the Lasso regularization model,
# and undertakes a number of runs according to the method specified by cv=
# RepeatedKFold splits the data into 5 sections (folds) and repeats the modelling 5 times, giving 25 runs
# return_estimator=True returns the fitted estimator from each run