Unverified Commit 653f1ad3 authored by Simon Bowly's avatar Simon Bowly
Browse files

Week 5 ADS1002 notebook.

parent 411a5979
%% Cell type:markdown id: tags:
## Multivariate Linear Regression
* Topic: Supervised machine learning
* Unit: ADS1002
* Level: Beginner
* Authors: Simon Bowly
* Version: 3
Required files (download these from [the Gitlab site](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/Regression) into the same directory as the notebook on your computer):
* [Diabetes_Data.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/Regression/Diabetes_Data.csv)
The objective of this notebook is to demonstrate how to build linear regression models with many features, extending the simple one- and two-variable regression models we have looked at so far. We'll use the model fitting methods in `sklearn` to determine the model parameters, and perform train-test splitting in order to assess the generalisabilty of the fitted models. We'll also look at the impact of feature normalisation on model parameters.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split # for splitting the data into training and testing sets
from sklearn.linear_model import LinearRegression # models we are going to use
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error # for comparing the predicted and test values
import seaborn as sns
df = pd.read_csv('Diabetes_Data.csv') # read the Diabetes dataset in to a pandas dataframe
```
%% Cell type:markdown id: tags:
### Data Analysis
`BP` is blood pressure, `BMI` is body mass index (healthy range for adults is 18.5 to 25) and `S1`-`S6` are various blood serum measurements. `Y` is the target variable, and is a measure of disease progression one year after the original measurements. Aim is to predict `Y` from the other variables.
We can look at the first few values of the variables, and at the descriptive statistics.
%% Cell type:code id: tags:
``` python
df.head()
```
%%%% Output: execute_result
AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y
0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151
1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75
2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141
3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206
4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135
%% Cell type:code id: tags:
``` python
df.describe()
```
%%%% Output: execute_result
AGE SEX BMI BP S1 S2 \
count 442.000000 442.000000 442.000000 442.000000 442.000000 442.000000
mean 48.518100 1.468326 26.375792 94.647014 189.140271 115.439140
std 13.109028 0.499561 4.418122 13.831283 34.608052 30.413081
min 19.000000 1.000000 18.000000 62.000000 97.000000 41.600000
25% 38.250000 1.000000 23.200000 84.000000 164.250000 96.050000
50% 50.000000 1.000000 25.700000 93.000000 186.000000 113.000000
75% 59.000000 2.000000 29.275000 105.000000 209.750000 134.500000
max 79.000000 2.000000 42.200000 133.000000 301.000000 242.400000
S3 S4 S5 S6 Y
count 442.000000 442.000000 442.000000 442.000000 442.000000
mean 49.788462 4.070249 4.641411 91.260181 152.133484
std 12.934202 1.290450 0.522391 11.496335 77.093005
min 22.000000 2.000000 3.258100 58.000000 25.000000
25% 40.250000 3.000000 4.276700 83.250000 87.000000
50% 48.000000 4.000000 4.620050 91.000000 140.500000
75% 57.750000 5.000000 4.997200 98.000000 211.500000
max 99.000000 9.090000 6.107000 124.000000 346.000000
%% Cell type:markdown id: tags:
Here the field `SEX` only has two values.
%% Cell type:code id: tags:
``` python
df['SEX'].unique()
```
%%%% Output: execute_result
array([2, 1])
%% Cell type:markdown id: tags:
Categorical data such as this can be converted to binary columuns using the `sklearn` method `OneHotEncoder`.However, since `SEX` is already binary, there is no need to do this. If we had a field which had three categories, for example, the data came from three different states, then we could create a binary field for each state, with the value 1 if the person is resident in the state and 0 otherwise. For more details see `sklearn.feature_extraction.DictVectorizer` and `sklearn.preprocessing.OneHotEncoder`.
We can look at the variable correlations to search for patterns. `SEX` doesn't seem to be important in predicting `Y`, and `AGE`, `S1` and `S2` are only marginally important. The most important variables seem to be `BMI`, `BP`, `S4` and `S5`. There is a strong correlation between `S1` and `S2`, and `S3` and `S4`.
%% Cell type:code id: tags:
``` python
corrs = df.corr() # calculate the correlation table
# as this is a symmetric table, set up a mask so that we only plot values below the main diagonal
mask = np.triu(np.ones_like(corrs, dtype=bool))
f, ax = plt.subplots(figsize=(10, 8)) # initialise the plots and axes
# plot the correlations as a seaborn heatmap, with a colourbar
sns.heatmap(corrs, mask=mask, center=0, annot=True, square=True, linewidths=.5)
# do some fiddling so that the top and bottom are not obscured
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5);
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
### Linear Regression
To carry out regression modelling we now create a matrix of independent variables (features) given by `X`, and the dependent or target variable `Y`. Specifically, we are building models of the form:
$$
Y_{pred} = \text{intercept} + c_{\text{age}} \times \text{AGE} + ... + c_{\text{S6}} \times \text{S6}
$$
and fitting the model parameters in order to minimise the cost function:
$$
\sum ( Y - Y_{pred} ) ^ 2
$$
%% Cell type:code id: tags:
``` python
X = df.drop(['Y'],axis=1) # Create a Dataframe without the target variable Y.
Y = df['Y'] # Extract a Series of the target Y values.
```
%% Cell type:markdown id: tags:
Now we split the data into training and testing sets, and model the data using the `sklearn` Linear Regression model. All supervised modelling methods (those with input features and output targets) follow the same key steps process in `sklearn`:
1. Split data into training and testing subsets
2. Instantiate a model type (with parameters)
3. Fit model parameters to the training data
4. Evaluate model performance on the testing data
%% Cell type:code id: tags:
``` python
# Split into a training set with 80% of the data, and a testing set as the remainder.
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.8,
random_state=np.random.RandomState(31287) # Keeps the 'random' split consistent.
)
# Construct and fit the model
linear = LinearRegression() # Instantatiate the linear regression model
linear.fit(X_train,Y_train) # Fit the model parameters to the training data.
# Evaluate model performance.
training_predictions = linear.predict(X_train) # Get model predictions for both.
testing_predictions = linear.predict(X_test) # training and testing data.
# Create a table of the various scores.
pd.DataFrame({
"R^2": {
"train": r2_score(Y_train, training_predictions),
"test": r2_score(Y_test, testing_predictions)
},
"RMSE": {
"train": mean_squared_error(Y_train, training_predictions, squared=False),
"test": mean_squared_error(Y_test, testing_predictions, squared=False),
},
"MAE": {
"train": mean_absolute_error(Y_train, training_predictions),
"test": mean_absolute_error(Y_test, testing_predictions),
},
})
```
%%%% Output: execute_result
R^2 RMSE MAE
train 0.643135 49.183238 40.645635
test 0.401523 58.304965 46.367576
%% Cell type:markdown id: tags:
We have three error scores here:
* $R^2$ is a 'goodness-of-fit' measure between 0 and 1,
* RMSE, the measure which linear regression aims to minimise, and
* MAE, similarly an error metric we want to minimise
Based on the $R^2$ scores we can see that the overall fit is not very good. Prediction error is quite high, being equivalent to about half a standard deviation in the target `Y` variable.
We can also see that the error measures vary quite significantly between the training and testing sets. Some variation is expected (we will always see lower $R^2$ and higher error on testing compared to training sets), but significant variation may indicate that the model is **overfitted** to the training data, and does not **generalise** well to the testing data.
%% Cell type:markdown id: tags:
### Analysis of Coefficients
We can investigate the model coefficients to get an idea of the importance of each independent variable in predicting `Y`. However, we cannot directly compare coefficients as their dimensions are different. For example, the `AGE` coefficient will be inversely proportional to years, while the `BP` coefficient will be inversely proportional to mm of Hg. Hence to compare the coefficients we need to make them dimensionless and we also need to know the variability of each of the dependent variables.
%% Cell type:code id: tags:
``` python
feature_names = X.columns.tolist() # write the column names to a list
# create a dataframe for which the columns are the feature names and regression coefficient
coefficients = pd.Series(data=linear.coef_.transpose(), index=feature_names)
coefficients.plot(kind='bar', figsize=(10, 7)) # plot these as a bar plot
plt.title('Linear Regression: Coefficient Importance') # add a title
plt.axhline(y=0, color='.5') # add the reference line y = 0
plt.subplots_adjust(left=.3) # move to the left a little bit
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
The variability of the features can be obtained from the standard deviation, which also has the same dimensions as the original variables. We can see a roughly inverse relationship between the coefficients and the standard deviation, i.e., as one goes up the other goes down.
%% Cell type:code id: tags:
``` python
X_train.std(axis=0).plot(kind='bar', figsize=(10, 7)) # plot the standard deviation of the training set features
plt.title('Features std. dev.') # add a title
plt.subplots_adjust(left=.3) # move a bit to the left
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Hence the importance of each coefficient can be found by multiplying the coefficient by the standard deviation for each variable. This suggests that `S1` and `S2` have significant effects on `Y`, which contradicts the conclusions drawn from the corrrelation table earlier. What we haven't yet taken into account is the variability of the coefficients.
%% Cell type:code id: tags:
``` python
# create a dataframe with the columns being the feature names and regression coefficients multiplied by the
# standard deviation
coefs = pd.DataFrame(
linear.coef_ * X_train.std(axis=0),
columns=['Coefficient importance'], index=feature_names
)
# do a similar plot of this variable, as above
coefs.plot(kind='bar', figsize=(10, 7))
plt.title('Linear Regression')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
To compare the importance of the coefficients it is easiest to normalise the independent variables from the start. The dependent variable does not need to be normalised. We normalise all the variables such that their mean is zero and their standard deviation is one, which can be done by subtracting the mean and then dividing by the standard deviation. This is equivalent to multiplying the coefficients by the standard deviation.
%% Cell type:code id: tags:
``` python
nX =(X-X.mean())/X.std() # create nX, a normalised version of X
nX.describe() # show the descriptive statistics of nX
```
%%%% Output: execute_result
AGE SEX BMI BP S1 \
count 442.000000 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02
mean 0.000000 2.893613e-16 1.125294e-16 1.044916e-15 -2.893613e-16
std 1.000000 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -2.251738 -9.374744e-01 -1.895781e+00 -2.360375e+00 -2.662394e+00
25% -0.783285 -9.374744e-01 -7.188104e-01 -7.697777e-01 -7.192046e-01
50% 0.113044 -9.374744e-01 -1.529591e-01 -1.190789e-01 -9.073818e-02
75% 0.799594 1.064282e+00 6.562083e-01 7.485196e-01 5.955183e-01
max 2.325260 1.064282e+00 3.581660e+00 2.772916e+00 3.232188e+00
S2 S3 S4 S5 S6
count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02
mean -1.125294e-16 -1.245861e-16 -1.527185e-16 2.451533e-16 2.531911e-16
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -2.427874e+00 -2.148448e+00 -1.604285e+00 -2.648040e+00 -2.893112e+00
25% -6.375263e-01 -7.374604e-01 -8.293610e-01 -6.981574e-01 -6.967595e-01
50% -8.020037e-02 -1.382738e-01 -5.443750e-02 -4.089059e-02 -2.263165e-02
75% 6.267323e-01 6.155415e-01 7.204860e-01 6.810788e-01 5.862581e-01
max 4.174548e+00 3.804760e+00 3.889923e+00 2.805543e+00 2.847848e+00
%% Cell type:markdown id: tags:
We repeat the linear regression, but now using the normalized variables. While the model parameters will change, there is no change in any of our error measures. This is because the model itself has not really changed: scaling the input features is offset by scaling in the model parameters.
%% Cell type:code id: tags:
``` python
# The below code is identical to our first attempt, the only change is
# substituting nX for X in train_test_split, to use our normalised data.
# Split into a training set with 80% of the data, and a testing set as the remainder.
X_train, X_test, Y_train, Y_test = train_test_split(
nX, Y, test_size=0.8,
random_state=np.random.RandomState(31287) # Keeps the 'random' split consistent.
)
# Construct and fit the model
linear = LinearRegression() # Instantatiate the linear regression model
linear.fit(X_train,Y_train) # Fit the model parameters to the training data.
# Evaluate model performance.
training_predictions = linear.predict(X_train) # Get model predictions for both.
testing_predictions = linear.predict(X_test) # training and testing data.
# Create a table of the various scores.
pd.DataFrame({
"R^2": {
"train": r2_score(Y_train, training_predictions),
"test": r2_score(Y_test, testing_predictions)
},
"RMSE": {
"train": mean_squared_error(Y_train, training_predictions, squared=False),
"test": mean_squared_error(Y_test, testing_predictions, squared=False),
},
"MAE": {
"train": mean_absolute_error(Y_train, training_predictions),
"test": mean_absolute_error(Y_test, testing_predictions),
},
})
```
%%%% Output: execute_result
R^2 RMSE MAE
train 0.643135 49.183238 40.645635
test 0.401523 58.304965 46.367576
%% Cell type:markdown id: tags:
Again we can plot the linear regression coefficients, but this time we compare them against our original linear regression coefficients to investigate the variability. It can now be seen that the effect of the blood serum measurements can have considerable variability.
What we can see here is that the importance of feature S5 was over-inflated when we looked at the coefficients without normalisation, due to it's very low variance
%% Cell type:code id: tags:
``` python
# create a new dataframe with the regression coefficients from the normalised data
normed_coefficients = pd.Series(data=linear.coef_.transpose(), index=feature_names)
# add our original coefficient importance to this dataframe
all_coefficients = pd.DataFrame({
'Normalised': normed_coefficients,
'Original': coefficients,
})
# do a similar horizontal plot as before
ax = all_coefficients.plot(kind='bar',figsize=(10,7))
plt.title('Linear Regression')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
To investigate the variability we can use the `sklearn` methods `cross_validate` and `RepeatedKFold`. The first of these performs a number of runs of a model. The second splits the data in n sections and repeats the calculations m times. This gives n.m runs to investigate the variability of the coefficients. The variability of these can then be plotted using a boxplot.
We can see `AGE`, `SEX`, `BP`, `BMI` and `S6` have very low variance, whereas `S1`-`S5` have high variance. In particular we can see that `AGE` and `S3` do not have a clear effect in the model: in some model fits, they produce a negative coefficient, while in others they produce a positive coefficient. It is therefore not clear whether there is any statistical significance of these parameters in the model.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_validate, RepeatedKFold # import sklearn methods
# cross_validate takes the particular model, in this case linear regression which we instantatiated earlier,
# and undertakes a number of runs according the method specified by cv=
# RepeatedKFold splits the data into n sections and repeat the regression modelling 5 times, giving 25 runs
# return_estimator=True returns the fitting data for each run
scores = cross_validate(
linear, nX, Y,
cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=np.random.RandomState(2351786)),
return_estimator=True,
)
# take the results for each simulation (estimator), extract the coefficients for each run
# and add them to a dataframe with columns being the feature names
coefs = pd.DataFrame([est.coef_ for est in scores['estimator']],columns=nX.columns)
# plot the descriptive statics of the coefficients in a box and whisker plot to show variability
ax = coefs.plot(kind='box',figsize=(10,7))
plt.title('Linear Regression')
plt.axhline(y=0, color='.5')
plt.subplots_adjust(left=.3)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Exercises
### Exercise 1
Q: Compare the model coefficients computed in the linear regression with the correlation scores found in the initial data analysis (choose any plot or table which you think is suitable for this comparison). Comment on whether the model fitting results are consistent with the correlation scores.
A:
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Exercise 2
Q: Try fitting a linear regression model to a subset of the features. To do this, select 3-4 columns from the `nX` dataframe to use when constructing the train-test split, then fit the model parameters to this new training data as before. Describe how you selected your subset of features. Did the model coefficients for your chosen features turn out to be similar or different to those in the full model? (If there is a difference, explain why).
A:
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Exercise 3
Q: Repeat the k-fold cross-validation test for coefficient variability, but first remove the feature 'S2' from the dataset. Observe the impact on the coefficients for the other features. In particular, look at the coefficient for 'S1'. You should see that the range of values produced for the S1 coefficient is much narrower. Can you think of a reason why this is the case (consider the mathematical form of the model)?
A:
%% Cell type:code id: tags:
``` python
```
......@@ -40,6 +40,7 @@
* Week 2 [Measuring Errors and Fitting Models](Machine-Learning/Supervised-Methods/Errors-and-Models.ipynb)
* Week 3 [Linear Regression Algorithms](Machine-Learning/Supervised-Methods/Regression-Fitting.ipynb)
* Week 4 [Time Series Data](Pandas-DataFrames/Time-Series/Time-Series-Data-Intro.ipynb)
* Week 5 [Multivariate Linear Regression](Machine-Learning/Supervised-Methods/Regression/Multivariate-Linear-Regression.ipynb)
### ADS2001
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment