Commit 14e90bb1 authored by Simon Clarke's avatar Simon Clarke
Browse files

Added solution notebooks

parent 80fe1017
%% Cell type:markdown id: tags:
# Imputing Missing Data
%% Cell type:markdown id: tags:
In this notebook we investigate ways for dealing with missing data using Scikit-Learn's imputation routines. There are three main routines we will discuss: `SimpleImputer`, `KNNImputer` and `IterativeImputer`. We will also only discuss imputing continuous, numerical values, for imputing categorical values possible approaches are using the mode or the `KNNImputer`.
We will use the [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database), which can be downloaded from [Monash Gitlab](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Imputation/pima_indians_diabetes.csv). This aims to predict whether the patient has diabetes from a number of diagnostic measurements. All patients are females of Pima Indian heritage, who are at least 21 years old.
Previously, we have dealt with missing data by deleting that entry. However, that means losing valuable data which contributes to the training of your model. A better approach is to impute the data, i.e., infer the missing data from the existing observations.
We will concentrate here on Scikit-Learn's imputation routines, although some of the techniques, such as replacement of values with the mean or mode, can be easily implemented in Pandas.
%% Cell type:markdown id: tags:
## Contents
%% Cell type:markdown id: tags:
* Introduction
* Cross-validation analysis
* Exercises
%% Cell type:markdown id: tags:
## Introduction
%% Cell type:code id: tags:
``` python
We first import the standard libraries and the csv file.
```
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pima = pd.read_csv("pima_indians_diabetes.csv")
```
%% Cell type:markdown id: tags:
We can now view a random sample of the data. In the columns `BloodPressure`, `SkinThickness` and `Insulin` there are values of 0, which are clearly not physical. This is indicative of missing values.
%% Cell type:code id: tags:
``` python
pima.sample(20)
```
%%%% Output: execute_result
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
652 5 123 74 40 77 34.1
219 5 112 66 0 0 37.8
205 5 111 72 28 0 23.9
359 1 196 76 36 249 36.5
672 10 68 106 23 49 35.5
402 5 136 84 41 88 35.0
626 0 125 68 0 0 24.7
564 0 91 80 0 0 32.4
34 10 122 78 31 0 27.6
68 1 95 66 13 38 19.6
639 1 100 74 12 46 19.5
400 4 95 64 0 0 32.0
314 7 109 80 31 0 35.9
508 2 84 50 23 76 30.4
105 1 126 56 29 152 28.7
35 4 103 60 33 192 24.0
164 0 131 88 0 0 31.6
244 2 146 76 35 194 38.2
220 0 177 60 29 478 34.6
11 10 168 74 0 0 38.0
DiabetesPedigreeFunction Age Outcome
652 0.269 28 0
219 0.261 41 1
205 0.407 27 0
359 0.875 29 1
672 0.285 47 0
402 0.286 35 1
626 0.206 21 0
564 0.601 27 0
34 0.512 45 0
68 0.334 25 0
639 0.149 28 0
400 0.161 31 1
314 1.127 43 1
508 0.968 21 0
105 0.801 21 0
35 0.966 33 0
164 0.743 32 1
244 0.329 29 0
220 1.072 21 1
11 0.537 34 1
%% Cell type:markdown id: tags:
This can be investigated further by displaying the descriptive statistics, for which it is apparent that `Glucose` and `BMI` also have unrealistic values of 0. A value of 0 for `Pregnancies` is a physically realistic value.
%% Cell type:code id: tags:
``` python
pima.describe()
```
%%%% Output: execute_result
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000
BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000
mean 31.992578 0.471876 33.240885 0.348958
std 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.078000 21.000000 0.000000
25% 27.300000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000
%% Cell type:markdown id: tags:
To see how many 0 values there are in these fields, we can sum the number of rows which match this criteria. The two fields with the most missing values are `SkinThickness` and `Insulin`.
%% Cell type:code id: tags:
``` python
print((pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum())
```
%% Cell type:markdown id: tags:
Since 0 is a valid entry in `Pregnancies` and `Outcome`, it is easiest to mark the missing values as NaN (not a number). This is the default for missing values for the sklearn imputation routines. Marking the values as NaN gives the same number of missing entries as previously.
%% Cell type:code id: tags:
``` python
pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = \
pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
print(pima.isnull().sum())
```
%% Cell type:markdown id: tags:
We now set up a simple random forest model to investigate the effect of a selection of different imputation methods on the accuracy and feature importance. The function below creates a random forest model for the diabetes data, prints the accuracy and feature importance in descending order.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def rf_model(pimadf):
Xf = pimadf.drop(columns=['Outcome'])
Yf = np.ravel(pimadf[['Outcome']])
X_train, X_test, Y_train, Y_test = train_test_split(Xf, Yf, test_size=0.2, random_state=0)
rfc = RandomForestClassifier()
rfc.fit(X_train, Y_train) # fit the data to the model
Y_pred = rfc.predict(X_test)
acc = accuracy_score(Y_test,Y_pred)
print("Testing score is %5.3f" % acc)
feature_importances = pd.DataFrame(rfc.feature_importances_,
index = X_train.columns,
columns=['Importance']).sort_values('Importance', ascending=False)
print(feature_importances)
```
%% Cell type:markdown id: tags:
The first example is to drop all rows which have a missing value. This results in approximately half of the dataset being dropped. For this data one of the most important features is `Insulin`, which is the feature with the most missing values.
%% Cell type:code id: tags:
``` python
pima_drop = pima.copy()
pima_drop.dropna(inplace=True)
print('Dropping rows')
print('Shape of array',pima_drop.shape)
print('Shape of original array',pima.shape)
rf_model(pima_drop)
```
%% Cell type:markdown id: tags:
For reference we can plot the distribution of `Insulin` and `SkinThickness` to investigate how different imputation methods affect this. For the other features with missing values the distributions will not be significantly affected, and the exact imputation method probably is not critical to the model.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(1, 2, figsize=(10,4))
sns.histplot(pima_drop, x="Insulin", stat="density", ax = axes[0])
sns.histplot(pima_drop, x="SkinThickness", stat="density", ax = axes[1])
axes[1].set_xlim(0,100)
axes[1].set_ylim(0,0.06)
```
%%%% Output: execute_result
(0.0, 0.06)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Replacing the missing values with the mean or the median now results in a decrease in the accuracy of the model, and the feature importance of `SkinThickness` and `Insulin` being ranked very low. This is understandable, as replacing the missing values with a constant results in the reduction of the variance of the feature.
%% Cell type:code id: tags:
``` python
pima_mean = pima.copy()
pima_mean.fillna(pima_mean.mean(), inplace=True)
print('Imputation using mean')
rf_model(pima_mean)
```
%% Cell type:code id: tags:
``` python
pima_median = pima.copy()
pima_median.fillna(pima_mean.median(), inplace=True)
print('Imputation using median')
rf_model(pima_median)
```
%% Cell type:markdown id: tags:
This reduction in variance can be clearly seen by plotting the distributions of the mean and median datasets. For both, the plots are now dominated by a single peak.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(2, 2, figsize=(10,7))
sns.histplot(pima_mean, x="Insulin", stat="density", ax = axes[0,0])
axes[0,0].set_xlabel('Insulin - Mean')
sns.histplot(pima_median, x="Insulin", stat="density", ax = axes[1,0])
axes[1,0].set_xlabel('Insulin - Median')
sns.histplot(pima_mean, x="SkinThickness", stat="density", ax = axes[0,1])
axes[0,1].set_xlabel('SkinThickness - Mean')
sns.histplot(pima_median, x="SkinThickness", stat="density", ax = axes[1,1])
axes[1,1].set_xlabel('SkinThickness - Median');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
The first machine learning imputer we consider is the k-Nearest Neighbours imputer. This iterates through all the missing values, treating each one as a label, and then finds the corresponding label of its k-Nearest Neighbours. This will be affected by the distance metric that is used, the number of neighbours and the order that the features are imputed. In this case we use the default values and 5 neighbours.
Now the accuracy is approximately the same as for the dataset where we dropped the rows, and higher than using the mean and median. Also, since the ranking of features is more consistent with the original dataset.
%% Cell type:code id: tags:
``` python
from sklearn.impute import KNNImputer
pima_knnn = pima.copy()
X = pima_knnn.iloc[:,0:8]
Xm = X.mean()
Xs = X.std()
X = (X-X.mean())/X.std()
Xt = KNNImputer(n_neighbors=5).fit_transform(X)
pima_knnn.iloc[:,0:8] = Xt
print('Imputation using k-Nearest Neighbours')
pima_knnn.iloc[:,0:8] = Xs*pima_knnn.iloc[:,0:8]+Xm
rf_model(pima_knnn)
```
%% Cell type:markdown id: tags:
The second method we consider is the sklearn `IterativeImputer`. This is an experimental addition to sklearn, so needs to be enabled as well as imported. As it is experimental, it may change in future versions.
`IterativeImputer` works be marking the missing values, and then repeating the imputation process N times or until the data converges. Initially the missing values are set using a simple scheme, such as being replaced by the mean or median. Then on each iteration a machine learning algorithm is used as a regressor to update each column which is marked as having missing values. The non-missing values are used to train the model, and then the model is used to predict the missing values. Any regression technique could be used to predict the missing values. Common ones that are used are BayesianRidge, k-Nearest Neighbours and Random Forest Regression. Using this algorithm with Random Forest Regression is equivalent to the R routine `missForest`. The routine `KNNImputer` can be seen as `IterativeImputer` with one iteration.
In this example, we use the default algorithm, BayesianRidge. This gives that the testing score slightly, however the feature importance is consistent with the original dataset and the results of `KNNImputer`.
%% Cell type:code id: tags:
``` python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
pima_ii = pima.copy()
X = pima_ii.iloc[:,0:8]
Xm = X.mean()
Xs = X.std()
X = (X-X.mean())/X.std()
Xt = IterativeImputer(max_iter=20).fit_transform(X)
pima_ii.iloc[:,0:8] = Xt
print('Iterative imputation using Bayesian Ridge')
pima_ii.iloc[:,0:8] = Xs*pima_ii.iloc[:,0:8]+Xm
rf_model(pima_ii)
```
%% Cell type:markdown id: tags:
Plotting the distributions shows that the `KNNImputer` and `IterativeImputer` gives similar results for `Insulin`, but that `IterativeImputer` seems to give a distribution which is more consistent with the original dataset for `SkinThickness`.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(3, 2, figsize=(10,10))
sns.histplot(pima_drop, x="Insulin", stat="density", ax = axes[0,0])
axes[0,0].set_xlabel('Insulin - Drop Rows')
sns.histplot(pima_knnn, x="Insulin", stat="density", ax = axes[1,0])
axes[1,0].set_xlabel('Insulin - kNN')
sns.histplot(pima_ii, x="Insulin", stat="density", ax = axes[2,0])
axes[2,0].set_xlabel('Insulin - Iterative Imputer')
sns.histplot(pima_drop, x="SkinThickness", stat="density", ax = axes[0,1])
axes[0,1].set_xlabel('SkinThickness - Drop Rows')
axes[0,1].set_xlim(0,100)
axes[0,1].set_ylim(0,0.06)
sns.histplot(pima_knnn, x="SkinThickness", stat="density", ax = axes[1,1])
axes[1,1].set_xlabel('SkinThickness - kNN')
axes[1,1].set_xlim(0,100)
axes[1,1].set_ylim(0,0.06)
sns.histplot(pima_ii, x="SkinThickness", stat="density", ax = axes[2,1])
axes[2,1].set_xlabel('SkinThickness - Iterative Imputer')
axes[2,1].set_xlim(0,100)
axes[2,1].set_ylim(0,0.06)
```
%%%% Output: execute_result
(0.0, 0.06)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Cross-validation analysis
%% Cell type:markdown id: tags:
For all the examples so far we have only consider one realisation of the Random Forest Regressor. To understand the effectiveness of the various imputation algorithms we need to combine this with cross validation. The following code consider the variation of the f1-score using Logistic Regression for the imputation strategies:
* Drop rows with missing values.
* Simple imputation using the mean.
* Simple imputation using the median.
* k-Nearest Neighbours imputation.
* Iterative imputation using:
* BayesianRidge,
* DecisionTreeRegressor,
* RandomForestRegressor,
* k-Nearest Neighbours Regression.
The individual scores for each run are stored in a dataframe, for which we can finally investigate the descriptive statistics.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold
N_SPLITS = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
classifier = LogisticRegression(solver='newton-cg', C=1.e3)
score = 'f1'
X_full = pima_drop.drop(columns=['Outcome'])
Y_full = np.ravel(pima_drop[['Outcome']])
score_drop = pd.DataFrame(
cross_val_score(
classifier, X_full, Y_full, scoring=score, cv=N_SPLITS
),
columns=['Drop Data']
)
```
%% Cell type:code id: tags:
``` python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# Estimate the score after imputation (mean and median strategies)
X_missing = pima.drop(columns=['Outcome'])
Y_missing = np.ravel(pima[['Outcome']])
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
estimator = make_pipeline(
SimpleImputer(missing_values=np.nan, strategy=strategy),
classifier
)
score_simple_imputer[strategy] = cross_val_score(
estimator, X_missing, Y_missing, scoring=score, cv=N_SPLITS
)
```
%% Cell type:code id: tags:
``` python
from sklearn.impute import KNNImputer
score_knn_imputer = pd.DataFrame()
estimator = make_pipeline(
KNNImputer(n_neighbors=15),
classifier
)
score_knn_imputer['KNeighborsRegressor'] = cross_val_score(
estimator, X_missing, Y_missing, scoring=score, cv=N_SPLITS
)
```
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
estimators = [
BayesianRidge(),
DecisionTreeRegressor(random_state=0),
RandomForestRegressor(random_state=0),
KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
for impute_estimator in estimators:
estimator = make_pipeline(
IterativeImputer(random_state=0, estimator=impute_estimator, max_iter=10),
classifier
)
score_iterative_imputer[impute_estimator.__class__.__name__] = \
cross_val_score(
estimator, X_missing, Y_missing, scoring=score, cv=N_SPLITS
)
```
%% Cell type:markdown id: tags:
We can now investigate the descriptive statistics for each imputation method in tabular and graphical format. The green dots in the figure represent the mean values for each method.
The first thing to note is that by using imputation with the full dataset, the variance of the model has been reduced significantly, which suggests that the `Drop Data` model suffers from overfitting. This is consistent with the fact that one way to reduce overfitting is to increase the amount of data. In general, as the complexity of the imputer is increased the accuracy also increases, though the result is dependent on the underlying strategy.
For this example, the best methods for imputation seem to:
* Simple imputation using the median.
* Iterative imputation using:
* BayesianRidge,
* RandomForestRegressor.
As with most modelling the final strategy for imputation depends on the model that you use, and should be decided on after extensive initial testing.
%% Cell type:code id: tags:
``` python
scores = pd.concat(
[score_drop, score_simple_imputer, score_knn_imputer, score_iterative_imputer],
keys=['Original', 'SimpleImputer', 'KNN', 'IterativeImputer'], axis=1
)
scores.describe()
```
%%%% Output: execute_result
Original SimpleImputer KNN \
Drop Data mean median KNeighborsRegressor
count 15.000000 15.000000 15.000000 15.000000
mean 0.625292 0.632006 0.630530 0.634507
std 0.093487 0.028496 0.028269 0.029221
min 0.387097 0.589474 0.589474 0.589474
25% 0.588040 0.606338 0.604167 0.605114
50% 0.641509 0.639175 0.633663 0.640777
75% 0.673333 0.653870 0.648134 0.659567
max 0.740741 0.675000 0.675000 0.674157
IterativeImputer \
BayesianRidge DecisionTreeRegressor RandomForestRegressor
count 15.000000 15.000000 15.000000
mean 0.637322 0.626273 0.633008
std 0.032623 0.033283 0.028420
min 0.589474 0.571429 0.589474
25% 0.608206 0.600885 0.610594
50% 0.633663 0.631579 0.628571
75% 0.663522 0.654536 0.659471
max 0.688889 0.673913 0.674157
KNeighborsRegressor
count 15.000000
mean 0.627713
std 0.024185
min 0.589474
25% 0.605454
50% 0.629213
75% 0.649111
max 0.659091
%% Cell type:code id: tags:
``` python
# plot results
fig, ax = plt.subplots(figsize=(13, 6))
scores.plot.box(showmeans=True, vert=False, ax=ax)
ax.set_title('Pima Indian Diabetes Classification with Different Imputation Methods')
ax.set_xlabel('Accuracy (larger is better)')
plt.show()
```
%%%% Output: display_data
![](