Commit 62b77bc2 authored by Simon Clarke's avatar Simon Clarke
Browse files

Final version of Imputation notebook.

parent 1adb9ff8
%% Cell type:markdown id: tags:
# Imputing Missing Data
%% Cell type:markdown id: tags:
In this notebook we investigate ways for dealing with missing data using Scikit-Learn's imputation routines. There are three main routines we will discuss: `SimpleImputer`, `KNNImputer` and `IterativeImputer`. We will also only discuss imputing continuous, numerical values, for imputing categorical values possible approaches are using the mode or the `KNNImputer`.
We will use the [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database), which can be downloaded from [Monash Gitlab](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Imputation/pima_indians_diabetes.csv). This aims to predict whether the patient has diabetes from a number of diagnostic measurements. All patients are females of Pima Indian heritage, who are at least 21 years old.
Previously, we have dealt with missing data by deleting that entry. However, that means losing valuable data which contributes to the training of your model. A better approach is to impute the data, i.e., infer the missing data from the existing observations.
We will concentrate here on Scikit-Learn's imputation routines, although some of the techniques, such as replacement of values with the mean or mode, can be easily implemented in Pandas.
%% Cell type:markdown id: tags:
## Contents
%% Cell type:markdown id: tags:
* Introduction
* Cross-validation analysis
* Exercises
%% Cell type:markdown id: tags:
## Introduction
%% Cell type:markdown id: tags:
We first import the standard libraries and the csv file.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pima = pd.read_csv("pima_indians_diabetes.csv")
```
%% Cell type:markdown id: tags:
We can now view a random sample of the data. In the columns `BloodPressure`, `SkinThickness` and `Insulin` there are values of 0, which are clearly not physical. This is indicative of missing values.
%% Cell type:code id: tags:
``` python
pima.sample(20)
```
%%%% Output: execute_result
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
480 3 158 70 30 328 35.5
652 5 123 74 40 77 34.1
219 5 112 66 0 0 37.8
205 5 111 72 28 0 23.9
359 1 196 76 36 249 36.5
672 10 68 106 23 49 35.5
402 5 136 84 41 88 35.0
626 0 125 68 0 0 24.7
564 0 91 80 0 0 32.4
34 10 122 78 31 0 27.6
68 1 95 66 13 38 19.6
639 1 100 74 12 46 19.5
400 4 95 64 0 0 32.0
314 7 109 80 31 0 35.9
508 2 84 50 23 76 30.4
105 1 126 56 29 152 28.7
35 4 103 60 33 192 24.0
164 0 131 88 0 0 31.6
244 2 146 76 35 194 38.2
220 0 177 60 29 478 34.6
11 10 168 74 0 0 38.0
111 8 155 62 26 495 34.0
186 8 181 68 36 495 30.1
396 3 96 56 34 115 24.7
258 1 193 50 16 375 25.9
26 7 147 76 0 0 39.4
679 2 101 58 17 265 24.2
139 5 105 72 29 325 36.9
269 2 146 0 0 0 27.5
589 0 73 0 0 0 21.1
547 4 131 68 21 166 33.1
401 6 137 61 0 0 24.2
168 4 110 66 0 0 31.9
12 10 139 80 0 0 27.1
412 1 143 84 23 310 42.4
6 3 78 50 32 88 31.0
153 1 153 82 42 485 40.6
71 5 139 64 35 140 28.6
632 2 111 60 0 0 26.2
DiabetesPedigreeFunction Age Outcome
480 0.344 35 1
652 0.269 28 0
219 0.261 41 1
205 0.407 27 0
359 0.875 29 1
672 0.285 47 0
402 0.286 35 1
626 0.206 21 0
564 0.601 27 0
34 0.512 45 0
68 0.334 25 0
639 0.149 28 0
400 0.161 31 1
314 1.127 43 1
508 0.968 21 0
105 0.801 21 0
35 0.966 33 0
164 0.743 32 1
244 0.329 29 0
220 1.072 21 1
11 0.537 34 1
111 0.543 46 1
186 0.615 60 1
396 0.944 39 0
258 0.655 24 0
26 0.257 43 1
679 0.614 23 0
139 0.159 28 0
269 0.240 28 1
589 0.342 25 0
547 0.160 28 0
401 0.151 55 0
168 0.471 29 0
12 1.441 57 0
412 1.076 22 0
6 0.248 26 1
153 0.687 23 0
71 0.411 26 0
632 0.343 23 0
%% Cell type:markdown id: tags:
This can be investigated further by displaying the descriptive statistics, for which it is apparent that `Glucose` and `BMI` also have unrealistic values of 0. A value of `Pregnancies` of 0, is a physically realistic value.
This can be investigated further by displaying the descriptive statistics, for which it is apparent that `Glucose` and `BMI` also have unrealistic values of 0. A value of 0 for `Pregnancies` is a physically realistic value.
%% Cell type:code id: tags:
``` python
pima.describe()
```
%%%% Output: execute_result
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000
BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000
mean 31.992578 0.471876 33.240885 0.348958
std 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.078000 21.000000 0.000000
25% 27.300000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000
%% Cell type:markdown id: tags:
To see how many 0 values there are in these fields, we can sum the number of rows which match this criteria. The two fields with the most missing values are `SkinThickness` and `Insulin`.
%% Cell type:code id: tags:
``` python
print((pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum())
```
%%%% Output: stream
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
dtype: int64
%% Cell type:markdown id: tags:
Since 0 is a valid entry in `Pregnancies` and `Outcome`, it is easiest to mark the missing values as NaN (not a number). This is the default for missing values for the sklearn imputation routines. Marking the values as NaN gives the same number of missing entries as previously.
%% Cell type:code id: tags:
``` python
pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = \
pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
print(pima.isnull().sum())
```
%%%% Output: stream
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
%% Cell type:markdown id: tags:
We now set up a simple random forest model to investigate the effect of a selection of different imputation methods on the accuracy and feature importance. The function below creates a random forest model for the diabetes data, prints the accuracy and feature importance in descending order.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def rf_model(pimadf):
Xf = pimadf.drop(columns=['Outcome'])
Yf = np.ravel(pimadf[['Outcome']])
X_train, X_test, Y_train, Y_test = train_test_split(Xf,Yf,test_size=0.8,random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(Xf, Yf, test_size=0.2, random_state=0)
rfc = RandomForestClassifier()
rfc.fit(X_train,Y_train) # fit the data to the model
rfc.fit(X_train, Y_train) # fit the data to the model
Y_pred = rfc.predict(X_test)
acc = accuracy_score(Y_test,Y_pred)
print("Testing score is %5.3f" % acc)
feature_importances = pd.DataFrame(rfc.feature_importances_,
index = X_train.columns,
columns=['Importance']).sort_values('Importance', ascending=False)
print(feature_importances)
```
%% Cell type:markdown id: tags:
The first example is to drop all rows which have a missing value. This results in approximately half of the dataset being dropped. For this data one of the most important features is `Insulin`, which is the feature with the most missing values.
%% Cell type:code id: tags:
``` python
pima_drop = pima.copy()
pima_drop.dropna(inplace=True)
print('Dropping rows')
print('Shape of array',pima_drop.shape)
print('Shape of original array',pima.shape)
rf_model(pima_drop)
```
%%%% Output: stream
Dropping rows
Shape of array (392, 9)
Shape of original array (768, 9)
Testing score is 0.761
Testing score is 0.810
Importance
Glucose 0.277382
Age 0.150262
Insulin 0.146099
BMI 0.127310
BloodPressure 0.092259
SkinThickness 0.084266
DiabetesPedigreeFunction 0.078590
Pregnancies 0.043832
Glucose 0.250222
Insulin 0.148675
Age 0.143560
DiabetesPedigreeFunction 0.112166
BMI 0.111587
SkinThickness 0.087640
BloodPressure 0.082608
Pregnancies 0.063542
%% Cell type:markdown id: tags:
For reference we can plot the distribution of `Insulin` and `SkinThickness` to investigate how different imputation methods affect this. For the other features with missing values the distributions will not be significantly affected, and the exact imputation method probably is not critical to the model.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(1, 2, figsize=(10,4))
sns.histplot(pima_drop, x="Insulin", stat="density", ax = axes[0])
sns.histplot(pima_drop, x="SkinThickness", stat="density", ax = axes[1])
axes[1].set_xlim(0,100)
axes[1].set_ylim(0,0.06)
```
%%%% Output: execute_result
(0.0, 0.06)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Replacing the missing values with the mean or the median now results in a decrease in the accuracy of the model, and the feature importance of `SkinThickness` and `Insulin` being ranked very low. This is understandable, as replacing the missing values with a constant results in the reduction of the variance of the feature.
%% Cell type:code id: tags:
``` python
pima_mean = pima.copy()
pima_mean.fillna(pima_mean.mean(), inplace=True)
print('Imputation using mean')
rf_model(pima_mean)
```
%%%% Output: stream
Imputation using mean
Testing score is 0.733
Testing score is 0.818
Importance
Glucose 0.217646
BMI 0.169133
BloodPressure 0.134262
DiabetesPedigreeFunction 0.121409
Age 0.105867
SkinThickness 0.102177
Pregnancies 0.075955
Insulin 0.073552
Glucose 0.238235
BMI 0.163498
Age 0.136670
DiabetesPedigreeFunction 0.129249
Insulin 0.087349
BloodPressure 0.085985
Pregnancies 0.082068
SkinThickness 0.076946
%% Cell type:code id: tags:
``` python
pima_median = pima.copy()
pima_median.fillna(pima_mean.median(), inplace=True)
print('Imputation using median')
rf_model(pima_median)
```
%%%% Output: stream
Imputation using median
Testing score is 0.741
Testing score is 0.818
Importance
Glucose 0.232470
BMI 0.165862
BloodPressure 0.132452
DiabetesPedigreeFunction 0.114867
Age 0.107119
SkinThickness 0.102624
Insulin 0.077583
Pregnancies 0.067023
Glucose 0.242399
BMI 0.157620
Age 0.136504
DiabetesPedigreeFunction 0.126532
BloodPressure 0.092300
Insulin 0.088321
SkinThickness 0.078420
Pregnancies 0.077905
%% Cell type:markdown id: tags:
This reduction in variance can be clearly seen by plotting the distributions of the mean and median datasets. For both, the plots are now dominated by a single peak.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(2, 2, figsize=(10,7))
sns.histplot(pima_mean, x="Insulin", stat="density", ax = axes[0,0])
axes[0,0].set_xlabel('Insulin - Mean')
sns.histplot(pima_median, x="Insulin", stat="density", ax = axes[1,0])
axes[1,0].set_xlabel('Insulin - Median')
sns.histplot(pima_mean, x="SkinThickness", stat="density", ax = axes[0,1])
axes[0,1].set_xlabel('SkinThickness - Mean')
sns.histplot(pima_median, x="SkinThickness", stat="density", ax = axes[1,1])
axes[1,1].set_xlabel('SkinThickness - Median');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
The first machine learning imputer we consider is the k-Nearest Neighbours imputer. This iterates through all the missing values, treating each one as a label, and then finds the corresponding label of its k-Nearest Neighbours. This will be affected by the distance metric that is used, the number of neighbours and the order that the features are imputed. In this case we use the default values and 5 neighbours.
Now the accuracy is approximately the same as for the dataset where we dropped the rows, and higher than using the mean and median. Also, since the ranking of features is more consistent with the original dataset.
%% Cell type:code id: tags:
``` python
from sklearn.impute import KNNImputer
pima_knnn = pima.copy()
X = pima_knnn.iloc[:,0:8]
Xm = X.mean()
Xs = X.std()
X = (X-X.mean())/X.std()
Xt = KNNImputer(n_neighbors=5).fit_transform(X)
pima_knnn.iloc[:,0:8] = Xt
print('Imputation using k-Nearest Neighbours')
pima_knnn.iloc[:,0:8] = Xs*pima_knnn.iloc[:,0:8]+Xm
rf_model(pima_knnn)
```
%%%% Output: stream
Imputation using k-Nearest Neighbours
Testing score is 0.741
Testing score is 0.792
Importance
Glucose 0.206801
Insulin 0.165200
BMI 0.138590
BloodPressure 0.114475
SkinThickness 0.114471
DiabetesPedigreeFunction 0.101659
Age 0.094806
Pregnancies 0.063998
Glucose 0.222666
Insulin 0.165023
BMI 0.137235
Age 0.125017
DiabetesPedigreeFunction 0.109920
SkinThickness 0.100569
BloodPressure 0.073761
Pregnancies 0.065809
%% Cell type:markdown id: tags:
The second method we consider is the sklearn `IterativeImputer`. This is an experimental addition to sklearn, so needs to be enabled as well as imported. As it is experimental, it may change in future versions.
`IterativeImputer` works be marking the missing values, and then repeating the imputation process N times or until the data converges. Initially the missing values are set using a simple scheme, such as being replaced by the mean or median. Then on each iteration a machine learning algorithm is used as a regressor to update each column which is marked as having missing values. The non-missing values are used to train the model, and then the model is used to predict the missing values. Any regression technique could be used to predict the missing values. Common ones that are used are BayesianRidge, k-Nearest Neighbours and Random Forest Regression. Using this algorithm with Random Forest Regression is equivalent to the R routine `missForest`. The routine `KNNImputer` can be seen as `IterativeImputer` with one iteration.
In this example, we use the default algorithm, BayesianRidge. This gives that the testing score slightly, however the feature importance is consistent with the original dataset and the results of ``KNNImputer`.
In this example, we use the default algorithm, BayesianRidge. This gives that the testing score slightly, however the feature importance is consistent with the original dataset and the results of `KNNImputer`.
%% Cell type:code id: tags:
``` python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
pima_ii = pima.copy()
X = pima_ii.iloc[:,0:8]
Xm = X.mean()
Xs = X.std()
X = (X-X.mean())/X.std()
Xt = IterativeImputer(max_iter=20).fit_transform(X)
pima_ii.iloc[:,0:8] = Xt
print('Iterative imputation using Bayesian Ridge')
pima_ii.iloc[:,0:8] = Xs*pima_ii.iloc[:,0:8]+Xm
rf_model(pima_ii)
```
%%%% Output: stream
Iterative imputation using Bayesian Ridge
Testing score is 0.722
Testing score is 0.779
Importance
Glucose 0.204022
BMI 0.164737
Insulin 0.129289
BloodPressure 0.121702
SkinThickness 0.117260
DiabetesPedigreeFunction 0.109276
Age 0.089874
Pregnancies 0.063840
Glucose 0.227739
BMI 0.148859
Age 0.131762
Insulin 0.125641
DiabetesPedigreeFunction 0.119427
SkinThickness 0.101870
BloodPressure 0.074792
Pregnancies 0.069910
%% Cell type:markdown id: tags:
Plotting the distributions shows that the `KNNImputer` and `IterativeImputer` gives similar results for `Insulin`, but that `IterativeImputer` seems to give a distribution which is more consistent with the original dataset for `SkinThickness`.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(3, 2, figsize=(10,10))
sns.histplot(pima_drop, x="Insulin", stat="density", ax = axes[0,0])
axes[0,0].set_xlabel('Insulin - Drop Rows')
sns.histplot(pima_knnn, x="Insulin", stat="density", ax = axes[1,0])
axes[1,0].set_xlabel('Insulin - kNN')
sns.histplot(pima_ii, x="Insulin", stat="density", ax = axes[2,0])
axes[2,0].set_xlabel('Insulin - Iterative Imputer')
sns.histplot(pima_drop, x="SkinThickness", stat="density", ax = axes[0,1])
axes[0,1].set_xlabel('SkinThickness - Drop Rows')
axes[0,1].set_xlim(0,100)
axes[0,1].set_ylim(0,0.06)
sns.histplot(pima_knnn, x="SkinThickness", stat="density", ax = axes[1,1])
axes[1,1].set_xlabel('SkinThickness - kNN')
axes[1,1].set_xlim(0,100)
axes[1,1].set_ylim(0,0.06)
sns.histplot(pima_ii, x="SkinThickness", stat="density", ax = axes[2,1])
axes[2,1].set_xlabel('SkinThickness - Iterative Imputer')
axes[2,1].set_xlim(0,100)
axes[2,1].set_ylim(0,0.06)
```
%%%% Output: execute_result
(0.0, 0.06)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Cross-validation analysis
%% Cell type:markdown id: tags:
For all the examples so far we have only consider one realisation of the Random Forest Regressor. To understand the effectiveness of the various imputation algorithms we need to combine this with cross validation. The following code consider the variation of the f1-score using Logistic Regression for the imputation strategies:
* Drop rows with missing values.
* Simple imputation using the mean.
* Simple imputation using the median.
* k-Nearest Neighbours imputation.
* Iterative imputation using:
* BayesianRidge,
* DecisionTreeRegressor,
* RandomForestRegressor,
* k-Nearest Neighbours Regression.
The individual scores for each run are stored in a dataframe, for which we can finally investigate the descriptive statistics.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold
N_SPLITS = 10