Commit 1adb9ff8 authored by Simon Clarke's avatar Simon Clarke
Browse files

Added imputation notebook

parent 4d4e6d94
%% Cell type:markdown id: tags:
# Imputing Missing Data
%% Cell type:markdown id: tags:
In this notebook we investigate ways for dealing with missing data using Scikit-Learn's imputation routines. There are three main routines we will discuss: `SimpleImputer`, `KNNImputer` and `IterativeImputer`. We will also only discuss imputing continuous, numerical values, for imputing categorical values possible approaches are using the mode or the `KNNImputer`.
We will use the [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database), which can be downloaded from [Monash Gitlab](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Imputation/pima_indians_diabetes.csv). This aims to predict whether the patient has diabetes from a number of diagnostic measurements. All patients are females of Pima Indian heritage, who are at least 21 years old.
Previously, we have dealt with missing data by deleting that entry. However, that means losing valuable data which contributes to the training of your model. A better approach is to impute the data, i.e., infer the missing data from the existing observations.
We will concentrate here on Scikit-Learn's imputation routines, although some of the techniques, such as replacement of values with the mean or mode, can be easily implemented in Pandas.
We first import the standard libraries and the csv file.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pima = pd.read_csv("pima_indians_diabetes.csv")
```
%% Cell type:markdown id: tags:
We can now view a random sample of the data. In the columns `BloodPressure`, `SkinThickness` and `Insulin` there are values of 0, which are clearly not physical. This is indicative of missing values.
%% Cell type:code id: tags:
``` python
pima.sample(20)
```
%%%% Output: execute_result
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
480 3 158 70 30 328 35.5
11 10 168 74 0 0 38.0
111 8 155 62 26 495 34.0
186 8 181 68 36 495 30.1
396 3 96 56 34 115 24.7
258 1 193 50 16 375 25.9
26 7 147 76 0 0 39.4
679 2 101 58 17 265 24.2
139 5 105 72 29 325 36.9
269 2 146 0 0 0 27.5
589 0 73 0 0 0 21.1
547 4 131 68 21 166 33.1
401 6 137 61 0 0 24.2
168 4 110 66 0 0 31.9
12 10 139 80 0 0 27.1
412 1 143 84 23 310 42.4
6 3 78 50 32 88 31.0
153 1 153 82 42 485 40.6
71 5 139 64 35 140 28.6
632 2 111 60 0 0 26.2
DiabetesPedigreeFunction Age Outcome
480 0.344 35 1
11 0.537 34 1
111 0.543 46 1
186 0.615 60 1
396 0.944 39 0
258 0.655 24 0
26 0.257 43 1
679 0.614 23 0
139 0.159 28 0
269 0.240 28 1
589 0.342 25 0
547 0.160 28 0
401 0.151 55 0
168 0.471 29 0
12 1.441 57 0
412 1.076 22 0
6 0.248 26 1
153 0.687 23 0
71 0.411 26 0
632 0.343 23 0
%% Cell type:markdown id: tags:
This can be investigated further by displaying the descriptive statistics, for which it is apparent that `Glucose` and `BMI` also have unrealistic values of 0. A value of `Pregnancies` of 0, is a physically realistic value.
%% Cell type:code id: tags:
``` python
pima.describe()
```
%%%% Output: execute_result
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000
BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000
mean 31.992578 0.471876 33.240885 0.348958
std 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.078000 21.000000 0.000000
25% 27.300000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000
%% Cell type:markdown id: tags:
To see how many 0 values there are in these fields, we can sum the number of rows which match this criteria. The two fields with the most missing values are `SkinThickness` and `Insulin`.
%% Cell type:code id: tags:
``` python
print((pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum())
```
%% Cell type:markdown id: tags:
Since 0 is a valid entry in `Pregnancies` and `Outcome`, it is easiest to mark the missing values as NaN (not a number). This is the default for missing values for the sklearn imputation routines. Marking the values as NaN gives the same number of missing entries as previously.
%% Cell type:code id: tags:
``` python
pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = \
pima[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
print(pima.isnull().sum())
```
%% Cell type:markdown id: tags:
We now set up a simple random forest model to investigate the effect of a selection of different imputation methods on the accuracy and feature importance. The function below creates a random forest model for the diabetes data, prints the accuracy and feature importance in descending order.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def rf_model(pimadf):
Xf = pimadf.drop(columns=['Outcome'])
Yf = np.ravel(pimadf[['Outcome']])
X_train, X_test, Y_train, Y_test = train_test_split(Xf,Yf,test_size=0.8,random_state=0)
rfc = RandomForestClassifier()
rfc.fit(X_train,Y_train) # fit the data to the model
Y_pred = rfc.predict(X_test)
acc = accuracy_score(Y_test,Y_pred)
print("Testing score is %5.3f" % acc)
feature_importances = pd.DataFrame(rfc.feature_importances_,
index = X_train.columns,
columns=['Importance']).sort_values('Importance', ascending=False)
print(feature_importances)
```
%% Cell type:markdown id: tags:
The first example is to drop all rows which have a missing value. This results in approximately half of the dataset being dropped. For this data one of the most important features is `Insulin`, which is the feature with the most missing values.
%% Cell type:code id: tags:
``` python
pima_drop = pima.copy()
pima_drop.dropna(inplace=True)
print('Dropping rows')
print('Shape of array',pima_drop.shape)
print('Shape of original array',pima.shape)
rf_model(pima_drop)
```
%% Cell type:markdown id: tags:
For reference we can plot the distribution of `Insulin` and `SkinThickness` to investigate how different imputation methods affect this. For the other features with missing values the distributions will not be significantly affected, and the exact imputation method probably is not critical to the model.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(1, 2, figsize=(10,4))
sns.histplot(pima_drop, x="Insulin", stat="density", ax = axes[0])
sns.histplot(pima_drop, x="SkinThickness", stat="density", ax = axes[1])
axes[1].set_xlim(0,100)
axes[1].set_ylim(0,0.06)
```
%%%% Output: execute_result
(0.0, 0.06)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Replacing the missing values with the mean or the median now results in a decrease in the accuracy of the model, and the feature importance of `SkinThickness` and `Insulin` being ranked very low. This is understandable, as replacing the missing values with a constant results in the reduction of the variance of the feature.
%% Cell type:code id: tags:
``` python
pima_mean = pima.copy()
pima_mean.fillna(pima_mean.mean(), inplace=True)
print('Imputation using mean')
rf_model(pima_mean)
```
%% Cell type:code id: tags:
``` python
pima_median = pima.copy()
pima_median.fillna(pima_mean.median(), inplace=True)
print('Imputation using median')
rf_model(pima_median)
```
%% Cell type:markdown id: tags:
This reduction in variance can be clearly seen by plotting the distributions of the mean and median datasets. For both, the plots are now dominated by a single peak.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(2, 2, figsize=(10,7))
sns.histplot(pima_mean, x="Insulin", stat="density", ax = axes[0,0])
axes[0,0].set_xlabel('Insulin - Mean')
sns.histplot(pima_median, x="Insulin", stat="density", ax = axes[1,0])
axes[1,0].set_xlabel('Insulin - Median')
sns.histplot(pima_mean, x="SkinThickness", stat="density", ax = axes[0,1])
axes[0,1].set_xlabel('SkinThickness - Mean')
sns.histplot(pima_median, x="SkinThickness", stat="density", ax = axes[1,1])
axes[1,1].set_xlabel('SkinThickness - Median');
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
The first machine learning imputer we consider is the k-Nearest Neighbours imputer. This iterates through all the missing values, treating each one as a label, and then finds the corresponding label of its k-Nearest Neighbours. This will be affected by the distance metric that is used, the number of neighbours and the order that the features are imputed. In this case we use the default values and 5 neighbours.
Now the accuracy is approximately the same as for the dataset where we dropped the rows, and higher than using the mean and median. Also, since the ranking of features is more consistent with the original dataset.
%% Cell type:code id: tags:
``` python
from sklearn.impute import KNNImputer
pima_knnn = pima.copy()
X = pima_knnn.iloc[:,0:8]
Xm = X.mean()
Xs = X.std()
X = (X-X.mean())/X.std()
Xt = KNNImputer(n_neighbors=5).fit_transform(X)
pima_knnn.iloc[:,0:8] = Xt
print('Imputation using k-Nearest Neighbours')
pima_knnn.iloc[:,0:8] = Xs*pima_knnn.iloc[:,0:8]+Xm
rf_model(pima_knnn)
```
%% Cell type:markdown id: tags:
The second method we consider is the sklearn `IterativeImputer`. This is an experimental addition to sklearn, so needs to be enabled as well as imported. As it is experimental, it may change in future versions.
`IterativeImputer` works be marking the missing values, and then repeating the imputation process N times or until the data converges. Initially the missing values are set using a simple scheme, such as being replaced by the mean or median. Then on each iteration a machine learning algorithm is used as a regressor to update each column which is marked as having missing values. The non-missing values are used to train the model, and then the model is used to predict the missing values. Any regression technique could be used to predict the missing values. Common ones that are used are BayesianRidge, k-Nearest Neighbours and Random Forest Regression. Using this algorithm with Random Forest Regression is equivalent to the R routine `missForest`. The routine `KNNImputer` can be seen as `IterativeImputer` with one iteration.
In this example, we use the default algorithm, BayesianRidge. This gives that the testing score slightly, however the feature importance is consistent with the original dataset and the results of ``KNNImputer`.
%% Cell type:code id: tags:
``` python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
pima_ii = pima.copy()
X = pima_ii.iloc[:,0:8]
Xm = X.mean()
Xs = X.std()
X = (X-X.mean())/X.std()
Xt = IterativeImputer(max_iter=20).fit_transform(X)
pima_ii.iloc[:,0:8] = Xt
print('Iterative imputation using Bayesian Ridge')
pima_ii.iloc[:,0:8] = Xs*pima_ii.iloc[:,0:8]+Xm
rf_model(pima_ii)
```
%% Cell type:markdown id: tags:
Plotting the distributions shows that the `KNNImputer` and `IterativeImputer` gives similar results for `Insulin`, but that `IterativeImputer` seems to give a distribution which is more consistent with the original dataset for `SkinThickness`.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(3, 2, figsize=(10,10))
sns.histplot(pima_drop, x="Insulin", stat="density", ax = axes[0,0])
axes[0,0].set_xlabel('Insulin - Drop Rows')
sns.histplot(pima_knnn, x="Insulin", stat="density", ax = axes[1,0])
axes[1,0].set_xlabel('Insulin - kNN')
sns.histplot(pima_ii, x="Insulin", stat="density", ax = axes[2,0])
axes[2,0].set_xlabel('Insulin - Iterative Imputer')
sns.histplot(pima_drop, x="SkinThickness", stat="density", ax = axes[0,1])
axes[0,1].set_xlabel('SkinThickness - Drop Rows')
axes[0,1].set_xlim(0,100)
axes[0,1].set_ylim(0,0.06)
sns.histplot(pima_knnn, x="SkinThickness", stat="density", ax = axes[1,1])
axes[1,1].set_xlabel('SkinThickness - kNN')
axes[1,1].set_xlim(0,100)
axes[1,1].set_ylim(0,0.06)
sns.histplot(pima_ii, x="SkinThickness", stat="density", ax = axes[2,1])
axes[2,1].set_xlabel('SkinThickness - Iterative Imputer')
axes[2,1].set_xlim(0,100)
axes[2,1].set_ylim(0,0.06)
```
%%%% Output: execute_result
(0.0, 0.06)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
For all the examples so far we have only consider one realisation of the Random Forest Regressor. To understand the effectiveness of the various imputation algorithms we need to combine this with cross validation. The following code consider the variation of the f1-score using Logistic Regression for the imputation strategies:
* Drop rows with missing values.
* Simple imputation using the mean.
* Simple imputation using the median.
* k-Nearest Neighbours imputation.
* Iterative imputation using:
* BayesianRidge,
* DecisionTreeRegressor,
* RandomForestRegressor,
* k-Nearest Neighbours Regression.
The individual scores for each run are stored in a dataframe, for which we can finally investigate the descriptive statistics.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
N_SPLITS = 10
X_full = pima_drop.drop(columns=['Outcome'])
Y_full = np.ravel(pima_drop[['Outcome']])
lr_estimator = LogisticRegression(solver='newton-cg', C=1.e3)
score_drop = pd.DataFrame(
cross_val_score(
lr_estimator, X_full, Y_full, scoring='f1', cv=N_SPLITS
),
columns=['Drop Data']
)
```
%% Cell type:code id: tags:
``` python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# Estimate the score after imputation (mean and median strategies)
X_missing = pima.drop(columns=['Outcome'])
Y_missing = np.ravel(pima[['Outcome']])
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
estimator = make_pipeline(
SimpleImputer(missing_values=np.nan, strategy=strategy),
lr_estimator
)
score_simple_imputer[strategy] = cross_val_score(
estimator, X_missing, Y_missing, scoring='f1',
cv=N_SPLITS
)
```
%% Cell type:code id: tags:
``` python
from sklearn.impute import KNNImputer
score_knn_imputer = pd.DataFrame()
estimator = make_pipeline(
KNNImputer(n_neighbors=15),
rf_estimator
)
score_knn_imputer['KNeighborsRegressor'] = cross_val_score(
estimator, X_missing, Y_missing, scoring='f1',
cv=N_SPLITS
)
```
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
estimators = [
BayesianRidge(),
DecisionTreeRegressor(random_state=0),
RandomForestRegressor(random_state=0),
KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
for impute_estimator in estimators:
estimator = make_pipeline(
IterativeImputer(random_state=0, estimator=impute_estimator, max_iter=10),
lr_estimator
)
score_iterative_imputer[impute_estimator.__class__.__name__] = \
cross_val_score(
estimator, X_missing, Y_missing, scoring='f1', cv=N_SPLITS
)
```
%% Cell type:markdown id: tags:
We can now investigate the descriptive statistics for each imputation method in tabular and graphical format. The green dots in the figure represent the mean values for each method.
The first thing to note is that by using imputation with the full dataset, the variance of the model has been reduced significantly, which suggests that the `Drop Data` model suffers from overfitting. This is consistent with the fact that one way to reduce overfitting is to increase the amount of data. In general, as the complexity of the imputer is increased the accuracy also increases, though the result is dependent on the underlying strategy.
For this example, the best methods for imputation seem to:
* Simple imputation using the median.
* Iterative imputation using:
* BayesianRidge,
* RandomForestRegressor.
As with most modelling the final strategy for imputation depends on the model that you use, and should be decided on after extensive initial testing.
%% Cell type:code id: tags:
``` python
scores = pd.concat(
[score_drop, score_simple_imputer, score_knn_imputer, score_iterative_imputer],
keys=['Original', 'SimpleImputer', 'KNN', 'IterativeImputer'], axis=1
)
scores.describe()
```
%%%% Output: execute_result
Original SimpleImputer KNN \
Drop Data mean median KNeighborsRegressor
count 10.000000 10.000000 10.000000 10.000000
mean 0.621762 0.626776 0.632280 0.626371
std 0.132392 0.062190 0.055559 0.059190
min 0.352941 0.520000 0.541667 0.520000
25% 0.543478 0.605662 0.605662 0.605662
50% 0.641026 0.632874 0.640256 0.626225
75% 0.718531 0.663265 0.666667 0.663265
max 0.782609 0.727273 0.727273 0.711111
IterativeImputer \
BayesianRidge DecisionTreeRegressor RandomForestRegressor
count 10.000000 10.000000 10.000000
mean 0.635335 0.629131 0.633366
std 0.056181 0.059494 0.049522
min 0.541667 0.541667 0.541667
25% 0.613384 0.588889 0.612077
50% 0.637331 0.625000 0.645680
75% 0.663265 0.663462 0.663462
max 0.727273 0.739130 0.711111
KNeighborsRegressor
count 10.000000
mean 0.627487
std 0.057144
min 0.541667
25% 0.584496
50% 0.626225
75% 0.666667
max 0.711111
%% Cell type:code id: tags:
``` python
# plot results
fig, ax = plt.subplots(figsize=(13, 6))
scores.plot.box(showmeans=True, vert=False, ax=ax)
ax.set_title('Pima Indian Diabetes Classification with Different Imputation Methods')
ax.set_xlabel('Accuracy (larger is better)')
plt.show()
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Exercises
%% Cell type:code id: tags:
``` python
```
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30,0.484,32,1