In this notebook we investigate ways of dealing with missing data using Scikit-Learn's imputation routines. There are three main routines we will discuss: `SimpleImputer`, `KNNImputer` and `IterativeImputer`. We will only discuss imputing continuous, numerical values; for imputing categorical values, possible approaches are using the mode or the `KNNImputer`.
We will use the [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database), which can be downloaded from [Monash Gitlab](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Imputation/pima_indians_diabetes.csv). The task is to predict whether a patient has diabetes from a number of diagnostic measurements. All patients are females of Pima Indian heritage who are at least 21 years old.
Previously, we have dealt with missing data by deleting that entry. However, that means losing valuable data which contributes to the training of your model. A better approach is to impute the data, i.e., infer the missing data from the existing observations.
We will concentrate here on Scikit-Learn's imputation routines, although some of the techniques, such as replacement of values with the mean or mode, can be easily implemented in Pandas.
%% Cell type:markdown id: tags:
## Contents
%% Cell type:markdown id: tags:
* Introduction
* Cross-validation analysis
* Exercises
%% Cell type:markdown id: tags:
## Introduction
%% Cell type:markdown id: tags:
We first import the standard libraries and the csv file.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

pima = pd.read_csv("pima_indians_diabetes.csv")
```
%% Cell type:markdown id: tags:
We can now view a random sample of the data. In the columns `BloodPressure`, `SkinThickness` and `Insulin` there are values of 0, which are clearly not physical. This is indicative of missing values.
This can be investigated further by displaying the descriptive statistics, from which it is apparent that `Glucose` and `BMI` also have unrealistic values of 0. A value of 0 for `Pregnancies`, in contrast, is physically realistic.
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin
max      17.000000  199.000000     122.000000      99.000000  846.000000

              BMI  DiabetesPedigreeFunction         Age     Outcome
count  768.000000                768.000000  768.000000  768.000000
mean    31.992578                  0.471876   33.240885    0.348958
std      7.884160                  0.331329   11.760232    0.476951
min      0.000000                  0.078000   21.000000    0.000000
25%     27.300000                  0.243750   24.000000    0.000000
50%     32.000000                  0.372500   29.000000    0.000000
75%     36.600000                  0.626250   41.000000    1.000000
max     67.100000                  2.420000   81.000000    1.000000
%% Cell type:markdown id: tags:
To see how many 0 values there are in these fields, we can count the number of rows which match this criterion. The two fields with the most missing values are `SkinThickness` and `Insulin`.
Since 0 is a valid entry in `Pregnancies` and `Outcome`, it is easiest to mark the missing values as NaN (not a number). This is the default missing-value marker for the sklearn imputation routines. Marking the values as NaN gives the same number of missing entries as before.
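The notebook's code for this step is not shown in this excerpt; a minimal sketch might look like the following, with a tiny hypothetical stand-in frame in place of the full csv:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# tiny stand-in for the real dataset; in the notebook, use the `pima` frame loaded above
pima = pd.DataFrame({
    'Glucose':       [0, 137, 110],
    'BloodPressure': [72, 0, 70],
    'SkinThickness': [35, 0, 0],
    'Insulin':       [0, 168, 0],
    'BMI':           [33.6, 43.1, 0.0],
})

# columns where a value of 0 indicates a missing measurement
missing_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

print((pima[missing_cols] == 0).sum())        # number of zeros per column
pima[missing_cols] = pima[missing_cols].replace(0, np.nan)
print(pima[missing_cols].isna().sum())        # number of NaNs per column: the same counts
```
%% Cell type:markdown id: tags: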
We now set up a simple random forest model to investigate the effect of a selection of different imputation methods on the accuracy and feature importance. The function below creates a random forest model for the diabetes data, prints the accuracy and feature importance in descending order.
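The function itself is not included in this excerpt; a minimal sketch consistent with that description (the split size and random seed are assumptions) might be:
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_model(df):
    """Fit a random forest on the diabetes data, print accuracy and feature importance."""
    X = df.iloc[:, 0:8]
    y = df['Outcome']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print('Accuracy:', rf.score(X_test, y_test))
    # feature importance in descending order
    importance = pd.Series(rf.feature_importances_, index=X.columns)
    print(importance.sort_values(ascending=False))
```
%% Cell type:markdown id: tags: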
The first example is to drop all rows which have a missing value. This results in approximately half of the dataset being dropped. For this data one of the most important features is `Insulin`, which is the feature with the most missing values.
%% Cell type:code id: tags:
``` python
pima_drop = pima.copy()
pima_drop.dropna(inplace=True)

print('Dropping rows')
print('Shape of array', pima_drop.shape)
print('Shape of original array', pima.shape)
rf_model(pima_drop)
```
%% Cell type:markdown id: tags:
For reference we can plot the distributions of `Insulin` and `SkinThickness` to investigate how different imputation methods affect them. For the other features with missing values the distributions will not be significantly affected, so the exact imputation method is probably not critical to the model.
Replacing the missing values with the mean or the median now results in a decrease in the accuracy of the model, and the feature importance of `SkinThickness` and `Insulin` being ranked very low. This is understandable, as replacing the missing values with a constant results in the reduction of the variance of the feature.
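A sketch of this step with `SimpleImputer`, using a small hypothetical column in place of the full eight-feature frame:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# stand-in column with missing entries; in the notebook, apply this to pima.iloc[:, 0:8]
pima_col = pd.DataFrame({'Insulin': [80.0, np.nan, 120.0, np.nan, 100.0]})

for strategy in ('mean', 'median'):
    imputed = pima_col.copy()
    # both the mean and the median of the observed values here are 100.0
    imputed[:] = SimpleImputer(strategy=strategy).fit_transform(imputed)
    print(strategy, imputed['Insulin'].tolist())
```
%% Cell type:markdown id: tags: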
This reduction in variance can be clearly seen by plotting the distributions of the mean and median datasets. For both, the plots are now dominated by a single peak.
The first machine learning imputer we consider is the k-Nearest Neighbours imputer. For each sample with missing values, it finds the k nearest samples, measured on the features that are present, and fills in each missing entry with the mean of that feature over those neighbours. The result is affected by the distance metric that is used and by the number of neighbours. In this case we use the default distance metric and 5 neighbours.
Now the accuracy is approximately the same as for the dataset where we dropped the rows, and higher than when using the mean or median. The ranking of features is also more consistent with the original dataset.
%% Cell type:code id: tags:
``` python
from sklearn.impute import KNNImputer

pima_knn = pima.copy()
X = pima_knn.iloc[:, 0:8]
Xm = X.mean()
Xs = X.std()
X = (X - Xm) / Xs            # standardise before computing distances
Xt = KNNImputer(n_neighbors=5).fit_transform(X)
pima_knn.iloc[:, 0:8] = Xt

print('Imputation using k-Nearest Neighbours')
pima_knn.iloc[:, 0:8] = Xs * pima_knn.iloc[:, 0:8] + Xm   # undo the scaling
rf_model(pima_knn)
```
%% Cell type:markdown id: tags:
The second method we consider is the sklearn `IterativeImputer`. This is an experimental addition to sklearn, so needs to be enabled as well as imported. As it is experimental, it may change in future versions.
`IterativeImputer` works by marking the missing values and then repeating the imputation process N times, or until the data converges. Initially the missing values are set using a simple scheme, such as replacement by the mean or median. On each iteration a machine learning algorithm is then used as a regressor to update each column which is marked as having missing values: the non-missing values are used to train the model, which then predicts the missing values. Any regression technique can be used to predict the missing values; common choices are BayesianRidge, k-Nearest Neighbours and Random Forest Regression. Using this algorithm with Random Forest Regression is equivalent to the R routine `missForest`, and using it with k-Nearest Neighbours Regression for a single iteration behaves similarly to `KNNImputer`.
In this example, we use the default regressor, BayesianRidge. This changes the testing score slightly; however, the feature importance is consistent with the original dataset and with the results of `KNNImputer`.
%% Cell type:code id: tags:
``` python
from sklearn.experimental import enable_iterative_imputer  # enables the experimental imputer
from sklearn.impute import IterativeImputer

pima_ii = pima.copy()
# reuse the scaling factors Xm and Xs from the KNNImputer cell
pima_ii.iloc[:, 0:8] = IterativeImputer(random_state=0).fit_transform(
    (pima_ii.iloc[:, 0:8] - Xm) / Xs)

print('Iterative imputation using Bayesian Ridge')
pima_ii.iloc[:, 0:8] = Xs * pima_ii.iloc[:, 0:8] + Xm
rf_model(pima_ii)
```
%% Cell type:markdown id: tags:
Plotting the distributions shows that `KNNImputer` and `IterativeImputer` give similar results for `Insulin`, but that `IterativeImputer` seems to give a distribution which is more consistent with the original dataset for `SkinThickness`.
For all the examples so far we have only considered one realisation of the random forest model. To understand the effectiveness of the various imputation algorithms we need to combine this with cross-validation. The following code considers the variation of the f1-score using Logistic Regression for the following imputation strategies:
* Drop rows with missing values.
* Simple imputation using the mean.
* Simple imputation using the median.
* k-Nearest Neighbours imputation.
* Iterative imputation using:
* BayesianRidge,
* DecisionTreeRegressor,
* RandomForestRegressor,
* k-Nearest Neighbours Regression.
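The original cross-validation code is not shown in this excerpt; a condensed sketch of such a comparison using sklearn pipelines could look like the following. The stand-in data, estimator settings, and the choice to standardise before imputing are assumptions, and the row-dropping strategy is omitted since it does not fit the same pipeline pattern:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data: in the notebook, use the diabetes features and the Outcome labels
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)))
X[X > 1.5] = np.nan                       # inject some missing values
y = (rng.random(100) > 0.5).astype(int)

imputers = {
    'Mean': SimpleImputer(strategy='mean'),
    'Median': SimpleImputer(strategy='median'),
    'KNN': KNNImputer(n_neighbors=5),
    'Iterative (BayesianRidge)': IterativeImputer(estimator=BayesianRidge(), random_state=0),
    'Iterative (DecisionTree)': IterativeImputer(
        estimator=DecisionTreeRegressor(random_state=0), random_state=0),
    'Iterative (RandomForest)': IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=10, random_state=0), random_state=0),
    'Iterative (KNN)': IterativeImputer(
        estimator=KNeighborsRegressor(n_neighbors=5), random_state=0),
}

scores = {}
for name, imputer in imputers.items():
    # scale, impute, then classify; StandardScaler ignores NaNs when fitting
    pipe = make_pipeline(StandardScaler(), imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring='f1')

results = pd.DataFrame(scores)
print(results.describe())
```
%% Cell type:markdown id: tags: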
The individual scores for each run are stored in a dataframe, so that we can investigate the descriptive statistics for each imputation method in tabular and graphical format. The green dots in the figure represent the mean values for each method.
The first thing to note is that by using imputation with the full dataset, the variance of the model has been reduced significantly, which suggests that the `Drop Data` model suffers from overfitting. This is consistent with the fact that one way to reduce overfitting is to increase the amount of data. In general, as the complexity of the imputer is increased the accuracy also increases, though the result is dependent on the underlying strategy.
For this example, the best methods for imputation seem to be:
* Simple imputation using the median.
* Iterative imputation using:
* BayesianRidge,
* RandomForestRegressor.
As with most modelling, the final strategy for imputation depends on the model that you use, and should be decided after extensive initial testing.