Previously, we have dealt with missing data by deleting the affected entries. However, that means losing valuable data that could contribute to the training of your model. A better approach is to impute the data, i.e., infer the missing values from the existing observations.
We will concentrate here on Scikit-Learn's imputation routines, although some of the techniques, such as replacement of values with the mean or mode, can be easily implemented in Pandas.
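As a minimal sketch of the Pandas approach, assuming a hypothetical DataFrame `df` with a numeric column `num_col` and a categorical column `cat_col`:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# hypothetical toy data with missing entries
df = pd.DataFrame({"num_col": [1.0, np.nan, 3.0], "cat_col": ["a", "a", None]})

# replace missing numeric values with the column mean
df["num_col"] = df["num_col"].fillna(df["num_col"].mean())

# replace missing categorical values with the column mode
df["cat_col"] = df["cat_col"].fillna(df["cat_col"].mode()[0])
```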
%% Cell type:markdown id: tags:
## Contents
%% Cell type:markdown id: tags:
* Introduction
* Cross-validation analysis
* Exercises
%% Cell type:markdown id: tags:
## Introduction
%% Cell type:markdown id: tags:
We first import the standard libraries and read the CSV file.
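A minimal sketch of this step, assuming the file is named `diabetes.csv` and sits in the working directory:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd

# assumed file name and location; adjust the path as needed
diabetes = pd.read_csv("diabetes.csv")
diabetes.head()
```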
This can be investigated further by displaying the descriptive statistics, from which it is apparent that `Glucose` and `BMI` also have unrealistic values of 0. A value of 0 for `Pregnancies`, on the other hand, is physically realistic.
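Assuming the DataFrame is named `diabetes` as in the sketch above:
%% Cell type:code id: tags:
``` python
# a minimum of 0 for Glucose or BMI signals a missing measurement
diabetes.describe()
```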
The second method we consider is the sklearn `IterativeImputer`. This is an experimental addition to sklearn, so it needs to be explicitly enabled as well as imported. As it is experimental, it may change in future versions.
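Enabling and importing it looks like this:
%% Cell type:code id: tags:
``` python
# the experimental flag must be imported before IterativeImputer itself
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
```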
`IterativeImputer` works by marking the missing values and then repeating the imputation process N times, or until the data converges. Initially the missing values are set using a simple scheme, such as replacement by the mean or median. Then, on each iteration, a machine learning algorithm is used as a regressor to update each column marked as having missing values: the non-missing values are used to train the model, and the model is then used to predict the missing values. Any regression technique could be used here; common choices are BayesianRidge, k-Nearest Neighbours and Random Forest Regression. Using this algorithm with Random Forest Regression is equivalent to the R routine `missForest`, and the routine `KNNImputer` can be seen as `IterativeImputer` with one iteration.
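As a sketch, a missForest-style imputer can be assembled by passing a `RandomForestRegressor` as the estimator; the parameter values and toy array below are illustrative:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# each column with missing values is regressed on the others,
# cycling for up to max_iter rounds or until the imputations converge
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)

X_toy = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_filled = imputer.fit_transform(X_toy)
```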
In this example, we use the default algorithm, BayesianRidge. This changes the testing score only slightly; however, the feature importance is consistent with the original dataset and with the results of `KNNImputer`.
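A sketch of this, assuming the `diabetes` DataFrame from above with the unrealistic zeros in `Glucose` and `BMI` first recoded as `NaN`:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# mark the unrealistic zeros as missing before imputing
diabetes[["Glucose", "BMI"]] = diabetes[["Glucose", "BMI"]].replace(0, np.nan)

# IterativeImputer uses BayesianRidge as its estimator by default
imputer = IterativeImputer(max_iter=10, random_state=0)
diabetes_imputed = imputer.fit_transform(diabetes)
```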
For all the examples so far we have only considered one realisation of the Random Forest Regressor. To understand the effectiveness of the various imputation algorithms we need to combine the evaluation with cross-validation. The following code considers the variation of the F1 score using Logistic Regression for each of the imputation strategies:
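A sketch of such a comparison, assuming a feature matrix `X` (with the missing values marked as `NaN`) and labels `y`; the imputer choices and parameters are illustrative:
%% Cell type:code id: tags:
``` python
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

strategies = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# the imputer sits inside the pipeline so it is refit on each training fold
for name, imputer in strategies.items():
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```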
For the exercises we will use the [Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/Abalone), which can be downloaded from [Monash Gitlab](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Imputation/abalone.csv). It consists of physical measurements of abalones from the Tasmanian coast in the 1990s, collected in an effort to determine their age. Previously the age had to be determined in the laboratory by counting the number of rings in the shell, with $Age = Rings + 1.5$. This is a complete dataset; however, we will randomly remove entries in two columns so that we can perform imputation.
First we load the dataset.
%% Cell type:code id: tags:
``` python
# load the abalone dataset
abalone = pd.read_csv("abalone.csv")
abalone.head()
```
%% Cell type:markdown id: tags:
The `Sex` field has three categorical entries: Male (M), Female (F) and Infant (I). So we need to one-hot encode this field to create three binary columns.
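A minimal sketch with `pd.get_dummies` (the column name `Sex` is taken from the dataset):
%% Cell type:code id: tags:
``` python
# expand Sex into three binary columns: Sex_F, Sex_I, Sex_M
abalone = pd.get_dummies(abalone, columns=["Sex"])
abalone.head()
```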
Lastly, we create a features array (`Xf`) and a label array (`Yf`). Then we randomly remove 33% of the `Height` samples and 25% of the `Shell weight` samples from the features array.
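A sketch of the removal step, assuming `Xf` is a DataFrame and the seed is arbitrary:
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility
X = Xf.copy()
n = len(X)

# knock out 33% of Height and 25% of Shell weight at random
height_idx = rng.choice(n, size=int(0.33 * n), replace=False)
shell_idx = rng.choice(n, size=int(0.25 * n), replace=False)
X.loc[X.index[height_idx], "Height"] = np.nan
X.loc[X.index[shell_idx], "Shell weight"] = np.nan
```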
%% Cell type:markdown id: tags:
### Exercise 1
%% Cell type:markdown id: tags:
Create a Random Forest Regressor model for the full dataset and determine the accuracy of this model. Use an 80:20 split for training and testing.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Exercise 2 (3 marks)
%% Cell type:markdown id: tags:
Fill in the missing values of X using `IterativeImputer` with 10 iterations and using the `BayesianRidge` regressor. Calculate the accuracy of the Random Forest Regressor using this imputed dataset.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Exercise 3 (5 marks)
%% Cell type:markdown id: tags:
Fill in the missing values of X using `IterativeImputer` with 10 iterations and using `KNeighborsRegressor` for 5, 10, 15 and 20 neighbours. Calculate the accuracy of the Random Forest Regressor using each of these imputed datasets.