Required files (download these from the GitLab site [here](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods) into the same directory as the notebook on your computer):
The objective of this notebook is to help you understand some of the terminology, computations and methods behind supervised training of predictive machine learning models. Some of this content will be familiar from the last 3 weeks of semester one where you looked at regression and classification models. Here we take a step back to better understand the concepts behind the modelling methods we will focus on this semester. We'll also review these concepts throughout the semester as we explore further machine learning algorithm types.
%% Cell type:code id:6057e0ae tags:
``` python
# Remember these? Our usual package imports for handling data.
import numpy as np
import pandas as pd
import seaborn as sns
# Specialised functions for calculating prediction error rates.
from sklearn.metrics import precision_score
```
%% Cell type:markdown id:9afd497a tags:
## Supervised Learning
Simply put, **supervised learning** is the process of **training** a machine learning **model** on a sampled dataset with known inputs and outputs. Two key types of supervised learning models are **regression** and **classification** models. We'll cover both this semester, and you've already seen some examples in semester one (linear regression and kNN classification).
Let's first look at some examples of datasets used for regression and classification tasks.
### WHO Life Expectancy Tables
The WHO tracks various metrics for countries as they relate to life expectancy, and mortality rates of different populations. This dataset was originally found here https://www.kaggle.com/kumarajarshi/life-expectancy-who/. Here we'll consider just one input (education level), and one output (adult life expectancy). With this simple dataset we could build a model to predict average life expectancy based on education levels within a country. This is a regression task; the output variable is a continuous measure.
%% Cell type:code id:969a166a tags:
``` python
# Read the dataset into a dataframe.
who_data_2015 = (
    pd.read_csv("who-health-data.csv")  # Read in the csv data.
    .rename(columns=lambda c: c.strip())  # Clean up column names.
    .query("Year == 2015")  # Restrict the dataset to records from 2015.
)
```
This dataset measures various properties of breast cancer biopsy results, along with a diagnosis of whether the tumour being tested is malignant or benign. The dataset was originally found here https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.
There are many features here, but we'll consider just two: radius (size of the tissue sample) and texture (variation in colour across the surface). The diagnosis is a binary state (malignant/benign or positive/negative), so this is a classification dataset.
%% Cell type:code id:581e0329 tags:
``` python
# Read dataset into a dataframe.
wisconsin_cancer_biopsies = (
    pd.read_csv("kaggle-wisconsin-cancer.csv")
    # This tidies up the naming of results (M -> malignant, B -> benign)
    .assign(diagnosis=lambda df: df['diagnosis']
        .map({"M": "malignant", "B": "benign"})
        .astype('category')
    )
)
# Show the true diagnosis as a function of two variables.
```
Supervised learning can be thought of as searching among many possible models and parameter values for the one that best fits the data we have. Fundamental to this idea is a useful error metric: a way to measure how far predictions are from the truth. From the candidate models, we want to select the one that yields the best predictions, in other words the one with the smallest measured error.
### Regression Error Metrics
Let's have a look at a simple set of predictions of life expectancy based on schooling in the WHO dataset. Here we've constructed a simple prediction where
$$
\text{Life expectancy} = 2.5 \times \text{Years of schooling} + 38 \text{ years}
$$
For each sample point in the dataset, we compute the prediction error by finding the difference between the value predicted by our simple model and the actual value. Below, these errors are plotted as a histogram. We can see there is a roughly even spread of over- and under-predictions.
%% Cell type:code id:217b3af6 tags:
``` python
# Compute errors (difference between predicted and correct values).
```
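From these errors, we can compute a single statistic representing the prediction error of this model. This is typically the Mean Absolute Error (MAE): the average absolute deviation of predictions from the true values. Mathematically:
$$
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|
$$
where $\hat{y}_i$ is the predicted value and $y_i$ the true value for sample $i$, and there are $N$ samples in the **training set**.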
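To make the definition concrete, here is a minimal sketch that works through the calculation on four made-up sample points. The toy values and the column names `schooling` and `life_expectancy` are assumptions for illustration only; the gradient (2.5) and intercept (38) are those of the simple prediction model described in this notebook.

```python
import pandas as pd

# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame({
    "schooling": [10, 12, 14, 16],
    "life_expectancy": [65.0, 66.0, 75.0, 77.0],
})

# Simple prediction: 2.5 years of life expectancy per year of schooling, plus 38.
toy["predicted"] = 2.5 * toy["schooling"] + 38

# Signed errors: positive means over-prediction, negative means under-prediction.
errors = toy["predicted"] - toy["life_expectancy"]

# Mean absolute error: drop the signs, then average over the N samples.
mae_score = errors.abs().mean()

print(f"Errors: {errors.tolist()}")
print(f"Mean absolute error is {mae_score:.2f} years")
```

Taking the absolute value before averaging matters: the signed errors here largely cancel out, so their plain mean would understate how far off the predictions are.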
%% Cell type:code id:e1bdbf41 tags:
``` python
# Compute mean absolute error.
mae_score=errors.abs().mean()
print(f"Mean absolute error is {mae_score:.1f} years")
```
%%%% Output: stream
Mean absolute error is 3.9 years
%% Cell type:markdown id:e74bc330 tags:
### Classification Error Metrics
In classification tasks, we compute several statistics from the numbers of correct and incorrect predictions, broken down by whether they correspond to positive or negative labels. Let's look at a small subset of the biopsy data to understand these metrics.
%% Cell type:code id:6609e150 tags:
``` python
# Plot a small set of 20 sample points.
# .sample(20) selects 20 random points. Using the random_state parameter
# ensures that the selected points are always the same.
```
As an example, we develop a simple prediction here: any test with a `radius_mean` value of less than 14 is considered benign, and anything larger is considered malignant. Below we plot a breakdown of the data by both its predicted label and its correct label.
Visualising the data this way, we can define categories of the data points by example. We'll call a malignant sample a 'positive' classification, and benign a 'negative'. So we have:
* 7 **True positives** (cases where we correctly predicted a malignant, or 'positive', sample)
* 10 **True negatives** (cases where we correctly predicted a benign, or 'negative', sample)
* 1 **False positive** (we predicted malignant, but the sample is actually benign)
* 2 **False negatives** (we predicted benign, but the sample is actually malignant)
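The four categories can also be counted in code. The sketch below applies the cutoff of 14 to a small hand-made table; the eight rows are hypothetical values, not taken from the biopsy dataset, and only the column names (`radius_mean`, `diagnosis`, `predicted`) follow the notebook.

```python
import pandas as pd

# Hypothetical toy samples (not real biopsy records).
toy = pd.DataFrame({
    "radius_mean": [11.2, 12.8, 13.6, 14.3, 15.1, 17.9, 13.1, 20.4],
    "diagnosis": ["benign", "benign", "malignant", "benign",
                  "malignant", "malignant", "benign", "malignant"],
})

# Simple rule: a radius_mean of 14 or more is predicted malignant.
toy["predicted"] = toy["radius_mean"].map(
    lambda r: "malignant" if r >= 14 else "benign"
)

# 'Positive' means malignant, for both the true and the predicted label.
actual_positive = toy["diagnosis"] == "malignant"
predicted_positive = toy["predicted"] == "malignant"

true_positives = int((actual_positive & predicted_positive).sum())
true_negatives = int((~actual_positive & ~predicted_positive).sum())
false_positives = int((~actual_positive & predicted_positive).sum())
false_negatives = int((actual_positive & ~predicted_positive).sum())

print(true_positives, true_negatives, false_positives, false_negatives)
```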
%% Cell type:code id:862bf07a tags:
``` python
# I've counted the above results manually from the plot; here's
```
As with Mean Absolute Error in the regression task, we would prefer measures that are expressed as proportions, so that they aren't affected by the size of the dataset. The metrics we use are **accuracy**, **precision** and **recall**. These terms are defined below and computed for this case.
* Accuracy: proportion of all predictions that were correct.
* Precision: proportion of positive predictions which were correct.
* Recall: proportion of actual positive results which were correctly identified.
We typically report all of these metrics for a classification task. However, depending on the context, one metric may matter more than the others: we might accept lower accuracy in exchange for higher precision or better recall. Higher accuracy is not always the preferred outcome.
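As a rough illustration of why accuracy alone can hide differences, the sketch below compares two cutoffs on a small made-up dataset. The ten `radius` values and labels are hypothetical, and `classification_metrics` is a helper written for this example rather than a library function.

```python
import numpy as np

# Hypothetical toy data: radius values and whether each sample is malignant.
radius = np.array([10.0, 11.5, 12.0, 13.0, 13.5, 14.5, 15.0, 16.0, 17.5, 19.0])
is_malignant = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=bool)

def classification_metrics(cutoff):
    """Accuracy, precision and recall for the rule 'malignant if radius >= cutoff'.

    Assumes the cutoff yields at least one positive prediction, otherwise
    precision would involve a division by zero.
    """
    predicted_malignant = radius >= cutoff
    tp = np.sum(predicted_malignant & is_malignant)
    tn = np.sum(~predicted_malignant & ~is_malignant)
    fp = np.sum(predicted_malignant & ~is_malignant)
    fn = np.sum(~predicted_malignant & is_malignant)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

low_cutoff = classification_metrics(13.5)
high_cutoff = classification_metrics(15.0)

for name, metrics in [("cutoff 13.5", low_cutoff), ("cutoff 15.0", high_cutoff)]:
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

On this toy data both cutoffs give the same accuracy, but the lower cutoff catches every malignant sample (higher recall) while the higher cutoff raises no false alarms (higher precision). Which one is preferable depends on the cost of each kind of mistake.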
%% Cell type:code id:018ad8ba tags:
``` python
TP = 7  # True positives
TN = 10  # True negatives
FP = 1  # False positives
FN = 2  # False negatives
TOTAL = TP + TN + FP + FN
print(f"Accuracy = {(TP + TN) / TOTAL = :.3f}")
print(f"Precision = {TP / (TP + FP) = :.3f}")
print(f"Recall = {TP / (TP + FN) = :.3f}")
```
%%%% Output: stream
Accuracy = (TP + TN) / TOTAL = 0.850
Precision = TP / (TP + FP) = 0.875
Recall = TP / (TP + FN) = 0.778
%% Cell type:code id:69c35288 tags:
``` python
# These metrics can (and should!) be calculated automatically using sklearn's
# scoring functions. They are calculated manually above to show the process,
# but you should get used to using these built in methods.
precision_score(
    y_true=biopsies_with_predictions['diagnosis'],
    y_pred=biopsies_with_predictions['predicted'],
    pos_label="malignant",  # we need to identify which value should
                            # be considered 'positive' for this metric
)
```
%%%% Output: execute_result
0.875
%% Cell type:markdown id:03cfa52c tags:
## Model Fitting
Machine learning models have an underlying mathematical form; the output prediction is given as a (sometimes complex) mathematical function of the input data. This mathematical function has **parameters** we need to set when choosing the model. Examples of these parameters are the gradient and intercept (2.5 and 38, respectively) of the life expectancy prediction model above, or the radius_mean cutoff value (14) in the biopsy diagnosis model.
For example, we might propose more general models than those we have briefly introduced above. Using two parameters $M$ and $C$, we could suggest models of the form:
$$
\text{Life expectancy} = M \times \text{Years of schooling} + C \text{ years}
$$
or for the cancer biopsy dataset, we could suggest a model using a single parameter $R$:
$$
\text{Predicted diagnosis} = \begin{cases}
\text{malignant, if radius_mean} \ge R \\
\text{benign, otherwise.} \\
\end{cases}
$$
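One way to read these two model forms is as functions whose behaviour is controlled by their parameters. The sketch below is an illustration only: the function names `predict_life_expectancy` and `predict_diagnosis` are made up for this example, and the input values are arbitrary.

```python
import numpy as np

def predict_life_expectancy(years_of_schooling, M, C):
    """Linear model: M is the gradient, C is the intercept (in years)."""
    return M * np.asarray(years_of_schooling, dtype=float) + C

def predict_diagnosis(radius_mean, R):
    """Threshold model: 'malignant' at or above the cutoff R, else 'benign'."""
    return np.where(np.asarray(radius_mean, dtype=float) >= R, "malignant", "benign")

# The specific models used earlier are recovered by fixing the parameter values.
life_predictions = predict_life_expectancy([10, 12], M=2.5, C=38)
diagnosis_predictions = predict_diagnosis([12.0, 14.0, 16.5], R=14)

print(life_predictions)
print(diagnosis_predictions)
```

Changing `M`, `C` or `R` gives a different model of the same form, which is what makes it possible to search over parameter values.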
When **fitting** a model to the data, we aim to automatically choose the parameter values ($M$, $C$, and $R$) which minimise our error rates. This process is a form of **optimisation**, and involves computing a **cost function** as a function of the model parameters. This cost function usually closely aligns with one of the error metrics discussed above. An optimisation algorithm finds the parameter values which minimise this cost function, hence giving the lowest error achievable by that model on the data it was fitted to.
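For the single-parameter biopsy model, the simplest possible optimiser is a brute-force scan: evaluate the cost at a grid of candidate cutoffs and keep the best one. The sketch below assumes a made-up dataset of ten samples and uses the misclassification rate (one minus accuracy) as the cost function; real fitting routines are usually far more efficient than this.

```python
import numpy as np

# Hypothetical toy data: radius values and whether each sample is malignant.
radius = np.array([10.0, 11.5, 12.0, 13.0, 13.5, 14.5, 15.0, 16.0, 17.5, 19.0])
is_malignant = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=bool)

def misclassification_rate(cutoff):
    """Cost function: fraction of samples the rule 'radius >= cutoff' gets wrong."""
    predicted_malignant = radius >= cutoff
    return np.mean(predicted_malignant != is_malignant)

# Evaluate the cost at each candidate value of the parameter R.
candidate_cutoffs = np.arange(10.0, 20.0, 0.5)
costs = np.array([misclassification_rate(r) for r in candidate_cutoffs])

# np.argmin returns the first minimum if several cutoffs tie for the lowest cost.
best_cutoff = candidate_cutoffs[np.argmin(costs)]
print(f"Best cutoff: {best_cutoff}, cost: {costs.min():.2f}")
```

Ties are common with a threshold model on a small dataset, because the cost only changes when the cutoff crosses a data point.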
The code below shows how the cost function varies for a simple one-parameter model for the regression task on the WHO dataset.
%% Cell type:code id:2e144e73 tags:
``` python
# Note, this is a laborious way to do this; it is intended as an illustration.
# In practice, we'll use automated methods to fit models (mostly in the sklearn
# library).
def prediction_error(m):
    """ Return the prediction error associated with the value
    of the parameter. Note that this model uses only a single
    parameter - the intercept C is fixed by the choice of M.
    """
```
This is an illustration of the **cost function** of this model: how the error measure varies with the choice of parameter value. We can see that the lowest MAE occurs at a parameter value of about $2.0$. Selecting this parameter value gives our **fitted model**. You'll see that this model has a slightly lower error (below $3.6$, where previously we saw $3.9$) than the choice we made earlier. The predictions made by this model are shown below.
The above examples outline the process of **training** a supervised machine learning model: fitting it to known data by choosing its parameters so that error is minimised. In practice, we need to split our data into training and testing sets before conducting the training process. The cost function for fitting the model is based on the training data, but we use the test data to assess how the fitted model performs and compute our final estimates of model error.
We'll discuss this further in class in week 2.
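As a preview, the split-then-fit workflow might look like the sketch below. It runs on synthetic data generated for this example (not the WHO dataset), and it uses a least-squares line fit from `np.polyfit` as a stand-in for the fitting step, which is an assumption rather than the method used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: a noisy straight-line relationship.
rng = np.random.default_rng(0)
schooling = rng.uniform(5, 20, size=40)
life_expectancy = 2.5 * schooling + 38 + rng.normal(0, 3, size=40)

# Hold back 25% of the samples for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    schooling, life_expectancy, test_size=0.25, random_state=0
)

# Fit the model parameters using the training data only.
gradient, intercept = np.polyfit(x_train, y_train, deg=1)

def mean_abs_error(x, y):
    """Mean absolute error of the fitted line on the given samples."""
    return np.mean(np.abs(gradient * x + intercept - y))

train_mae = mean_abs_error(x_train, y_train)
test_mae = mean_abs_error(x_test, y_test)

print(f"Training MAE: {train_mae:.2f} years")
print(f"Test MAE:     {test_mae:.2f} years")
```

The test error is the figure to report, because the test samples played no part in choosing the parameters.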
%% Cell type:markdown id:fc692817 tags:
## Exercises
Complete these before the studio in week 2, and submit the completed notebook through Moodle. We'll go over the results in class and complete some further activities to help your understanding.
%% Cell type:markdown id:3a91b1d6 tags:
### Exercise 1
Given the dataframe `ex1_who_with_predictions` below, compute the Mean Absolute Error for the predicted values of life expectancy. You can repeat the process previously shown, or find a function in `sklearn.metrics` to compute this for you.