Required files (download these from the Gitlab site [here](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods) into the same directory as the notebook on your computer):
The objective of this notebook is to help you understand some of the terminology, computations and methods behind supervised training of predictive machine learning models. Some of this content will be familiar from the last 3 weeks of semester one where you looked at regression and classification models. Here we take a step back to better understand the concepts behind the modelling methods we will focus on this semester. We'll also review these concepts throughout the semester as we explore further machine learning algorithm types.
%% Cell type:code id:6057e0ae tags:
``` python
# Remember these? Our usual package imports for handling data.
import numpy as np
import pandas as pd
import seaborn as sns

# Specialised functions for calculating prediction error rates.
from sklearn.metrics import precision_score
```
%% Cell type:markdown id:9afd497a tags:
## Supervised Learning

Simply put, **supervised learning** is a process of **training** a machine learning **model** based on a sampled dataset with known inputs and outputs. Two key types of supervised learning models are **regression** and **classification** models. We'll cover both this semester, and you've already seen some in semester one (linear regression and kNN classification).

Let's first look at some examples of datasets used for regression and classification tasks.
### WHO Life Expectancy Tables

The WHO tracks various metrics for countries as they relate to life expectancy and mortality rates of different populations. This dataset was originally found [here](https://www.kaggle.com/kumarajarshi/life-expectancy-who/). Here we'll consider just one input (education level) and one output (adult life expectancy). With this simple dataset we could build a model to predict average life expectancy based on education levels within a country. This is a regression task; the output variable is a continuous measure.
%% Cell type:code id:969a166a tags:
``` python
# Read the dataset into a dataframe.
who_data_2015 = (
    pd.read_csv("who-health-data.csv")    # Read in the csv data.
    .rename(columns=lambda c: c.strip())  # Clean up column names.
    .query("Year == 2015")                # Restrict the dataset to records from 2015.
)
```
### Wisconsin Breast Cancer Biopsies

This dataset measures various properties of breast cancer biopsy results, along with a diagnosis of whether the tumour being tested is malignant or benign. The dataset was originally found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).

There are many features here, but we'll consider just two: radius (size of the tissue sample) and texture (variation in colour across the surface). The diagnosis is a binary state (malignant/benign or positive/negative), so this is a classification dataset.
%% Cell type:code id:581e0329 tags:
``` python
# Read dataset into a dataframe.
wisconsin_cancer_biopsies = (
    pd.read_csv("kaggle-wisconsin-cancer.csv")
    # This tidies up the naming of results (M -> malignant, B -> benign).
    .assign(diagnosis=lambda df: df['diagnosis']
            .map({"M": "malignant", "B": "benign"})
            .astype('category')
    )
)

# Show the true diagnosis as a function of two variables.
```
Supervised learning can be thought of as selecting, from among many possible models and parameter settings, the one that best fits the data we have. Fundamental to this idea is having a useful way to measure error. We want to select from among candidate models the one which yields the best predictions; in other words, the one with the smallest measured error.
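As a minimal sketch of this idea (using made-up numbers rather than the WHO data), we can score two candidate "models" that each always predict a constant life expectancy, and keep whichever has the smaller mean absolute error:

``` python
import numpy as np

# Hypothetical true outputs for five sample points.
actual = np.array([65.0, 70.0, 72.0, 68.0, 75.0])

# Two candidate "models": each always predicts the same constant value.
candidates = {"predict 68": 68.0, "predict 74": 74.0}

# Score each candidate by its mean absolute error, then keep the best.
scores = {name: np.abs(actual - pred).mean() for name, pred in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(f"Best model: {best}")
```

Real model selection searches over far richer families of models, but the principle is the same: compute an error metric for each candidate and prefer the smallest.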
### Regression Error Metrics
Let's have a look at a simple set of predictions of life expectancy based on schooling in the WHO dataset. Here we've constructed a simple prediction where:

For each sample point in the dataset, we compute prediction error by finding the difference between the value predicted by our simple model and the actual value. Below these errors are plotted as a histogram. We can see there is a roughly even spread of over- and under-predictions.
%% Cell type:code id:217b3af6 tags:

``` python
# Compute errors (difference between predicted and correct values).
```
From these errors, we can compute a single statistic representing the prediction error of this model. This is typically the Mean Absolute Error: the average deviation of predictions from the true value. Mathematically:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, and there are $N$ samples in the **training set**.
%% Cell type:code id:e1bdbf41 tags:
``` python
# Compute mean absolute error.
mae_score = errors.abs().mean()
print(f"Mean absolute error is {mae_score:.1f} years")
```
%%%% Output: stream

Mean absolute error is 3.9 years
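To see the MAE formula in action on its own (with made-up values rather than the WHO dataset), we can compute it directly from a handful of predictions:

``` python
import pandas as pd

# Hypothetical predicted and true life expectancies for four countries.
predicted = pd.Series([70.0, 66.0, 81.0, 75.0])
actual = pd.Series([72.0, 65.0, 78.0, 75.0])

# MAE = (1/N) * sum of |prediction - truth|.
errors = predicted - actual
mae = errors.abs().mean()
print(f"Mean absolute error is {mae:.2f} years")
```

Note that the absolute value matters: the raw errors here (-2, 1, 3, 0) partly cancel if summed directly, which would understate how far off the predictions are.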
%% Cell type:markdown id:e74bc330 tags:
### Classification Error Metrics

In classification tasks, we break down several statistics based on the number of correct and incorrect predictions, and the number of those which correspond to positive and negative labels. Let's look at a small subset of the biopsy data to understand these metrics.
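As a sketch of what those counts look like (using made-up labels, and treating "malignant" as the positive class), we can tally true/false positives and negatives by hand, then compute precision with the `precision_score` function imported earlier:

``` python
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for eight biopsies.
actual = ["malignant", "benign", "malignant", "benign",
          "malignant", "benign", "benign", "malignant"]
predicted = ["malignant", "benign", "benign", "malignant",
             "malignant", "benign", "benign", "malignant"]

# Tally each combination of predicted label vs actual label.
tp = sum(p == "malignant" and a == "malignant" for p, a in zip(predicted, actual))
fp = sum(p == "malignant" and a == "benign" for p, a in zip(predicted, actual))
tn = sum(p == "benign" and a == "benign" for p, a in zip(predicted, actual))
fn = sum(p == "benign" and a == "malignant" for p, a in zip(predicted, actual))
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")

# Precision = TP / (TP + FP): of everything predicted positive, how much was right?
precision = precision_score(actual, predicted, pos_label="malignant")
print(f"Precision: {precision:.2f}")
```

Here three of the four predicted malignancies are correct, so precision is 0.75.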
%% Cell type:code id:6609e150 tags:
``` python
# Plot a small set of 20 sample points.
# .sample(20) selects 20 random points. Using the random_state parameter
# ensures that the selected points are always the same.
```
As an example, we develop a simple prediction here: all tests with a `radius_mean` value of less than 14 are considered benign, and anything larger is considered malignant. Below we plot a breakdown of the data by both its predicted label and its correct label.
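That threshold rule can be sketched directly. Here we apply it to a few made-up `radius_mean` values rather than the full biopsy dataframe:

``` python
import numpy as np

# Hypothetical radius_mean measurements for five biopsies.
radius_mean = np.array([11.2, 15.8, 13.9, 14.0, 19.5])

# Our simple rule: below 14 is benign, 14 or larger is malignant.
predicted = np.where(radius_mean < 14, "benign", "malignant")
print(predicted)
```

A single fixed threshold like this is crude, but it gives us a concrete set of predicted labels to compare against the true diagnoses when computing classification error metrics.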