Commit c3c92e21 authored by Simon Bowly

Week 2 activity

parent 5ceae43e
%% Cell type:markdown id:a3948254 tags:
# ADS1002 Week 2: Minimising errors to fit models
%% Cell type:code id:ce0cb9ab tags:
``` python
# Load the same datasets as in the introductory notebook.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Note: you can read data files directly from the internet
# using pd.read_csv(...)
base = "https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/raw/main/Machine-Learning/Supervised-Methods/"
who_data_2015 = (
    pd.read_csv(base + "who-health-data.csv")  # Read in the csv data.
    .rename(columns=lambda c: c.strip())  # Clean up column names.
    .query("Year == 2015")  # Restrict the dataset to records from 2015.
)
wisconsin_cancer_biopsies = (
    pd.read_csv(base + "kaggle-wisconsin-cancer.csv")
    # Replace the single-letter diagnosis codes with readable category labels.
    .assign(
        diagnosis=lambda df: df['diagnosis']
        .map({"M": "malignant", "B": "benign"})
        .astype('category')
    )
)
```
%% Cell type:markdown id:8c12f566 tags:
## Minimising errors (in two parameters) on the regression dataset
The code below produces a colour map showing how the mean absolute error (MAE) of a linear predictive model varies with the model's two parameters.
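For reference, the MAE can be written as follows. The symbols are notation introduced here for illustration: $y_i$ is the actual life expectancy in row $i$, $\hat{y}_i$ is the model's prediction for that row, and $n$ is the number of rows for which both values are available.
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$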
%% Cell type:code id:81c32b7d tags:
``` python
# Don't worry too much about the mechanics of the code in this cell; it computes
# model predictive error for various parameters so that we can produce a general
# plot to guide choices of the fitted parameters.
def prediction_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters. """
    who_with_predictions = (
        who_data_2015[["Schooling", "Life expectancy"]]
        # Add a column for our predictions.
        .assign(Predicted=lambda df: df["Schooling"] * gradient + intercept)
        .dropna()
    )
    errors = who_with_predictions['Life expectancy'] - who_with_predictions['Predicted']
    return errors.abs().mean()

# Build a 30 x 30 grid of candidate (gradient, intercept) pairs.
gradient_values, intercept_values = np.meshgrid(
    np.linspace(1.0, 3.0, 30),
    np.linspace(35, 80, 30),
)
# Evaluate the mean absolute error at every point on the grid.
errors = np.zeros(gradient_values.shape)
for i in range(errors.shape[0]):
    for j in range(errors.shape[1]):
        errors[i, j] = prediction_error(gradient_values[i, j], intercept_values[i, j])
```
%% Cell type:markdown id:60e46eca tags:
The cell below allows you to choose the gradient and intercept of a model in the form
$$
\text{Life expectancy} = \text{gradient} \times \text{Schooling} + \text{intercept}.
$$
This is a general linear model with two parameters, `gradient` and `intercept`. Some choices of these parameters lead to low errors, and others to high errors.
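To make the link between parameter choices and error concrete, here is a minimal sketch on a small made-up dataset. It is not the WHO data: the numbers and the helper name `mean_absolute_error` are assumptions for illustration only. The toy points lie exactly on the line with gradient 2 and intercept 50, so that choice gives zero error and any other choice gives a larger one.

```python
from statistics import fmean


def mean_absolute_error(x_values, y_values, gradient, intercept):
    """Average absolute gap between each y and the line gradient * x + intercept."""
    return fmean(
        abs(y - (gradient * x + intercept))
        for x, y in zip(x_values, y_values)
    )


# Hypothetical toy data (not the WHO dataset): points exactly on y = 2x + 50.
schooling = [8.0, 10.0, 12.0, 14.0]
life_expectancy = [2.0 * x + 50.0 for x in schooling]

# The parameters that generated the data reproduce it with no error.
exact_fit = mean_absolute_error(schooling, life_expectancy, gradient=2.0, intercept=50.0)
# A gradient that is too small under-predicts every point.
poor_fit = mean_absolute_error(schooling, life_expectancy, gradient=1.0, intercept=50.0)

print(exact_fit, poor_fit)
```

Real data will not lie exactly on a line, so the smallest achievable error is usually greater than zero.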
Try changing the `gradient` and `intercept` values, and re-run the cell to examine the results. Then discuss the following:
* Is it easy to determine the parameters which give the smallest possible error? Is there an 'obvious' best result?
* Can you comment on the shape of this 2D error function?
* Are there any techniques you can think of (which you may have learned in previous mathematics courses) which might
%% Cell type:code id:b2eee8b5 tags:
``` python
# SET THESE VALUES AND TEST THE OUTPUT.
gradient = 1.5
intercept = 66
# This code plots the coloured background showing how mean absolute error
# changes with these two parameters.
plt.contourf(gradient_values, intercept_values, errors)
plt.xlabel("Gradient")
plt.ylabel("Intercept")
plt.colorbar(label="Error")
# Plots our chosen parameters on the colour map as a red point.
plt.scatter(gradient, intercept, c='r')
# Generate predictions using the selected gradient and intercept.
who_with_predictions = (
    who_data_2015[["Schooling", "Life expectancy"]]
    # Add a column with a computed prediction based on years of schooling.
    .assign(Predicted=lambda df: df["Schooling"] * gradient + intercept)
    # Discard for the moment any row where we can't make a prediction
    # due to missing data.
    .dropna()
)
# Plot both the predicted and actual life expectancy results against years of schooling.
sns.relplot(data=who_with_predictions.set_index("Schooling"));
# Also display the prediction error.
prediction_error(gradient, intercept)
```
%%%% Output: execute_result
13.678034682080925
%% Cell type:markdown id:cae5a84b tags:
## How errors change with different datasets
The cell below plots the accuracy of the classification model (the number of correct predictions) for different values of the `radius` split parameter (as in the introductory notebook). Each time this cell is run, it selects two random subsets of the data and plots the accuracy curve for each subset. Investigate how the plots vary when you change the value of the parameter `N` (which controls the size of each sampled subset), and when you re-run the cell for the same value of `N`. What do you observe in general?
Questions to discuss:
* Do the two curves (in blue and orange) generally have the same or different peaks?
* Does the location (parameter value) of the peak vary at all with the sample size?
* Does the best possible accuracy vary much?
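As a rough illustration of what an accuracy curve measures, the sketch below applies the same kind of threshold rule to a handful of invented (radius, diagnosis) pairs and reports which candidate split scores highest. The data, the candidate splits, and the helper name `correct_predictions` are all assumptions made for this example. They are not taken from the Wisconsin dataset.

```python
def correct_predictions(samples, radius_split):
    """Count samples where the rule 'radius below the split means benign' matches the label."""
    return sum(
        ("benign" if radius < radius_split else "malignant") == label
        for radius, label in samples
    )


# Hypothetical (radius, diagnosis) pairs, invented for illustration only.
samples = [
    (10.0, "benign"),
    (12.0, "benign"),
    (13.0, "benign"),
    (15.0, "malignant"),
    (18.0, "malignant"),
    (20.0, "malignant"),
]

# Score a few candidate split values and pick the one with the most correct predictions.
candidate_splits = [9.0, 11.0, 14.0, 17.0, 21.0]
scores = {split: correct_predictions(samples, split) for split in candidate_splits}
best_split = max(scores, key=scores.get)

print(scores)
print("Best split:", best_split)
```

With a different set of samples the scores, and possibly the best split, would change. That is the effect the questions above ask you to look for.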
%% Cell type:code id:33c49166 tags:
``` python
# VARY THIS PARAMETER BETWEEN 10 and 200
N = 10
def model_correct_predictions(data_subset, radius_split_parameter):
    """ Return the number of correct predictions made by the model
    for the given parameter value. """
    data = data_subset.assign(
        # Predict "benign" when the mean radius is below the split value.
        predicted=lambda df: df['radius_mean'].lt(radius_split_parameter)
        .map({True: "benign", False: "malignant"})
    )
    return (data['diagnosis'] == data['predicted']).sum()

# Select two different random subsets of the data, given the size parameter.
train = wisconsin_cancer_biopsies.sample(N)
test = wisconsin_cancer_biopsies.sample(N)
test_parameter_values = np.arange(0, 30, 0.5)  # Values spread across the interval 0 -> 30, separated by 0.5
cost_function_results = pd.DataFrame({
    "Parameter Value": test_parameter_values,
    # Use a Python list comprehension to count the correct predictions for each proposed parameter value.
    "Number of Correct Predictions (Train)": [model_correct_predictions(train, m) for m in test_parameter_values],
    "Number of Correct Predictions (Test)": [model_correct_predictions(test, m) for m in test_parameter_values],
})
sns.relplot(data=cost_function_results.set_index("Parameter Value"), kind='line');
```