Commit 090bdfca authored by Simon Clarke

Published Neural Networks for Auto MPG datasets notebook.

parent 73034fae
%% Cell type:markdown id: tags:
# Regression using Neural Networks
%% Cell type:markdown id: tags:
In our previous notebooks on neural networks we have considered classification problems. Regression problems, where the output is a continuous value, can also be handled using neural networks. To demonstrate this we will use the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) to build a model to predict the fuel efficiency of 1970s and early 1980s automobiles. This dataset describes the fuel efficiency of many cars from that period using attributes such as the number of engine cylinders, engine displacement, horsepower, and weight.
We will show how linear regression can be implemented using neural networks, and then consider some simple nonlinear regression models.
%% Cell type:markdown id: tags:
## Contents
%% Cell type:markdown id: tags:
* Imports
* The Auto MPG Dataset
* Linear Regression
* Nonlinear Regression
* Model Performance
* Exercises
%% Cell type:markdown id: tags:
## Imports
%% Cell type:markdown id: tags:
We import the standard libraries. If you are running this on Google Colab, and `seaborn` cannot be found, then uncomment the following cell.
%% Cell type:code id: tags:
```
# !pip install -q seaborn
```
%% Cell type:code id: tags:
```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
```
%% Cell type:markdown id: tags:
We import `tensorflow` and `keras`, the `layers` module for setting up sequential models, and the `preprocessing` module, which provides a layer for normalizing data.
%% Cell type:code id: tags:
```
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
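# Note: on newer TensorFlow versions the preprocessing layers have been promoted out
# of `experimental`; if the import below fails, `preprocessing.Normalization` can be
# replaced by `layers.Normalization`.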
from tensorflow.keras.layers.experimental import preprocessing
```
%% Cell type:markdown id: tags:
## The Auto MPG Dataset
%% Cell type:markdown id: tags:
We first download and import the dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/) using pandas. The names can be found in the file [auto-mpg.names](https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names).
%% Cell type:code id: tags:
```
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(url, names=column_names,
                 na_values='?', comment='\t',
                 sep=' ', skipinitialspace=True)
```
%% Cell type:code id: tags:
```
df.sample(5)
```
%% Cell type:markdown id: tags:
First we check for missing values, which are denoted by `?` in the file, and which the import statement has converted to NaN.
%% Cell type:code id: tags:
```
df.isna().sum()
```
%% Cell type:markdown id: tags:
We drop those rows to keep this example simple.
%% Cell type:code id: tags:
```
df.dropna(inplace=True)
```
%% Cell type:markdown id: tags:
The `Origin` column is categorical, not numeric, so we first use a dictionary to map the numeric codes to the region of origin (`'USA'`, `'Europe'`, `'Japan'`) and then one-hot encode the column with `pd.get_dummies`.
%% Cell type:code id: tags:
```
df['Origin'] = df['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
```
%% Cell type:code id: tags:
```
df = pd.get_dummies(df, columns=['Origin'], prefix='', prefix_sep='')
df.sample(5)
```
%% Cell type:markdown id: tags:
We can view the summary statistics to see the range of each variable.
%% Cell type:code id: tags:
```
pd.set_option("display.precision", 2)
df.describe()
```
%% Cell type:markdown id: tags:
We can investigate the correlations between variables. The fuel efficiency (MPG) is strongly correlated with `Cylinders`, `Displacement`, `Horsepower` and `Weight`, and these four variables are also strongly correlated with each other.
%% Cell type:code id: tags:
```
corrs = df.corr() # calculate the correlation table
# as this is a symmetric table, set up a mask so that we only plot values below the main diagonal
mask = np.triu(np.ones_like(corrs, dtype=bool))
f, ax = plt.subplots(figsize=(10, 8)) # initialise the plots and axes
# plot the correlations as a seaborn heatmap, with a colourbar
sns.heatmap(corrs, mask=mask, center=0, annot=True, square=True, linewidths=.5)
# do some fiddling so that the top and bottom are not obscured
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5);
```
%% Cell type:markdown id: tags:
We can use `sns.pairplot()` to view the relationship between different features. What we now want to do is create a model which determines the fuel efficiency (MPG) as a function of the other features. It appears that fuel efficiency is approximately inversely proportional to the other variables.
%% Cell type:code id: tags:
```
sns.pairplot(df[['MPG', 'Cylinders', 'Horsepower', 'Displacement', 'Weight']],
             diag_kind='kde', height=2)
```
%% Cell type:markdown id: tags:
Since we want to predict `MPG`, this is our label, and the other variables are features. We can separate the data into the label and features, and then split both sets into testing and training sets.
%% Cell type:code id: tags:
```
from sklearn.model_selection import train_test_split

features = df.drop(['MPG'], axis=1)
labels = df['MPG']
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                     test_size=0.2,
                                                     random_state=42)
```
%% Cell type:markdown id: tags:
In the table of statistics it is apparent that the values of the features are widely distributed.
%% Cell type:code id: tags:
```
pd.set_option("display.precision", 2)
X_train.describe().loc[['mean', 'std']]
```
%% Cell type:markdown id: tags:
As with classification using neural networks, it is best practice to normalize features that use different scales and ranges. This ensures that techniques such as regularization, which we will consider later, can be applied uniformly.
There is no advantage to normalizing the one-hot features; it is done here only for simplicity.
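Normalization rescales each feature $x$ to $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the feature's mean and standard deviation over the training set, so that every feature has zero mean and unit variance.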
%% Cell type:markdown id: tags:
Previously we used the `mean` and `std` to normalize the variables. Here we will use the `keras` `preprocessing.Normalization` layer to build the preprocessing into the model.
The first step is to create the layer; `axis=-1` specifies that the normalization is applied along the last axis, i.e. to each feature independently.
%% Cell type:code id: tags:
```
normalizer = preprocessing.Normalization(axis=-1)
```
%% Cell type:markdown id: tags:
We then `.adapt()` it to the data, which calculates the mean and variance of each feature and stores them in the layer.
%% Cell type:code id: tags:
```
normalizer.adapt(np.array(X_train))
```
%% Cell type:code id: tags:
```
print(normalizer.mean.numpy())
```
%% Cell type:markdown id: tags:
When the layer is called it returns the input data, with each feature independently normalized.
%% Cell type:code id: tags:
```
firstrow = np.array(X_train[:1])
print('First example:', firstrow)
print('Normalized:', normalizer(firstrow).numpy())
```
%% Cell type:markdown id: tags:
## Linear Regression
%% Cell type:markdown id: tags:
Before building a nonlinear neural network model, we will build a model which implements linear regression. This corresponds to a single perceptron with a continuous output.
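In other words, the model computes a single weighted sum of the (normalized) inputs plus a bias, $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$, and the weights $\mathbf{w}$ and bias $b$ are learned by minimizing the loss.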
%% Cell type:markdown id: tags:
First we create a convenience function to build and compile a simple sequential neural network. This is essentially the same as the models we have previously considered for classification, except the final layer is a `Dense` layer with a single unit and no activation, giving a continuous output, and the loss function is the mean absolute error.
For these models we can specify the input normalization layer using `norm`, the number of hidden layers and the number of neurons in each of these layers, and the learning rate used by SGD.
%% Cell type:code id: tags:
```
def build_model_regress(norm, n_hidden=1, n_neurons=30,
                        learning_rate=0.01):
    """Build and compile a simple sequential model with n_hidden
    layers and n_neurons in each layer."""
    model = keras.models.Sequential()
    model.add(norm)
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(loss='mean_absolute_error', optimizer=optimizer)
    return model
```
%% Cell type:markdown id: tags:
We start by creating a linear regression model which takes the input `Horsepower` and aims to predict `MPG`.
For our default model the first thing we need to do is create the horsepower `Normalization` layer. This just corresponds to creating an array with the `Horsepower` values from the training set, initializing the normalization layer so that the input is a single scalar feature, and then adapting the normalizer to calculate the mean and standard deviation of `Horsepower`.
%% Cell type:code id: tags:
```
horsepower = np.array(X_train['Horsepower'])
horsepower_normalizer = preprocessing.Normalization(input_shape=[1,], axis=None)
horsepower_normalizer.adapt(horsepower)
```
%% Cell type:markdown id: tags:
This layer can then be used as the input for our model. Since we are doing linear regression, we want no hidden layers and can set the number of neurons to 0. We can then output a summary of the model.
%% Cell type:code id: tags:
```
horsepower_model = build_model_regress(horsepower_normalizer, n_hidden=0,
                                       n_neurons=0, learning_rate=0.03)
horsepower_model.summary()
```
%% Cell type:markdown id: tags:
Now that the model is configured, we use `Model.fit()` to train it. Here 80% of the training data is used for fitting and 20% for validation. The evolution of the metrics for the model is stored in `history`.
%% Cell type:code id: tags:
```
%%time
history = horsepower_model.fit(
    X_train['Horsepower'], y_train,
    epochs=200,
    # suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split=0.2)
```
%% Cell type:markdown id: tags:
We will create a simple function for plotting the history of the model.
%% Cell type:code id: tags:
```
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 20])
    plt.xlabel('Epoch')
    plt.ylabel('Error [MPG]')
    plt.legend()
    plt.grid(True)
```
%% Cell type:markdown id: tags:
For this model the training loss and the validation loss decrease steadily, with the training loss always less than the validation loss, as expected.
%% Cell type:code id: tags:
```
plot_loss(history)
```
%% Cell type:markdown id: tags:
We evaluate the results and store them in a structure for comparison with the other models.
%% Cell type:code id: tags:
```
test_results = {}
test_results['lin_horsepower_model'] = horsepower_model.evaluate(
    X_test['Horsepower'],
    y_test, verbose=0)
```
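%% Cell type:markdown id: tags:
Because this is linear regression, the fitted model has only a single weight and bias, which we can inspect directly. This is a minimal sketch, assuming the model consists of the normalization layer followed by one `Dense` layer; note that the weight acts on the *normalized* horsepower.
%% Cell type:code id: tags:
```
# Inspect the fitted line: layers[1] is the single Dense layer, whose kernel and
# bias act on the normalized horsepower values.
w, b = horsepower_model.layers[1].get_weights()
print('slope (MPG per standard deviation of horsepower):', w.flatten()[0])
print('intercept (predicted MPG at the mean horsepower):', b[0])
```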
%% Cell type:markdown id: tags:
Since this is a single-variable linear regression, the output corresponds to a linear relationship, and we can compare the model predictions against the actual values.
%% Cell type:code id: tags:
```
x = tf.linspace(0.0, 250, 251)
y = horsepower_model.predict(x)
```
%% Cell type:markdown id: tags:
We define another convenience function for comparing the predictions.
%% Cell type:code id: tags:
```
def plot_horsepower(x, y):
    plt.scatter(X_train['Horsepower'], y_train, label='Data')
    plt.plot(x, y, color='k', label='Predictions')
    plt.xlabel('Horsepower')
    plt.ylabel('MPG')
    plt.legend()
```
%% Cell type:markdown id: tags:
The predictions are reasonable for mid-range horsepower, but fail at the upper and lower limits.
%% Cell type:code id: tags:
```
plot_horsepower(x, y)
```
%% Cell type:markdown id: tags:
To implement multi-dimensional linear regression, we now just need to pass in the normalization layer which was defined earlier for the whole dataset. Now the input shape corresponds to 9 features.
%% Cell type:code id: tags:
```
linear_model = build_model_regress(normalizer, n_hidden=0, n_neurons=0,
                                   learning_rate=0.03)
linear_model.summary()
```
%% Cell type:markdown id: tags:
We can now `fit` the model using the full training dataset.
%% Cell type:code id: tags:
```
%%time
history = linear_model.fit(
    X_train, y_train,
    epochs=200,
    # suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split=0.2)
```
%% Cell type:markdown id: tags:
Plotting the history demonstrates that using all the inputs achieves a much lower training and validation error than the `horsepower` model.
%% Cell type:code id: tags:
```
plot_loss(history)
```
%% Cell type:markdown id: tags:
We collect the results on the test set, for later comparison.
%% Cell type:code id: tags:
```
test_results['lin_model'] = linear_model.evaluate(
    X_test, y_test, verbose=0)
```
%% Cell type:markdown id: tags:
## Nonlinear Regression
%% Cell type:markdown id: tags:
The previous section implemented linear models for single and multiple inputs.
This section implements single-input and multiple-input neural network models. The code is essentially the same except the model is expanded to include hidden nonlinear layers.
These models will contain a few more layers than the linear model:
* The normalization layer.
* Two hidden, nonlinear, `Dense` layers using the `relu` nonlinearity.
* A linear single-output layer.
%% Cell type:markdown id: tags:
We start with a model for the single input "Horsepower". Note that the only difference is the number of hidden layers, and the number of neurons in these layers. However, there are now significantly more trainable parameters.
%% Cell type:code id: tags:
```
nn_horsepower_model = build_model_regress(horsepower_normalizer, n_hidden=2,
                                          n_neurons=64, learning_rate=0.03)
nn_horsepower_model.summary()
```
%% Cell type:markdown id: tags:
Training of the model is the same as before.
%% Cell type:code id: tags:
```
%%time
history = nn_horsepower_model.fit(
    X_train['Horsepower'], y_train,
    validation_split=0.2,
    verbose=0, epochs=200)
```
%% Cell type:markdown id: tags:
This model does only slightly better than the linear horsepower model, but its initial convergence is much more rapid.
%% Cell type:code id: tags:
```
plot_loss(history)
```
%% Cell type:markdown id: tags:
Plotting the predictions as a function of `Horsepower`, we now see this model takes advantage of the nonlinearity provided by the hidden layers.
%% Cell type:code id: tags:
```
x = tf.linspace(0.0, 250, 251)
y = nn_horsepower_model.predict(x)
plot_horsepower(x, y)
```
%% Cell type:markdown id: tags:
We collect the results on the test set for later comparison.
%% Cell type:code id: tags:
```
test_results['nn_horsepower_model'] = nn_horsepower_model.evaluate(
    X_test['Horsepower'], y_test,
    verbose=0)
```
%% Cell type:markdown id: tags:
This process can be repeated using all the inputs, which slightly improves the performance on the validation dataset.
%% Cell type:code id: tags:
```
nn_model = build_model_regress(normalizer, n_hidden=2, n_neurons=64,
                               learning_rate=0.03)
nn_model.summary()
```
%% Cell type:code id: tags:
```
%%time
history = nn_model.fit(
    X_train, y_train,
    validation_split=0.2,
    verbose=0, epochs=200)
```
%% Cell type:code id: tags:
```
plot_loss(history)
```
%% Cell type:markdown id: tags:
Again we collect the results on the test set.
%% Cell type:code id: tags:
```
test_results['nn_model'] = nn_model.evaluate(X_test, y_test, verbose=0)
```
%% Cell type:markdown id: tags:
## Model Performance
%% Cell type:markdown id: tags:
Now that all the models are trained we can compare their performance. Not surprisingly, as the complexity of the model increases, the mean absolute error decreases. This suggests that the final model does not overfit excessively.
%% Cell type:code id: tags:
```
pd.DataFrame(test_results, index=['Mean absolute error [MPG]']).T
```
%% Cell type:markdown id: tags:
Finally, we can use `predict` to compare the predicted values for the testing set against the actual values.
%% Cell type:code id: tags:
```
test_predictions = nn_model.predict(X_test).flatten()
plt.scatter(y_test, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
lims = [0, 50]
plt.xlim(lims)
plt.ylim(lims)
plt.plot(lims, lims);
```
%% Cell type:markdown id: tags:
This suggests the model predicts the fuel efficiency reasonably well, so we can save it for later use.
%% Cell type:code id: tags:
```
nn_model.save('nn_model.h5')  # save in HDF5 format
```
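%% Cell type:markdown id: tags:
As a quick check (a minimal sketch, assuming the file name used above), the saved model can be reloaded with `keras.models.load_model` and should reproduce the test error reported earlier.
%% Cell type:code id: tags:
```
# Reload the saved model and re-evaluate it on the test set.
reloaded_model = keras.models.load_model('nn_model.h5')
reloaded_model.evaluate(X_test, y_test, verbose=0)
```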
%% Cell type:markdown id: tags:
## Exercises
%% Cell type:markdown id: tags:
For these exercises we will investigate regularization techniques to cope with overfitting in the full nonlinear model. There are two standard techniques for dealing with overfitting.
The first is to use L2 (Ridge) or L1 (Lasso) regularization on each layer. These add a penalty term to the objective function which is proportional to the square or the absolute value of the weights. The aim is to keep the weights small (L2) or to drive many of them towards zero (L1). This is analogous to what was previously considered with linear and logistic regression.
The second method is to use dropout layers. In a dropout layer, a randomly chosen percentage of the nodes is ignored at each training iteration. This reduces the sensitivity of the network to the training set and hence creates a more robust model.
The cell below sketches how these two mechanisms appear as Keras layers; the function in the following cell then generalizes the one created earlier to include L2 regularization and dropout layers.
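%% Cell type:markdown id: tags:
This is an illustrative sketch only, not the exercise solution: the two mechanisms appear in `keras` as a `kernel_regularizer` argument on a `Dense` layer and as a separate `Dropout` layer. The penalty weight `0.01` and dropout rate `0.2` are arbitrary example values.
%% Cell type:code id: tags:
```
# Example only: a Dense layer with an L2 weight penalty, followed by a Dropout
# layer that randomly ignores 20% of its inputs during each training step.
example_layers = [
    layers.Dense(64, activation="relu",
                 kernel_regularizer=keras.regularizers.l2(0.01)),
    layers.Dropout(0.2),
]
```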
%% Cell type:code id: tags: