Unverified Commit bb2ae0b6 authored by Simon Bowly's avatar Simon Bowly
Browse files

ADS1002 logistic regression.

parent 627e6093
%% Cell type:markdown id: tags:
# Logistic Regression
%% Cell type:markdown id: tags:
* Topic: Supervised machine learning
* Unit: ADS1002
* Level: Beginner
* Authors: Simon Clarke, Simon Bowly
* Version: 3.1
In this lesson we will discuss the logistic regression classifier model. This is closely related to linear regression, but predicts probabilities for classification, instead of continuous values. We will use the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is built in to `seaborn` and `sklearn`.
We will firstly explain how logistic regression can be used to classify binary data, by considering just one species of iris. We will then explain how the accuracy of logistic regression can be evaluated using a range of methods. The exercise will then consider multiclass regression.
First, import various libraries which we require.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns;
import pandas as pd
```
%% Cell type:markdown id: tags:
## Exploratory Data Analysis
%% Cell type:markdown id: tags:
We now import the iris data set and undertake some exploratory data analysis. The data frame has 150 rows and 5 columns. Viewing the header shows there are four features: sepal_length, sepal_width, petal_length and petal_width. These are all in cm. There is one target variable, which is the species of iris.
%% Cell type:code id: tags:
``` python
iris = sns.load_dataset('iris') # load the dataset from seaborn
print('Shape of the iris dataset is',iris.shape) # display the shape of the data
iris.head() # display the first few lines
```
%%%% Output: execute_result
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
%% Cell type:markdown id: tags:
By viewing the statistics we see that there are no missing entries and the standard deviation of each column is fairly similar. However, we will normalise the data before modelling.
%% Cell type:code id: tags:
``` python
iris.describe() # show the statistics of the numerical columns
```
%%%% Output: execute_result
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
%% Cell type:markdown id: tags:
The target field is a string, and has three unique values.
%% Cell type:code id: tags:
``` python
iris['species'].unique() # show the unique values of the species column
```
%%%% Output: execute_result
array(['setosa', 'versicolor', 'virginica'], dtype=object)
%% Cell type:markdown id: tags:
Plotting the classification against the first two features, it is apparent that setosa is fairly well separated in feature space, whereas versicolor and virginica have a blurred boundary. Note that the boundary may differ using other features. We would expect that classifiers should be able to easily classify setosa, though the other two species may be a bit more difficult. You can experiment boundaries by changing the features or using `sns.pairplot`.
%% Cell type:code id: tags:
``` python
# plot the first two fields of the iris dataset and classify the points based on the species
# fit_reg=False indicates don't fit a regression model
# hue='species' indicates to classify the points based on the species field
sns.lmplot(x="sepal_length",y="sepal_width",hue='species',data=iris,fit_reg=False);
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Since this is a multiclass classification problem, we need to convert the label column to binary columns. This is equivalent to one-hot encoding.
%% Cell type:code id: tags:
``` python
#one-hot encoding
categories = iris.species.unique() # create a vector with the category names,
lencat = len(categories) # store the length of the categories vector to use later
# one hot encoding, create three new columns which are true for that particular species, and false if not
for species in categories: # loop over all the labels in categories
# + concatenates two strings
iris['species_'+species] = pd.Series(iris['species']==species).astype(int)
iris.describe() # show the statistics of the nunmerical features of the dataframe
```
%%%% Output: execute_result
sepal_length sepal_width petal_length petal_width species_setosa \
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333 0.333333
std 0.828066 0.435866 1.765298 0.762238 0.472984
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 0.000000
75% 6.400000 3.300000 5.100000 1.800000 1.000000
max 7.900000 4.400000 6.900000 2.500000 1.000000
species_versicolor species_virginica
count 150.000000 150.000000
mean 0.333333 0.333333
std 0.472984 0.472984
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 1.000000 1.000000
max 1.000000 1.000000
%% Cell type:markdown id: tags:
Now we create a features table XX which comprises the first four columns of the iris data set, and a target series Y which is the binary column we have just created to identify whether or not the species is versicolor. Here we will used the normalized version of the features matrix X.
%% Cell type:code id: tags:
``` python
# the table XX will have the first 3 columns (0 to 3) of iris, and Y will have the binary classification column
XX, Y = iris[iris.columns[:4]], iris.species_versicolor
X = (XX-XX.mean())/XX.std() # create a new feature matrix for analysis which has mean 0 and standard deviation 1
```
%% Cell type:markdown id: tags:
Then we split these data sets into training and testing sets.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split # import the splitting method from sklearn
# split the data into 80% training and 20% testing, random_state=0 ensures that the results are repeatable
X_train,X_test,y_train,y_test=train_test_split(X,Y,train_size=0.8,random_state=0)
```
%% Cell type:markdown id: tags:
## Logistic Regression
%% Cell type:markdown id: tags:
We can now introduce the sigmoid or logistic function. This is a defined as
$$ f(x) = \displaystyle{\frac{1}{1+e^{-x}}}, $$
and is a smooth, one-to-one (every x value gives a unique y value) function with the domain $(-\infty,\infty)$ and range $(0,1)$. If x is negative then the logistic function is less than 0.5 but greater than 0, while if x is positive the logistic function is greater than 0.5 but less than 1.
%% Cell type:code id: tags:
``` python
xa = np.linspace(-6,6) # create a linear array of x values between -6 and 6
plt.plot(xa,1/(1+np.exp(-xa))); # plot the logistic function
plt.xlabel("x") # xlabel
plt.ylabel("y") # ylabel
plt.title("Logistic function y = 1/(1+exp(-x))"); # title
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Logistic regression assumes that the data is described by m instances or points, and at each of these there are n features. At each point from i=1,2,...,m, the probability of the data being classified as true or false is modelled by
$$ y_i = f(c_0 + c_1 x_{i,1} + c_2 x_{i,2} + \cdots + c_n x_{i,n}), $$
where $f$ is the **logistic** function. The algorithm then aims to calculate the coefficients $c_1$, $c_2$,..., $c_n$ and the intercept $c_0$. For linear regression the coefficients were calculated by minimizing the sum of the square of the errors. This is equivalent to maximimizing the likelihood of the observed data, assuming the data points are distributed with mean 0 and some standard deviation $\sigma$. Therefore, the observed data is the most likely data. For logistic regression these two statements are no longer equivalent, however, the coefficients can be calculated by again maximizing the likelihood of the observed data. This is done by using optimization algorithms.
The process for logistic regression using `sklearn` is the same as the other models we have so far considered, however the options change when instantiating the model. We will use the default optimisation solver `lbfgs` to fit the data and show the model partameters.
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression # import the LogisticRegression model
# instantiate the model (using the default parameters)
# penalty='none' implies no regularization (to be covered next week) and solver='lbfgs' is the default solver
# different solvers can be used, dependent on the type of penalties that are implemented
logreg = LogisticRegression(solver='lbfgs',penalty='none')
logreg.fit(X_train,y_train) # fit the training data to the model
print('Model coefficients are',np.round(logreg.coef_,3)) # print the model coefficients c1,...,c4
print('Model intercept is',np.round(logreg.intercept_,3)) # print the model intercept c0
```
%% Cell type:markdown id: tags:
Next we investigate the probabilities that are output. The second column, which is the probability of correctly picking the species, is the output of the logistic function. Since the problem is binary, the first column is just 1 minus the second column.
%% Cell type:code id: tags:
``` python
y_preda = logreg.predict_proba(X_test) # calculate the probabilities for the test features
# print out the probability table with a header
print('Probability table for testing set is:')
print(y_preda)
```
%% Cell type:markdown id: tags:
Using a classification threshold of 0.5, i.e., $P>0.5$ indicates true and $P<0.5$ indicates false, the probabilities are then used to predict the target test values, which can be compared against the actual target test values. We can see there are correct predictions at (1,1) (true positives) and (0,0) (true negative), but there are also false positives at (1,0) and false negatives at (0,1). False positives indicates that the model incorrectly predicts a positive value, similarly false negatives indicates that the model incorrectly predicts a negative value.
%% Cell type:code id: tags:
``` python
y_pred=logreg.predict(X_test) # calculate the predicted values of the model for the test features
plt.scatter(y_pred,y_test) # plot the predicted values against the actual test values
plt.xlabel('Predicted') # xlabel
plt.ylabel('Actual') # ylabel
plt.title('Logistic Regression'); # add a title
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
To determine the number of correct predictions, false positives and false negatives, we can construct a confusion matrix. From the confusion matrix it can be seen that there are 15 true negatives, 5 true positives, 2 false positives where versicolor was incorrectly predicted, and 8 false negatives, where the other species was incorrectly predicted. The false positive and negatives will typically relate to the virginica species, rather than setosa.
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix # import the confusion matrix function
cnf_matrix = confusion_matrix(y_test, y_pred) # create a confusion matrix for our actual and predicted values
# create a data frame from the confusion matrix with the column and row names being the class_names
class_names=['other', 'versicolor'] # names of the binary classes for plotting
cmatrix = pd.DataFrame(cnf_matrix,columns=class_names,index=class_names)
f, ax = plt.subplots(figsize=(7,6)) # initialise the plots and axes
sns.heatmap(cmatrix, annot=True, linewidths=.5) # plot the confusion matrix as a heatmap
plt.title('Confusion matrix') # add a title
plt.ylabel('Actual label') # add a ylabel
plt.xlabel('Predicted label') # add a xlabel
# adjust the bottom and top of the figure, so we can view all of it
bottom, top = ax.get_ylim() # get the y axis limits
ax.set_ylim(bottom + 0.5, top - 0.5); # adjust the y axis limits
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
We can calculate various scores to determine the accuracy of the classifier. The accuracy score is the number of correct scores divided by the number of samples. The precision score is the number of true positives divided by the true positives plus false positives. The recall score is the number of true positives divided by the true positives plus false negatives. The precision and recall scores give an indication of how well the algorithm is able to pick positive samples, whereas the accurary gives an indication of how well the algorithm can predict correct samples overall.
%% Cell type:code id: tags:
``` python
from sklearn.metrics import accuracy_score, precision_score, recall_score # import the score functions
print("Accuracy:",np.round(accuracy_score(y_test, y_pred),3)) # calculate and print the accuracy score
print("Precision:",np.round(precision_score(y_test, y_pred),3)) # calculate and print the precision score
print("Recall:",np.round(recall_score(y_test, y_pred),3)) # calculate and print the recall score
```
%% Cell type:markdown id: tags:
To further quantify the accurary of the classifier we can construct a Response Operating Characteristic (ROC) curve, which plots the false positive rate (FPR) against the true positive rate (TPR) as the classification threshold is changed. The TPR is the same as the precision score, while the FPR is the number of false positives divided by the false positives plus true negatives. For a perfect classifier we would expect that TPR=1 and FPR=0. We have considered up until this point a classification threshold of 0.5. If the classification threshold is very high (close to 1), then we are not going to flag any positive respones, so TPR=0 and FPR=0. If the classification threshold is very low (close to 0), then we will only flag positive responses and TPR=1 and FPR=1. Hence if our classifier is good, as the threshold changes the ROC curve will hug the left and top boundaries of the figure. If this occurs the area under the ROC curve (AUC) will approach 1. Therefore classifiers can be compared for a particular problem by measuring the AUC. Any classifier which is below the line y=x and has AUC less than 0.5 will be useless.
%% Cell type:code id: tags:
``` python
# import the functions to calculate the parameters for the ROC curve and the AUC
from sklearn.metrics import roc_curve, roc_auc_score
y_pred_proba = logreg.predict_proba(X_test)[::,1] # extract the second column of the model probabilities
# calculate the false positive and true positive rates as the threshold is varied, we don't use thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba) # calculate the area under the ROC curve (AUC)
# plot the FPR vs TPR and format label with AUC to 3 decimal places
plt.plot(fpr,tpr,label="Logistic Regression, auc = %0.3f " % auc)
plt.plot([0,1],[0,1],'k--') # plot x = y for comparison
plt.plot([0, 0, 1], [0, 1, 1], 'g--', label="'Perfect' Classifier")
plt.xlabel('False Positive Rate') # add xlabel
plt.ylabel('True Positive Rate') # add ylabel
plt.title('ROC curve') # add title
plt.legend(loc='best'); # add legend
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Exercise
%% Cell type:markdown id: tags:
In this exercise you are going to construct a simple one vs rest multiclass classifier based on the probabilities for each of the binary problems for the three iris species. A one-vs-rest classification scheme constructs multiple models (one for each class) where each model predicts a binary classification and combines the results to come up with a single prediction.
For this problem, only use the first two features, i.e., sepal width and sepal length.
The steps for this multiclass classifier are:
1. Create a new features dataframe only using the first two columns of the iris features.
2. For each of the iris categories, model the data using Logistic Regression and calculate the probabilities for the testing set of the category being correct. Store these probabilites in an array. For each category you will need to re-do the train-test split, but make sure you use the same random state. Alternatively, split the data set initially with all three labels, and then work on each category. The necessary one-hot encoding has already been performed.
3. For each instance in feature space, choose the category with the highest probability. The function `numpy.argmax()` will be useful.
4. Plot the final classification for the testing set in feature space, with colours based on the predicted category.
Q: After completing the code below and inspecting the results, comment on which class(es) appear most difficult to predict accurately. Can you suggest any improvements to the model which might result in better accuracy?
A:
%% Cell type:code id: tags:
``` python
# Part 1
```
%% Cell type:code id: tags:
``` python
# Part 2
```
%% Cell type:code id: tags:
``` python
# Part 3
```
%% Cell type:code id: tags:
``` python
# Part 4
```
%% Cell type:code id: tags:
``` python
```
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment