Commit a3759901 authored by Simon Bowly

ADS1002 - week 7 errors and explanations notebook.

parent 97a8800a
%% Cell type:markdown id:55078a6a tags:
# Manipulating DataFrames
Some examples to help understand common errors in dataframes, and the difference between a dataframe and a series.
%% Cell type:code id:2b33c47d tags:
``` python
# Starting with a simple example dataset we've seen before: the iris dataset.
import pandas as pd
import seaborn as sns
iris_dataframe = sns.load_dataset("iris")
iris_dataframe.sample(10)
```
%%%% Output: execute_result
sepal_length sepal_width petal_length petal_width species
78 6.0 2.9 4.5 1.5 versicolor
117 7.7 3.8 6.7 2.2 virginica
88 5.6 3.0 4.1 1.3 versicolor
30 4.8 3.1 1.6 0.2 setosa
72 6.3 2.5 4.9 1.5 versicolor
60 5.0 2.0 3.5 1.0 versicolor
135 7.7 3.0 6.1 2.3 virginica
41 4.5 2.3 1.3 0.3 setosa
75 6.6 3.0 4.4 1.4 versicolor
56 6.3 3.3 4.7 1.6 versicolor
%% Cell type:code id:2f3d77a9 tags:
``` python
# Q1 -> what's the result of this command?
# A -> This gives a series (a single column).
# Note that a dataframe is a collection of series (one for each column)
# which are all aligned on the same index.
iris_dataframe['sepal_length']
```
%%%% Output: execute_result
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
...
145 6.7
146 6.3
147 6.5
148 6.2
149 5.9
Name: sepal_length, Length: 150, dtype: float64
%% Cell type:code id:41f4ec0a tags:
``` python
# Q2 -> what's the result of this command?
# A -> This gives a two-column dataframe.
iris_dataframe[['sepal_length', 'sepal_width']]
```
%%%% Output: execute_result
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
4 5.0 3.6
.. ... ...
145 6.7 3.0
146 6.3 2.5
147 6.5 3.0
148 6.2 3.4
149 5.9 3.0
[150 rows x 2 columns]
%% Cell type:code id:ae132736 tags:
``` python
# Q3 -> what's the result of this command?
# A -> This gives a one-column dataframe.
# Notes: in general this column-selection syntax can be broken down as:
# iris_dataframe[...something...]
# The outer set of square brackets indicates that we are selecting some data.
# In python this is referred to as indexing (or subscripting).
# If 'something' is just one string, we get back one column as a series.
# If 'something' is a list of strings, we get back a dataframe with the
# selected columns.
# If this list has only one entry, we still get a dataframe, it just has
# only one column.
# So, the two square brackets [[...]] mean we are selecting with a list: the
# inner brackets build the list (here of length one).
iris_dataframe[['sepal_length']]
```
%%%% Output: execute_result
sepal_length
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
.. ...
145 6.7
146 6.3
147 6.5
148 6.2
149 5.9
[150 rows x 1 columns]
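To make the series/dataframe distinction from Q1 to Q3 concrete, here is a minimal sketch. It uses a tiny hand-made dataframe (an assumption, so that it runs without downloading the iris data) and simply checks the type and shape of each selection.

```python
import pandas as pd

# A small stand-in for the iris data (made-up subset, not the real dataset).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
})

one_column_series = df["sepal_length"]    # a string -> Series
one_column_frame = df[["sepal_length"]]   # a list with one string -> DataFrame

print(type(one_column_series).__name__, one_column_series.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
```

The series has a one-dimensional shape, while the one-column dataframe keeps a second dimension of size one.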
%% Cell type:code id:f34c0121 tags:
``` python
# Q4 -> what's the result of this command?
# A -> KeyError. A KeyError signals a failed lookup: pandas couldn't find the
# column name. Note the error message 'not in index'; pandas considers both
# the index (row labels) and columns (column labels) as 'indexes' of some sort,
# so an error message 'not in index' may indicate a failed column lookup.
iris_dataframe[['sepal_lingth', 'sepal_width']]
```
%%%% Output: error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_9573/2938404256.py in <module>
4 # the index (row labels) and columns (column labels) as 'indexes' of some sort,
5 # so an error message 'not in index' may indicate a failed column lookup.
----> 6 iris_dataframe[['sepal_lingth', 'sepal_width']]
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3459 if is_iterator(key):
3460 key = list(key)
-> 3461 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3462
3463 # take() does not accept boolean indexers
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
KeyError: "['sepal_lingth'] not in index"
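One way to catch a mis-spelled column before pandas raises the KeyError is to compare the requested names against `df.columns`. The sketch below is only an illustration: the `wanted` list and the use of the standard-library `difflib.get_close_matches` to suggest a likely spelling are my own additions, not something the notebook relies on.

```python
import difflib

import pandas as pd

# An empty dataframe with the iris column names is enough for a name check.
df = pd.DataFrame(
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
)

wanted = ["sepal_lingth", "sepal_width"]  # first name is deliberately mis-spelled

# Names that are not column labels of the dataframe.
missing = [name for name in wanted if name not in df.columns]

# For each missing name, look for the closest existing column name.
suggestions = {
    name: difflib.get_close_matches(name, list(df.columns), n=1)
    for name in missing
}

print(missing)
print(suggestions)
```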
%% Cell type:code id:8991f60a tags:
``` python
# Note that we still get KeyErrors in the next two cases, but sometimes
# the error message is slightly different (in this case, it is actually
# more informative!).
iris_dataframe[['sepal_lingth', 'sepal_wedth']]
```
%% Cell type:code id:49073fc1 tags:
``` python
# Another error... one thing to note: you can mostly ignore the initial part
# of the error message which refers to pandas internal code. This is called
# a stack trace. In some cases it will be useful in helping you debug your own
# code, but for simple one-line operations like this, the stack trace just
# looks at pandas internal code. To figure out what has gone wrong in this case
# focus on the error message itself (right at the bottom of all this output).
iris_dataframe[['sepal_lingth', 'sepal_width']]
```
%% Cell type:code id:b9370647 tags:
``` python
# This piece of code is a bit contrived, but here is a case where looking
# at the stack trace is helpful.
# The very first part of the trace refers to the code in our notebook cell:
#
# 1 iris_dataframe[['sepal_length', 'sepal_width']]
# 2 iris_dataframe[['petal_length', 'sepal_width']]
# ----> 3 iris_dataframe[['sepal_lingth', 'sepal_width']]
# 4 iris_dataframe[['sepal_length', 'petal_width']]
# 5 iris_dataframe[['sepal_length', 'petal_length']]
#
# The arrow marks the 3rd statement (the real trace also counts comment lines).
iris_dataframe[['sepal_length', 'sepal_width']]
iris_dataframe[['petal_length', 'sepal_width']]
iris_dataframe[['sepal_lingth', 'sepal_width']]
iris_dataframe[['sepal_length', 'petal_width']]
iris_dataframe[['sepal_length', 'petal_length']]
```
%%%% Output: error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_9573/774806270.py in <module>
13 iris_dataframe[['sepal_length', 'sepal_width']]
14 iris_dataframe[['petal_length', 'sepal_width']]
---> 15 iris_dataframe[['sepal_lingth', 'sepal_width']]
16 iris_dataframe[['sepal_length', 'petal_width']]
17 iris_dataframe[['sepal_length', 'petal_length']]
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3459 if is_iterator(key):
3460 key = list(key)
-> 3461 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3462
3463 # take() does not accept boolean indexers
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
KeyError: "['sepal_lingth'] not in index"
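If the stack trace feels overwhelming, the final line can be pulled out on its own. This is a small sketch using the standard-library `traceback` module on a made-up two-row dataframe; it is just one possible way to focus on the message, not part of the original exercise.

```python
import traceback

import pandas as pd

# Made-up data with the same column names as the iris example.
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

try:
    df[["sepal_lingth", "sepal_width"]]
except KeyError as exc:
    # format_exception_only drops the stack trace and keeps "Type: message".
    last_line = traceback.format_exception_only(type(exc), exc)[-1].strip()
    print(last_line)
```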
%% Cell type:markdown id:90dce1de tags:
# Types
%% Cell type:code id:0783d68c tags:
``` python
iris_dataframe.dtypes
```
%%%% Output: execute_result
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
%% Cell type:markdown id:9a38303b tags:
**Question**: if there are NaN values, would the datatype be 'object'?
**Answer**: not necessarily; numeric columns can still have NaN values. More specifically:
* Floating-point (type = 'float') columns can have NaN values.
* A default integer column **cannot** hold NaN values. If you include NaN values in an integer column, the entire column is automatically converted to floating point. This is because the NumPy integer type that pandas uses by default has no way to represent a missing value.
* Object columns typically indicate string values. A common example is un-converted datetimes read from csv. Use pd.to_datetime to convert these to pandas' native datetime type so that they are handled properly.
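The points above can be checked on a few made-up series. Note that the nullable `"Int64"` dtype in this sketch goes beyond what the notebook covers; it is included only to show that pandas does offer an opt-in integer type that tolerates missing values.

```python
import pandas as pd

whole_numbers = pd.Series([1, 2, 3])       # integers only
with_missing = pd.Series([1, 2, None])     # the missing value forces float
nullable = pd.Series([1, 2, None], dtype="Int64")  # opt-in nullable integers

# Date strings stay as text until converted explicitly.
dates = pd.to_datetime(pd.Series(["2021-08-30", "2021-09-06"]))

print(whole_numbers.dtype)
print(with_missing.dtype)
print(nullable.dtype)
print(dates.dtype)
```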
%% Cell type:code id:6c071566 tags:
``` python
# There is also a categorical type conversion we can do:
iris_dataframe["species"].astype('category')
```
%%%% Output: execute_result
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: category
Categories (3, object): ['setosa', 'versicolor', 'virginica']
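A categorical column stores each distinct label once and refers to it by an integer code. The following sketch, on a made-up four-entry series, shows both pieces through the `.cat` accessor.

```python
import pandas as pd

species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])
as_category = species.astype("category")

# The distinct labels (sorted) and the integer code assigned to each entry.
categories = list(as_category.cat.categories)
codes = list(as_category.cat.codes)

print(categories)
print(codes)
```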
%% Cell type:markdown id:cc07ad2e tags:
# Plots
Similar errors occur when using pandas + seaborn.
%% Cell type:code id:51755d1b tags:
``` python
# Q5 -> what's the result of this command?
# A -> a scatter plot (this is the default for relplot). In this case
# the function call indicates that the plot data should be sourced from
# the 'iris_dataframe' dataframe. This means all columns referred to must
# be present in this dataframe. Here the x/y position of points comes
# from sepal_width/sepal_length, and the category colour comes from the
# species column. Hence we get 3 unique colours.
sns.relplot(
data=iris_dataframe, # source dataframe
x="sepal_width",
y="sepal_length",
hue='species',
)
```
%%%% Output: execute_result
<seaborn.axisgrid.FacetGrid at 0x7fa44d9faa30>
%%%% Output: display_data
![Scatter plot of sepal_length against sepal_width, coloured by species]()
%% Cell type:code id:2ea6bc0e tags:
``` python
# Q6 -> what's the result of this command?
# A -> ValueError. We mis-spelled a column, but with seaborn we get
# a ValueError instead of a KeyError. Why? I'm not really sure, it's
# just a different choice made by the developers of seaborn vs. pandas.
# But the take-home message is: ignore the error type. Look for the
# first ----> in the stack trace to figure out which line of your code
# the error came from, and read the final error message carefully.
sns.relplot(
data=iris_dataframe,
x="sepal_wodth",
y="sepal_length",
hue='species',
)
```
%%%% Output: error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_9573/3922716469.py in <module>
6 # first ----> in the stack trace to figure out which line of your code
7 # the error came from, and read the final error message carefully.
----> 8 sns.relplot(
9 data=iris_dataframe,
10 x="sepal_wodth",
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/seaborn/_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/seaborn/relational.py in relplot(x, y, hue, size, style, data, row, col, col_wrap, row_order, col_order, palette, hue_order, hue_norm, sizes, size_order, size_norm, markers, dashes, style_order, legend, kind, height, aspect, facet_kws, units, **kwargs)
945
946 # Use the full dataset to map the semantics
--> 947 p = plotter(
948 data=data,
949 variables=plotter.get_semantics(locals()),
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/seaborn/relational.py in __init__(self, data, variables, x_bins, y_bins, estimator, ci, n_boot, alpha, x_jitter, y_jitter, legend)
585 )
586
--> 587 super().__init__(data=data, variables=variables)
588
589 self.alpha = alpha
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/seaborn/_core.py in __init__(self, data, variables)
603 def __init__(self, data=None, variables={}):
604
--> 605 self.assign_variables(data, variables)
606
607 for var, cls in self._semantic_mappings.items():
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/seaborn/_core.py in assign_variables(self, data, variables)
666 else:
667 self.input_format = "long"
--> 668 plot_data, variables = self._assign_variables_longform(
669 data, **variables,
670 )
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/seaborn/_core.py in _assign_variables_longform(self, data, **kwargs)
901
902 err = f"Could not interpret value `{val}` for parameter `{key}`"
--> 903 raise ValueError(err)
904
905 else:
ValueError: Could not interpret value `sepal_wodth` for parameter `x`
%% Cell type:markdown id:b141373e tags:
# KeyError vs ValueError
**Question**: What's the difference between key error and value error?
**Answer**: All Python errors have types. Typically KeyError means "you
tried to look something up, but I couldn't find it" and ValueError means
"you gave me some input I don't know how to deal with". More important
than the type, though, is the error message itself, which should give you
a clue as to the root cause of the error.
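The two error types can be reproduced with plain Python, without pandas or seaborn. In this sketch the `lookup` dictionary is a made-up example; the second case mirrors the kind of string-to-number conversion that fails inside library code.

```python
lookup = {"setosa": 0, "versicolor": 1}
messages = []

try:
    lookup["virginica"]          # the key is not in the dictionary
except KeyError as exc:
    messages.append(f"{type(exc).__name__}: {exc}")

try:
    float("setosa")              # right kind of input (a string), unusable value
except ValueError as exc:
    messages.append(f"{type(exc).__name__}: {exc}")

for message in messages:
    print(message)
```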
%% Cell type:markdown id:b6f67b8a tags:
# Split, Fit, Predict, Evaluate
%% Cell type:code id:4f788487 tags:
``` python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
```
%% Cell type:code id:46226fe1 tags:
``` python
# Gives a series of true/false values. Specifically, we extract the 'species'
# column and compare every entry to the string 'versicolor'.
iris_dataframe['species'] == 'versicolor'
```
%%%% Output: execute_result
0 False
1 False
2 False
3 False
4 False
...
145 False
146 False
147 False
148 False
149 False
Name: species, Length: 150, dtype: bool
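A boolean series like this is often called a mask. As a rough sketch on a made-up four-row dataframe, the mask can be summed to count matches, used to filter rows, or converted to 0/1 integers.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "versicolor"],
    "sepal_length": [5.1, 6.0, 7.7, 5.6],
})

is_versicolor = df["species"] == "versicolor"   # True/False for every row

n_versicolor = int(is_versicolor.sum())         # True counts as 1
versicolor_rows = df[is_versicolor]             # keep only the True rows
as_integers = is_versicolor.astype(int)         # 0/1 instead of False/True

print(n_versicolor)
print(versicolor_rows)
```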
%% Cell type:code id:1f0a9108 tags:
``` python
# From the Logistic Regression notebook.
# Here we create another column using the boolean column from the previous cell.
iris_dataframe['is_versicolor'] = (iris_dataframe['species'] == 'versicolor')
# We then convert it to a 0/1 value (integer type). This allowed me to plot
# the classification on the y-axis in the examples.
iris_dataframe['is_versicolor'] = iris_dataframe['is_versicolor'].astype(int)
# Select 10 random rows. Otherwise we'd only see the first and last few rows.
# None of those are 'versicolor' ... so my fancy new column would appear
# to be all zeroes!
iris_dataframe.sample(10)
```
%%%% Output: execute_result
sepal_length sepal_width petal_length petal_width species \
49 5.0 3.3 1.4 0.2 setosa
97 6.2 2.9 4.3 1.3 versicolor
118 7.7 2.6 6.9 2.3 virginica
58 6.6 2.9 4.6 1.3 versicolor
141 6.9 3.1 5.1 2.3 virginica
104 6.5 3.0 5.8 2.2 virginica
50 7.0 3.2 4.7 1.4 versicolor
134 6.1 2.6 5.6 1.4 virginica
139 6.9 3.1 5.4 2.1 virginica
120 6.9 3.2 5.7 2.3 virginica
is_versicolor
49 0
97 1
118 0
58 1
141 0
104 0
50 1
134 0
139 0
120 0
%% Cell type:code id:e524f0ca tags:
``` python
# Just to verify: our dataset now has 150 rows and 6 columns.
print(iris_dataframe.shape)
iris_dataframe.head()
```
%%%% Output: execute_result
sepal_length sepal_width petal_length petal_width species is_versicolor
0 5.1 3.5 1.4 0.2 setosa 0
1 4.9 3.0 1.4 0.2 setosa 0
2 4.7 3.2 1.3 0.2 setosa 0
3 4.6 3.1 1.5 0.2 setosa 0
4 5.0 3.6 1.4 0.2 setosa 0
%% Cell type:code id:d0969fdb tags:
``` python
# A not very helpful help message, indicating that we should pass this
# function a 'sequence of indexables'. This is because sklearn functions
# have more generalist capabilities than what we have seen so far.
help(train_test_split)
```
%% Cell type:code id:8d03ffde tags:
``` python
# Q7 - how many rows and columns in X_train & y_train? 100 rows, 2 & 1 columns respectively
# Q8 - how many rows and columns in X_test & y_test? 50 rows, 2 & 1 columns respectively
X_train, X_test, y_train, y_test = train_test_split(
iris_dataframe[['sepal_width', 'sepal_length']], # two columns of feature values
iris_dataframe['is_versicolor'], # target values (series, one column)
test_size=0.33, # about one third of the data (50 rows) goes into the test set
random_state=45, # seed value: same seed value --> same split every time
)
X_train
```
%%%% Output: execute_result
sepal_width sepal_length
45 3.0 4.8
66 3.0 5.6
128 2.8 6.4
48 3.7 5.3
144 3.3 6.7
.. ... ...
68 2.2 6.2
95 3.0 5.7
32 4.1 5.2
124 3.3 6.7
131 3.8 7.9
[100 rows x 2 columns]
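The answers to Q7 and Q8 can be verified by printing shapes. The sketch below substitutes randomly generated values for the real measurements (an assumption, to avoid loading the dataset), since only the number of rows matters for the split sizes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 150 rows of made-up feature values standing in for the iris measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sepal_width": rng.normal(3.0, 0.4, size=150),
    "sepal_length": rng.normal(5.8, 0.8, size=150),
    "is_versicolor": np.repeat([0, 1, 0], 50),
})

features = df[["sepal_width", "sepal_length"]]
target = df["is_versicolor"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=45
)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# Repeating the call with the same seed gives the same split.
X_train_again, _, _, _ = train_test_split(
    features, target, test_size=0.33, random_state=45
)
same_split = X_train.index.equals(X_train_again.index)
```

Because the features and target are split together, `X_train` and `y_train` share the same row labels, which is what `fit` relies on.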
%% Cell type:code id:3b155d37 tags:
``` python
# Using fit-predict: follow the outputs of train_test_split
model = LogisticRegression()
model.fit(X_train, y_train) # feature values and corresponding true target values
model.predict(X_test) # feature values only -> returns predicted target values
```
%%%% Output: execute_result
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 1, 1, 0, 0])
%% Cell type:code id:bd5e68a0 tags:
``` python
# An error: we passed predict some columns that it couldn't deal with
# (since we never fitted the model to all columns, only 'sepal_width'
# and 'sepal_length')
model.predict(iris_dataframe)
```
%%%% Output: error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_9573/331035052.py in <module>
2 # (since we never fitted the model to all columns, only 'sepal_width'
3 # and 'sepal_length')
----> 4 model.predict(iris_dataframe)
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/sklearn/linear_model/_base.py in predict(self, X)
307 Predicted class label per sample.
308 """
--> 309 scores = self.decision_function(X)
310 if len(scores.shape) == 1:
311 indices = (scores > 0).astype(int)
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
282 check_is_fitted(self)
283
--> 284 X = check_array(X, accept_sparse='csr')
285
286 n_features = self.coef_.shape[1]
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
671 array = array.astype(dtype, casting="unsafe", copy=False)
672 else:
--> 673 array = np.asarray(array, order=order, dtype=dtype)
674 except ComplexWarning as complex_warning:
675 raise ValueError("Complex data not supported\n"
~/.pyenv/versions/3.9.6/lib/python3.9/site-packages/pandas/core/generic.py in __array__(self, dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
ValueError: could not convert string to float: 'setosa'
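A simple way to avoid this error is to keep the list of feature columns in one variable and use it for both `fit` and `predict`. The sketch below assumes made-up data and a hypothetical `feature_columns` name; it is one possible pattern rather than the notebook's own fix.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data with the same column layout as the modified iris dataframe.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sepal_width": rng.normal(3.0, 0.4, size=90),
    "sepal_length": rng.normal(5.8, 0.8, size=90),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 30),
})
df["is_versicolor"] = (df["species"] == "versicolor").astype(int)

# Hypothetical helper variable: the columns the model is fitted on.
feature_columns = ["sepal_width", "sepal_length"]

model = LogisticRegression()
model.fit(df[feature_columns], df["is_versicolor"])

# Predict using the same columns, not the whole dataframe.
predictions = model.predict(df[feature_columns])
print(predictions.shape)
```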
%% Cell type:code id:1844e841 tags:
``` python
# Q9 - what happens here, and why?
# A: we passed the wrong set of target values. The error indicates a length mismatch.
model = LogisticRegression()
model.fit(X_train, y_test)
```
%%%% Output: error