Commit 03ac5b22 authored by Simon Bowly

Add alternative command for centered rolling mean.

parent da62bcaf
%% Cell type:markdown id: tags:
# Handling Timeseries Data in Pandas
* Topic: Data manipulation
* Unit: ADS1002
* Level: Beginner
* Authors: Simon Bowly
* Version: 3.1
Required files (download these from the Gitlab site [here](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Pandas-DataFrames/Time-Series) into the same directory as the notebook on your computer):
* [traffic-data.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Pandas-DataFrames/Time-Series/traffic-data.csv)
The objective of this notebook is to give some background to time series data and to introduce you to the Pandas commands needed to manipulate time series. Most of the projects this semester involve working with some form of time series, and you will most likely need to filter, plot, and aggregate various parts of the data along the way. If your data fits this description, spend some time trying the commands introduced here with your dataset.
%% Cell type:markdown id: tags:
## Time Series Data
A time series is any dataset which varies over time. This usually means one or more measurements taken at some repeating interval. Some datasets you will work with will be measured at low frequencies, e.g. daily temperature forecasts, monthly sales numbers, and annual revenue, while others may be measured at high frequencies, for example hourly power usage, or millisecond frequency data from medical imaging equipment.
Pandas will handle time series data well if it is in the right format. It expects that rows in a dataframe are time-based observations, and columns are different measurements taken at the same time. As an example, each data `Series` (single column) should be structured like this:
%% Cell type:code id: tags:
``` python
import pandas as pd
import matplotlib.pyplot as plt
# Create a datetime index, with 6 hourly timestamps.
index = pd.date_range(start="2020-03-02", periods=6, freq='H')
# Some data (just some numbers increasing over time).
data = [1, 2, 3, 4, 5, 6]
# Construct a series using the index and data.
series = pd.Series(data, index=index)
# Show the index and the series.
print(index)
series
```
%%%% Output: stream
DatetimeIndex(['2020-03-02 00:00:00', '2020-03-02 01:00:00',
'2020-03-02 02:00:00', '2020-03-02 03:00:00',
'2020-03-02 04:00:00', '2020-03-02 05:00:00'],
dtype='datetime64[ns]', freq='H')
%%%% Output: execute_result
2020-03-02 00:00:00 1
2020-03-02 01:00:00 2
2020-03-02 02:00:00 3
2020-03-02 03:00:00 4
2020-03-02 04:00:00 5
2020-03-02 05:00:00 6
Freq: H, dtype: int64
%% Cell type:markdown id: tags:
and a multi-column `DataFrame` should be structured like this:
%% Cell type:code id: tags:
``` python
# A datetime index, with 6 hourly timestamps.
index = pd.date_range(start="2020-03-02", periods=6, freq='H')
# Construct a dataframe (two columns) using the index and data.
dataframe = pd.DataFrame(
data={
"field1": [1, 2, 3, 4, 5, 6],
"field2": ["a", "b", "c", "d", "e", "f"],
},
index=index,
)
dataframe
```
%%%% Output: execute_result
field1 field2
2020-03-02 00:00:00 1 a
2020-03-02 01:00:00 2 b
2020-03-02 02:00:00 3 c
2020-03-02 03:00:00 4 d
2020-03-02 04:00:00 5 e
2020-03-02 05:00:00 6 f
%% Cell type:markdown id: tags:
Of course, you'll mostly be loading prepared datasets from another format, rather than constructing them this way, but it is useful to know how these datasets are assembled.
The examples below use road traffic data from several intersections around Melbourne. Values in the dataset represent the number of cars counted at a specific intersection in a given 15 minute period. We'll look at:
* converting types to ensure we have the correct time series layout;
* filtering subsets of the data;
* plotting the data against time; and
* grouping & aggregating data using time series functions.
%% Cell type:markdown id: tags:
## Loading Data
First, we load the dataset from csv. To use pandas to its fullest potential here, we need to ensure two things:
1. the data is in a format where each row is a single time step, and each column is a different observation at that time; and
2. data types are set appropriately (in particular, the `datetime` data type is used to index the data).
**Please note** that it may take more work to get your particular dataset into this format, as some time series data is stored in a different layout. I haven't included specific examples here (since each case may be different) so ask the teaching staff how to approach your particular case.
%% Cell type:code id: tags:
``` python
raw_traffic_data = pd.read_csv("traffic-data.csv")
raw_traffic_data
```
%%%% Output: execute_result
Timestamp Site 100 Site 101 Site 102 Site 103 Site 105 \
0 2021-01-01 00:00:00 66.0 18.0 46.0 19.0 97.0
1 2021-01-01 00:15:00 124.0 83.0 88.0 57.0 231.0
2 2021-01-01 00:30:00 121.0 102.0 83.0 91.0 252.0
3 2021-01-01 00:45:00 130.0 115.0 112.0 59.0 279.0
4 2021-01-01 01:00:00 160.0 115.0 107.0 49.0 226.0
... ... ... ... ... ... ...
2971 2021-01-31 22:45:00 79.0 51.0 70.0 0.0 153.0
2972 2021-01-31 23:00:00 94.0 44.0 40.0 0.0 148.0
2973 2021-01-31 23:15:00 61.0 36.0 42.0 0.0 120.0
2974 2021-01-31 23:30:00 59.0 45.0 42.0 0.0 95.0
2975 2021-01-31 23:45:00 61.0 30.0 41.0 18.0 72.0
Site 106 Site 107 Site 108 Site 109
0 39.0 66.0 31.0 51.0
1 79.0 207.0 112.0 109.0
2 96.0 272.0 106.0 143.0
3 89.0 278.0 87.0 126.0
4 86.0 235.0 98.0 111.0
... ... ... ... ...
2971 45.0 145.0 55.0 56.0
2972 31.0 134.0 54.0 43.0
2973 34.0 118.0 38.0 42.0
2974 23.0 96.0 40.0 25.0
2975 11.0 70.0 26.0 17.0
[2976 rows x 10 columns]
%% Cell type:markdown id: tags:
This dataset is already in the correct layout: each row corresponds to a single timestamp, and there are multiple measurements (columns). All that remains is to check the data types.
%% Cell type:code id: tags:
``` python
# Show data types for all columns in the dataframe.
raw_traffic_data.dtypes
```
%%%% Output: execute_result
Timestamp object
Site 100 float64
Site 101 float64
Site 102 float64
Site 103 float64
Site 105 float64
Site 106 float64
Site 107 float64
Site 108 float64
Site 109 float64
dtype: object
%% Cell type:markdown id: tags:
We can see above that our count values are in `float` format, which is fine, but the 'Timestamp' column is in object format. Looking a little more closely we will see that these are actually stored as strings.
%% Cell type:code id: tags:
``` python
# Get the 'Timestamp' column from the first row.
single_timestamp = raw_traffic_data["Timestamp"].iloc[0]
# Print the value and its type.
print("The first timestamp has value '{}' and data type {}".format(single_timestamp, type(single_timestamp)))
```
%%%% Output: stream
The first timestamp has value '2021-01-01 00:00:00' and data type <class 'str'>
%% Cell type:markdown id: tags:
This is a typical outcome when loading data from csv. In most cases, this will be easy enough to convert using `pd.to_datetime`. We also want the timestamps to be an index in the data, rather than a regular column. So we create a new dataframe as follows:
%% Cell type:code id: tags:
``` python
# For reference; this is the result of a datetime conversion.
# Note that the 'dtype' for this converted data is now 'datetime64'.
pd.to_datetime(raw_traffic_data["Timestamp"])
```
%%%% Output: execute_result
0 2021-01-01 00:00:00
1 2021-01-01 00:15:00
2 2021-01-01 00:30:00
3 2021-01-01 00:45:00
4 2021-01-01 01:00:00
...
2971 2021-01-31 22:45:00
2972 2021-01-31 23:00:00
2973 2021-01-31 23:15:00
2974 2021-01-31 23:30:00
2975 2021-01-31 23:45:00
Name: Timestamp, Length: 2976, dtype: datetime64[ns]
%% Cell type:code id: tags:
``` python
# Create a new index from the timestamp column, with the proper type.
traffic_data = raw_traffic_data.set_index(pd.to_datetime(raw_traffic_data["Timestamp"]))
# Delete the column with our old string representation of times.
traffic_data = traffic_data.drop(columns=["Timestamp"])
# Show the index and the dataframe.
print(traffic_data.index)
traffic_data
```
%%%% Output: stream
DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 00:15:00',
'2021-01-01 00:30:00', '2021-01-01 00:45:00',
'2021-01-01 01:00:00', '2021-01-01 01:15:00',
'2021-01-01 01:30:00', '2021-01-01 01:45:00',
'2021-01-01 02:00:00', '2021-01-01 02:15:00',
...
'2021-01-31 21:30:00', '2021-01-31 21:45:00',
'2021-01-31 22:00:00', '2021-01-31 22:15:00',
'2021-01-31 22:30:00', '2021-01-31 22:45:00',
'2021-01-31 23:00:00', '2021-01-31 23:15:00',
'2021-01-31 23:30:00', '2021-01-31 23:45:00'],
dtype='datetime64[ns]', name='Timestamp', length=2976, freq=None)
%%%% Output: execute_result
Site 100 Site 101 Site 102 Site 103 Site 105 \
Timestamp
2021-01-01 00:00:00 66.0 18.0 46.0 19.0 97.0
2021-01-01 00:15:00 124.0 83.0 88.0 57.0 231.0
2021-01-01 00:30:00 121.0 102.0 83.0 91.0 252.0
2021-01-01 00:45:00 130.0 115.0 112.0 59.0 279.0
2021-01-01 01:00:00 160.0 115.0 107.0 49.0 226.0
... ... ... ... ... ...
2021-01-31 22:45:00 79.0 51.0 70.0 0.0 153.0
2021-01-31 23:00:00 94.0 44.0 40.0 0.0 148.0
2021-01-31 23:15:00 61.0 36.0 42.0 0.0 120.0
2021-01-31 23:30:00 59.0 45.0 42.0 0.0 95.0
2021-01-31 23:45:00 61.0 30.0 41.0 18.0 72.0
Site 106 Site 107 Site 108 Site 109
Timestamp
2021-01-01 00:00:00 39.0 66.0 31.0 51.0
2021-01-01 00:15:00 79.0 207.0 112.0 109.0
2021-01-01 00:30:00 96.0 272.0 106.0 143.0
2021-01-01 00:45:00 89.0 278.0 87.0 126.0
2021-01-01 01:00:00 86.0 235.0 98.0 111.0
... ... ... ... ...
2021-01-31 22:45:00 45.0 145.0 55.0 56.0
2021-01-31 23:00:00 31.0 134.0 54.0 43.0
2021-01-31 23:15:00 34.0 118.0 38.0 42.0
2021-01-31 23:30:00 23.0 96.0 40.0 25.0
2021-01-31 23:45:00 11.0 70.0 26.0 17.0
[2976 rows x 9 columns]
%% Cell type:markdown id: tags:
Success! The data is now in native pandas time series format: the index is a `DatetimeIndex` and the remaining columns contain only our observed data, not the timestamp values.
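The conversion and indexing steps above can probably also be folded into the load step. The sketch below is a minimal illustration, assuming a CSV with the same `Timestamp` column name; the inline `sample_csv` text is a stand-in for `traffic-data.csv` (three rows and two site columns copied from the output above) so that the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for traffic-data.csv: a few rows copied from the output above,
# with only two of the site columns (an assumption made to keep this short).
sample_csv = io.StringIO(
    "Timestamp,Site 100,Site 101\n"
    "2021-01-01 00:00:00,66.0,18.0\n"
    "2021-01-01 00:15:00,124.0,83.0\n"
    "2021-01-01 00:30:00,121.0,102.0\n"
)

# parse_dates converts the named column to datetime64 while reading, and
# index_col makes it the index, so no separate set_index/drop step is needed.
sample = pd.read_csv(sample_csv, parse_dates=["Timestamp"], index_col="Timestamp")

# The index should be a DatetimeIndex, as in the two-step version.
print(sample.index)
```

The two-step version is still worth knowing, since it also works on a dataframe that has already been loaded or built some other way.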
%% Cell type:markdown id: tags:
## Selecting Data
Generally you will want to work with a subset of the data, particularly when it comes to visualisation. The following commands show how to select subsets of the data by specifying two dates as endpoints.
%% Cell type:code id: tags:
``` python
# Select a 2 hour period between two timestamps.
# The syntax is similar to other pandas commands - start and end of the interval,
# in square brackets, separated by a colon. Note that both endpoints are included.
morning_data = traffic_data[pd.Timestamp("2021-01-04 06:00:00"):pd.Timestamp("2021-01-04 08:00:00")]
morning_data
```
%%%% Output: execute_result
Site 100 Site 101 Site 102 Site 103 Site 105 \
Timestamp
2021-01-04 06:00:00 70.0 84.0 141.0 64.0 224.0
2021-01-04 06:15:00 108.0 94.0 184.0 85.0 269.0
2021-01-04 06:30:00 157.0 114.0 198.0 107.0 331.0
2021-01-04 06:45:00 132.0 127.0 227.0 101.0 354.0
2021-01-04 07:00:00 143.0 142.0 256.0 110.0 355.0
2021-01-04 07:15:00 167.0 165.0 243.0 119.0 372.0
2021-01-04 07:30:00 217.0 156.0 298.0 127.0 512.0
2021-01-04 07:45:00 240.0 204.0 346.0 150.0 606.0
2021-01-04 08:00:00 192.0 174.0 322.0 143.0 479.0
Site 106 Site 107 Site 108 Site 109
Timestamp
2021-01-04 06:00:00 87.0 241.0 72.0 82.0
2021-01-04 06:15:00 149.0 324.0 72.0 135.0
2021-01-04 06:30:00 152.0 378.0 109.0 180.0
2021-01-04 06:45:00 138.0 355.0 107.0 171.0
2021-01-04 07:00:00 142.0 415.0 123.0 181.0
2021-01-04 07:15:00 166.0 478.0 139.0 203.0
2021-01-04 07:30:00 188.0 484.0 158.0 239.0
2021-01-04 07:45:00 184.0 570.0 181.0 239.0
2021-01-04 08:00:00 176.0 603.0 190.0 239.0
%% Cell type:markdown id: tags:
An alternative way to write this is to use `pd.Timedelta` to specify a start time and the duration of the interval.
%% Cell type:code id: tags:
``` python
# Store a fixed start time in a variable.
start = pd.Timestamp("2021-01-04 06:00:00")
# Add an offset to use as the end time (note: you can use any time unit
# here; try e.g. days=2 and check the range of values returned).
traffic_data[start:start + pd.Timedelta(hours=1)]
```
%%%% Output: execute_result
Site 100 Site 101 Site 102 Site 103 Site 105 \
Timestamp
2021-01-04 06:00:00 70.0 84.0 141.0 64.0 224.0
2021-01-04 06:15:00 108.0 94.0 184.0 85.0 269.0
2021-01-04 06:30:00 157.0 114.0 198.0 107.0 331.0
2021-01-04 06:45:00 132.0 127.0 227.0 101.0 354.0
2021-01-04 07:00:00 143.0 142.0 256.0 110.0 355.0
Site 106 Site 107 Site 108 Site 109
Timestamp
2021-01-04 06:00:00 87.0 241.0 72.0 82.0
2021-01-04 06:15:00 149.0 324.0 72.0 135.0
2021-01-04 06:30:00 152.0 378.0 109.0 180.0
2021-01-04 06:45:00 138.0 355.0 107.0 171.0
2021-01-04 07:00:00 142.0 415.0 123.0 181.0
%% Cell type:markdown id: tags:
Of course we can also select columns as normal, along with selecting a time range.
%% Cell type:code id: tags:
``` python
start = pd.Timestamp("2021-01-04 06:00:00")
end = start + pd.Timedelta(hours=1)
# Separate the row and column filters by a comma. Also note that
# we must use .loc[] syntax when specifying both rows and columns.
traffic_data.loc[start:end, ["Site 100", "Site 101"]]
```
%%%% Output: execute_result
Site 100 Site 101
Timestamp
2021-01-04 06:00:00 70.0 84.0
2021-01-04 06:15:00 108.0 94.0
2021-01-04 06:30:00 157.0 114.0
2021-01-04 06:45:00 132.0 127.0
2021-01-04 07:00:00 143.0 142.0
%% Cell type:markdown id: tags:
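The time range selections above can also be written with plain date strings, which pandas interprets against a `DatetimeIndex`. The sketch below uses a small made-up dataframe (`demo`, a hypothetical stand-in for `traffic_data` with two days of 15 minute rows) rather than the real file, so the row counts are easy to check by hand.

```python
import pandas as pd

# Hypothetical stand-in for traffic_data: two days of 15 minute counts.
index = pd.date_range(start="2021-01-04", periods=192, freq="15min")
demo = pd.DataFrame({"Site 100": range(192)}, index=index)

# A date string without a time selects every row on that day.
one_day = demo.loc["2021-01-04"]

# Strings can also be used as slice endpoints; both ends are included.
morning = demo.loc["2021-01-04 06:00":"2021-01-04 08:00"]

# between_time keeps the same time-of-day window on every day in the data.
every_morning = demo.between_time("06:00", "08:00")

print(len(one_day), len(morning), len(every_morning))
```

Explicit `pd.Timestamp` endpoints, as used earlier, are the safer choice when the endpoints are computed (for example with `pd.Timedelta`); the string forms are mainly a convenience for quick exploration.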
## Plotting
Pandas has an in-built set of plotting methods (some of which you've seen before). While `seaborn` is most useful for plotting relationships between data, the plotting commands built into `pandas` are easier to work with when visualising time series. We'll use a subset of the data in these plots, as plotting the entire dataset becomes very cluttered.
%% Cell type:code id: tags:
``` python
start = pd.Timestamp("2021-01-04 06:00:00")
end = start + pd.Timedelta(hours=6)
morning_data_three_sites = traffic_data.loc[start:end, ["Site 100", "Site 101", "Site 103"]]
# Plots each column as a different line, over the time range in the filtered dataframe.
morning_data_three_sites.plot.line();
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Plot types other than the standard line plot can be produced in the same way. An area plot gives a 'stacked' view of the data (each series is drawn on top of the previous one). This helps to show both the relative size of each series and the total of all series at a given time.
Note you can use the `figsize` argument in any of these plot commands to expand the area of the plot in the notebook. The argument to figsize is a (width, height) tuple.
%% Cell type:code id: tags:
``` python
morning_data_three_sites.plot.area(figsize=(10, 5));
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Resampling
Resampling helps to aggregate data based on time. It is used when we have relatively high frequency data and need to 'downsample' to more useful statistics at a lower frequency. A useful result for this dataset might be to get daily traffic numbers by adding up all the 15 minute measurement windows.
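To make the idea concrete before applying it to the traffic data, here is a minimal sketch on made-up numbers: eight hypothetical 15 minute counts are grouped into one hour bins and summed. The window is written as `"60min"` because the spelling of the hourly alias appears to differ between pandas versions.

```python
import pandas as pd

# Hypothetical data: eight 15 minute counts covering two hours.
index = pd.date_range(start="2021-01-01 00:00", periods=8, freq="15min")
counts = pd.Series([10, 20, 30, 40, 1, 2, 3, 4], index=index)

# Group the timestamps into one hour bins and add up the counts in each bin.
hourly = counts.resample("60min").sum()
print(hourly)
```

The first four values fall in the 00:00 bin and the last four in the 01:00 bin, so the result has two rows, labelled by the start of each hour.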
Note that any of the usual statistics (e.g. min/max/mean/median...) can be used in place of `.sum()` here, to calculate different statistics.
%% Cell type:code id: tags:
``` python
three_sites = traffic_data[["Site 103", "Site 105", "Site 106"]]
# Compute the sum of all 15 minute counts over each day.
# 1D = group by one day periods
daily = three_sites.resample("1D").sum()
daily.head()
```
%%%% Output: execute_result
Site 103 Site 105 Site 106
Timestamp
2021-01-01 10107.0 25724.0 11707.0
2021-01-02 12073.0 33606.0 14556.0
2021-01-03 11007.0 31182.0 12367.0
2021-01-04 12816.0 40829.0 15377.0
2021-01-05 13312.0 42254.0 16466.0
%% Cell type:markdown id: tags:
Note that the resulting dataframe now has an index with only daily entries, no specific times. Plotting the result as normal allows us to show a visual comparison of overall daily traffic at each site.
%% Cell type:code id: tags:
``` python
daily.plot.line();
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
## Rolling Mean
Rolling statistics (especially rolling averages) can be useful to smooth out time series data, correcting for minor fluctuations or measurement issues in order to observe general trends. They can also be used to compute "anomaly" measures, which compare observed values at a single time period to long- or short-term averages. A rolling mean takes a 'window' around each timestamp, of a given duration, and computes the average of all values in the window.
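As a small illustration of the window idea, the sketch below uses five made-up hourly values with a single spike. It compares a window that ends at each timestamp with one centered on it, and computes a simple "anomaly" as the difference between each value and its centered average. The variable names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical data: five hourly values with one spike in the middle.
index = pd.date_range(start="2021-01-01", periods=5, freq="60min")
values = pd.Series([1.0, 1.0, 10.0, 1.0, 1.0], index=index)

# Window of 3 observations ending at each timestamp (the default alignment).
trailing = values.rolling(3).mean()

# Window of 3 observations centered on each timestamp.
centered = values.rolling(3, center=True).mean()

# A simple "anomaly" measure: how far each value is from its local average.
anomaly = values - centered

print(pd.DataFrame({"value": values, "trailing": trailing,
                    "centered": centered, "anomaly": anomaly}))
```

With an integer window, positions that do not have a full window of observations are left as `NaN`: the first two rows for the trailing mean, and the first and last rows for the centered one.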
Computing rolling means is straightforward since our data is in the correct format. The code below plots the data against a rolling mean. Note that the 'centered' rolling mean is perhaps more informative here, since it will use 2 hours either side. The default rolling mean is computed based on 4 hours of data before the current time.
%% Cell type:code id: tags:
``` python
start = pd.Timestamp("2021-01-04 06:00:00")
end = start + pd.Timedelta(days=2)
# This gives us just a single series (for one site).
site_105 = traffic_data.loc[start:end, "Site 105"]
# Compute rolling means (4H = 4 hour rolling window).
rolling_mean = site_105.rolling('4H').mean()
# These two commands (the first is commented out since it only
# works in newer pandas versions) give a centered rolling mean. It
# is more convenient to use '4H' here, but 16 time periods (since
# the data is at 15 minute frequency) is roughly equivalent in this
# case. The integer window leaves NaN at the edges of the series,
# where fewer than 16 observations are available.
# centered_rolling_mean = site_105.rolling('4H', center=True).mean()
centered_rolling_mean = site_105.rolling(16, center=True).mean()
# Plot each series. We can use the 'label' argument to label them
# individually since they are separate series.
site_105.plot.line(label='Data', figsize=(10, 4))
rolling_mean.plot.line(label='Rolling Mean')
centered_rolling_mean.plot.line(label='Centered Rolling Mean')
# Add a legend to show the labels.
plt.legend();
```
%%%% Output: display_data
![]()
![](