Commit dcc91dfb authored by Simon Bowly

Clean up pandas merging notebooks.

parent 9d5b0c55
%% Cell type:code id: tags:
``` python
import pandas as pd
exams = pd.read_csv('ExamMark.csv')
ca = pd.read_csv('CAMark.csv')
```
%% Cell type:code id: tags:
``` python
# Show the CA marks, then append one new student record
print(ca)
newrow = pd.DataFrame({'Student ID':[1537],'CA Mark':[55]})
ca = ca.append(newrow,ignore_index=True)
print(ca)
# Add a 'Course Code' column by concatenating along the column axis
newcolumn = pd.DataFrame([1,1,2,1,3,2],columns=['Course Code'])
print(newcolumn)
nca = pd.concat([ca,newcolumn],axis=1)
print(nca)
# Lookup table mapping each course code to a degree name
coursedict = pd.DataFrame({'Course Code':[1,2,3],
                           'Degree':['Bachelor of Science','Bachelor of Engineering','Bachelor of IT']})
print(coursedict)
```
%%%% Output: stream
Student ID CA Mark
0 6789 62
1 7410 8
2 7634 34
3 9016 69
4 9532 5
Student ID CA Mark
0 6789 62
1 7410 8
2 7634 34
3 9016 69
4 9532 5
5 1537 55
Course Code
0 1
1 1
2 2
3 1
4 3
5 2
Student ID CA Mark Course Code
0 6789 62 1
1 7410 8 1
2 7634 34 2
3 9016 69 1
4 9532 5 3
5 1537 55 2
Course Code Degree
0 1 Bachelor of Science
1 2 Bachelor of Engineering
2 3 Bachelor of IT
%% Cell type:code id: tags:
``` python
# merged = pd.merge(exams,ca)
merged = pd.merge(nca,coursedict,on='Course Code')
merged
```
%%%% Output: execute_result
Student ID CA Mark Course Code Degree
0 6789 62 1 Bachelor of Science
1 7410 8 1 Bachelor of Science
2 9016 69 1 Bachelor of Science
3 7634 34 2 Bachelor of Engineering
4 1537 55 2 Bachelor of Engineering
5 9532 5 3 Bachelor of IT
%% Cell type:code id: tags:
``` python
ca.columns = ['ID','CA Mark']
```
%% Cell type:code id: tags:
``` python
ca
```
%%%% Output: execute_result
ID CA Mark
0 6789 62
1 7410 8
2 7634 34
3 9016 69
4 9532 5
%% Cell type:code id: tags:
``` python
merged = pd.merge(exams,ca,left_on='Student ID',right_on='ID').drop('ID',axis=1)
```
%% Cell type:code id: tags:
``` python
merged
```
%%%% Output: execute_result
Student ID Firstname Lastname Exam Mark CA Mark
0 7634 James Brown 52 34
1 6789 Ella Fitzgerald 73 62
2 7410 Herbie Hancock 9 8
3 9016 Dolly Parton 87 69
4 9532 Keith Richards 81 5
%% Cell type:code id: tags:
``` python
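# Append one more CA record
newrow = pd.DataFrame({'ID':[1537],'CA Mark':[55]})
ca = ca.append(newrow)
# Convert the ID columns to strings so they can be used as index labels
ca['ID'] = ca['ID'].astype(str)
exams['Student ID'] = exams['Student ID'].astype(str)
# Index both dataframes by student ID; drop student 9532 from the CA marks
ca1 = ca.set_index('ID').drop('9532')
exams1 = exams.set_index('Student ID')
print(ca1)
print(exams1)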
```
%%%% Output: stream
CA Mark
ID
6789 62
7410 8
7634 34
9016 69
1537 55
1537 55
Firstname Lastname Exam Mark
Student ID
7634 James Brown 52
6789 Ella Fitzgerald 73
7410 Herbie Hancock 9
9016 Dolly Parton 87
9532 Keith Richards 81
%% Cell type:code id: tags:
``` python
merged = pd.merge(exams1,ca1,how='left',left_index=True,right_index=True)
merged
```
%%%% Output: execute_result
Firstname Lastname Exam Mark CA Mark
6789 Ella Fitzgerald 73 62.0
7410 Herbie Hancock 9 8.0
7634 James Brown 52 34.0
9016 Dolly Parton 87 69.0
9532 Keith Richards 81 NaN
%% Cell type:code id: tags:
``` python
merged = pd.merge(exams1,ca1,how='right',left_index=True,right_index=True)
merged
```
%%%% Output: execute_result
Firstname Lastname Exam Mark CA Mark
1537 NaN NaN NaN 55
1537 NaN NaN NaN 55
6789 Ella Fitzgerald 73.0 62
7410 Herbie Hancock 9.0 8
7634 James Brown 52.0 34
9016 Dolly Parton 87.0 69
%% Cell type:code id: tags:
``` python
merged = pd.merge(exams1,ca1,how='inner',left_index=True,right_index=True)
merged
```
%%%% Output: execute_result
Firstname Lastname Exam Mark CA Mark
6789 Ella Fitzgerald 73 62
7410 Herbie Hancock 9 8
7634 James Brown 52 34
9016 Dolly Parton 87 69
%% Cell type:code id: tags:
``` python
merged = pd.merge(exams1,ca1,how='outer',left_index=True,right_index=True)
merged
```
%%%% Output: execute_result
Firstname Lastname Exam Mark CA Mark
1537 NaN NaN NaN 55.0
1537 NaN NaN NaN 55.0
6789 Ella Fitzgerald 73.0 62.0
7410 Herbie Hancock 9.0 8.0
7634 James Brown 52.0 34.0
9016 Dolly Parton 87.0 69.0
9532 Keith Richards 81.0 NaN
%% Cell type:code id: tags:
``` python
Generation = pd.read_csv('EnergyGeneration.csv')
Usage = pd.read_csv('EnergyUsage.csv')
Generation['Time'] = pd.to_datetime(Generation['Time'], format="%I:%M:%S %p")
Usage['Time'] = pd.to_datetime(Usage['Time'], format="%I:%M:%S %p")
print(Usage.head())
print(Generation.head())
```
%%%% Output: stream
Time Energy Usage (kW)
0 1900-01-01 00:00:00 0.001851
1 1900-01-01 01:00:00 0.018957
2 1900-01-01 02:00:00 0.137190
3 1900-01-01 03:00:00 0.701568
4 1900-01-01 04:00:00 2.535264
Time Energy Generation (kW)
0 1900-01-01 00:00:00 0.0
1 1900-01-01 00:15:00 0.0
2 1900-01-01 00:30:00 0.0
3 1900-01-01 00:45:00 0.0
4 1900-01-01 01:00:00 0.0
%% Cell type:code id: tags:
``` python
merged = pd.merge(Usage,Generation,on='Time',how='left')
merged['Time'] = pd.to_datetime(merged['Time']).dt.strftime('%H:%M:%S')
merged
```
%%%% Output: execute_result
Time Energy Usage (kW) Energy Generation (kW)
0 00:00:00 0.001851 0.000000
1 01:00:00 0.018957 0.000000
2 02:00:00 0.137190 0.000000
3 03:00:00 0.701568 0.000000
4 04:00:00 2.535264 0.000000
5 05:00:00 6.474291 0.000000
6 06:00:00 11.684480 0.000000
7 07:00:00 14.908292 0.000000
8 08:00:00 13.473404 0.000000
9 09:00:00 8.729661 7.456000
10 10:00:00 4.409937 13.811556
11 11:00:00 2.714658 17.944889
12 12:00:00 3.655001 19.856000
13 13:00:00 6.607191 19.544889
14 14:00:00 10.927057 17.011556
15 15:00:00 15.576404 12.256000
16 16:00:00 19.036357 5.278222
17 17:00:00 19.938368 0.000000
18 18:00:00 17.896786 0.000000
19 19:00:00 13.767015 0.000000
20 20:00:00 9.075775 0.000000
21 21:00:00 5.127515 0.000000
22 22:00:00 2.482615 0.000000
23 23:00:00 1.030128 0.000000
%% Cell type:code id: tags:
``` python
merged.plot()
```
%%%% Output: execute_result
<matplotlib.axes._subplots.AxesSubplot at 0x11cad63d0>
%%%% Output: display_data
[Plot of the `merged` dataframe; image not available in this view]
%% Cell type:markdown id: tags:
# Merging and Joining with Pandas
%% Cell type:markdown id: tags:
One of the tasks that needs to be mastered in manipulating data is to merge and join dataframes. This process may be familiar to those who have used `vlookup` in Excel to merge two spreadsheets. In `pandas` this can be performed using the `df.append()`, `pd.concat()` and `pd.merge()` functions. In this lesson we use two simple data sets to explain the basics of merging and joining with pandas.
%% Cell type:markdown id: tags:
## Contents
%% Cell type:markdown id: tags:
* Merging unordered dataframes
* Merging timeseries
* Exercises
%% Cell type:markdown id: tags:
## Merging unordered dataframes
%% Cell type:markdown id: tags:
The first data sets relate to the marks for a hypothetical Musicology unit at Ashmon University. We have ExamMark.csv, with a set of ID numbers and exam marks out of 100. Similarly, CAMark.csv holds ID numbers and continuous assessment marks out of 80. StudentNames.csv contains the ID numbers and full names of the students who enrolled in this unit at the beginning of semester. The student names are kept separate from the marks until the marks are finalised, since all marking is done anonymously. Each file covers a different set of students, since some dropped out before the end of semester or didn't complete the continuous assessment or the exam. The exam and continuous assessment are each worth 50% of the final mark, and we want to create a new file which includes the names and marks for all students who completed the unit.
%% Cell type:code id: tags:
``` python
import pandas as pd
exams = pd.read_csv('ExamMark.csv')
ca = pd.read_csv('CAMark.csv')
studentnames = pd.read_csv('StudentNames.csv')
print(exams)
print(ca)
print(studentnames)
```
%%%% Output: stream
Student ID Exam Mark
0 7634 52
1 6789 73
2 9016 87
3 9532 81
4 8318 43
Student ID CA Mark
0 6789 62
1 7410 8
2 7634 34
3 9016 69
4 8318 65
ID Number Firstname Lastname
0 7634 James Brown
1 6789 Ella Fitzgerald
2 7410 Herbie Hancock
3 9016 Dolly Parton
4 9532 Keith Richards
5 2888 Thelonius Monk
%% Cell type:markdown id: tags:
The first problem we can see is that one student (ID 8318) has both an exam mark and a CA mark but is not listed in the student names. This is probably due to a late enrollment in the unit. Luckily we can look up these details and append them to the dataframe. The other problem is that we need the course code for each student, so we also need to add this to the dataframe. We will first examine how this can be done using `df.append()` and `pd.concat()`.
`df.append()` is used only for appending rows to a dataframe. The original dataframe and the new dataframe do not need to have the same columns. If they don't, the missing values are filled with NaN.
We set `ignore_index=True` here so that the new dataframe is given a fresh index. Otherwise the indices from the original dataframes are kept, and there is the possibility of several rows sharing the same index.
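As a hedged aside: in recent pandas releases (2.0 and later) `DataFrame.append()` is no longer available, and `pd.concat()` is the usual replacement. The sketch below uses two small made-up dataframes (`names` and `late` are illustrative, not the lesson's CSV files) to show what `ignore_index=True` changes.

```python
import pandas as pd

# Illustrative data only; these frames stand in for the lesson's CSV files.
names = pd.DataFrame({'ID Number': [7634, 6789],
                      'Firstname': ['James', 'Ella']})
late = pd.DataFrame({'ID Number': [8318],
                     'Firstname': ['Nina']})

# Without ignore_index the original row labels are kept, so label 0 appears twice.
kept = pd.concat([names, late])

# With ignore_index=True the rows are relabelled 0, 1, 2, ...
renumbered = pd.concat([names, late], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Duplicate labels are legal in pandas, but they make label-based selection such as `kept.loc[0]` return more than one row, which is usually not what is wanted.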
%% Cell type:code id: tags:
``` python
newrow = pd.DataFrame({'ID Number':[8318, 2718],
'Firstname':['Nina', 'Chick'],
'Lastname':['Simone', 'Corea']})
studentnames.append([newrow], ignore_index=True)
```
%%%% Output: execute_result
ID Number Firstname Lastname
0 7634 James Brown
1 6789 Ella Fitzgerald
2 7410 Herbie Hancock
3 9016 Dolly Parton
4 9532 Keith Richards
5 2888 Thelonius Monk
6 8318 Nina Simone
7 2718 Chick Corea
%% Cell type:markdown id: tags:
`pd.concat()` joins rows in the same way, but it can also join columns to form a new dataframe. The difference is that the dataframes to be combined must be passed as a list.
%% Cell type:code id: tags:
``` python
anotherrow = pd.DataFrame({'ID Number':[3141],'Firstname':['Dusty'],'Lastname':['Springfield']})
pd.concat([studentnames, newrow, anotherrow], ignore_index=True)
```
%%%% Output: execute_result
ID Number Firstname Lastname
0 7634 James Brown
1 6789 Ella Fitzgerald
2 7410 Herbie Hancock
3 9016 Dolly Parton
4 9532 Keith Richards
5 2888 Thelonius Monk
6 8318 Nina Simone
7 2718 Chick Corea
8 3141 Dusty Springfield
%% Cell type:markdown id: tags:
When combining the dataframes, we may want to label each row with a key that identifies the dataframe it came from. For example, if one set of student records corresponds to early enrollments and the other to late enrollments, we can record this by passing `keys`. In this case we can keep the indices of the original dataframes, since the key is added as an extra outer level of the index.
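A minimal sketch of how the keys can be used afterwards, assuming two hypothetical enrollment lists (`early` and `late` are made up for illustration): the keys form the outer level of a two-level row index, so one source dataframe can be recovered with `.loc`.

```python
import pandas as pd

# Illustrative data only (hypothetical early and late enrollment lists).
early = pd.DataFrame({'ID Number': [7634, 6789],
                      'Lastname': ['Brown', 'Fitzgerald']})
late = pd.DataFrame({'ID Number': [8318],
                     'Lastname': ['Simone']})

combined = pd.concat([early, late], keys=['Early', 'Late'])

# The keys become the outer level of a two-level row index.
print(combined.index.nlevels)

# Selecting on the outer level recovers the rows from one source dataframe.
late_rows = combined.loc['Late']
print(late_rows)
```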
%% Cell type:code id: tags:
``` python
pd.concat([studentnames, newrow], keys=['Early', 'Late'])
```
%%%% Output: execute_result
ID Number Firstname Lastname
Early 0 7634 James Brown
1 6789 Ella Fitzgerald
2 7410 Herbie Hancock
3 9016 Dolly Parton
4 9532 Keith Richards
5 2888 Thelonius Monk
Late 0 8318 Nina Simone
1 2718 Chick Corea
%% Cell type:markdown id: tags:
`pd.concat` also allows dataframes to have columns added by specifying `axis=1`. In this case the new dataframe is created so that rows with the same indices in the two original dataframes are used to create the new row. For example, consider concatenation of the `exams` and `ca` dataframes based on the indices.
%% Cell type:code id: tags:
``` python
print(exams)
print(ca)
pd.concat([exams,ca], axis=1)
```
%%%% Output: stream
Student ID Exam Mark
0 7634 52
1 6789 73
2 9016 87
3 9532 81
4 8318 43
Student ID CA Mark
0 6789 62
1 7410 8
2 7634 34
3 9016 69
4 8318 65
%%%% Output: execute_result
Student ID Exam Mark Student ID CA Mark
0 7634 52 6789 62
1 6789 73 7410 8
2 9016 87 7634 34
3 9532 81 9016 69
4 8318 43 8318 65
%% Cell type:markdown id: tags:
We now have two columns named `Student ID`. To concatenate the two dataframes correctly we need to set `Student ID` (which is a unique value in each row) as the index in each dataframe, and then concatenate based on these keys. This performs an 'outer join', which is equivalent to specifying `join='outer'`, and creates an entry for every index that occurs in either dataframe. The values that are missing are then filled with NaN.
%% Cell type:code id: tags:
``` python
newexams = exams.set_index('Student ID')
newca = ca.set_index('Student ID')
pd.concat([newexams, newca], axis=1)
```
%%%% Output: execute_result
Exam Mark CA Mark
Student ID
6789 73.0 62.0
7410 NaN 8.0
7634 52.0 34.0
8318 43.0 65.0
9016 87.0 69.0
9532 81.0 NaN
%% Cell type:markdown id: tags:
The alternative is to perform an 'inner join' by specifying `join='inner'` and then rows are only created if the indices occur in both dataframes.
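The two join types can be compared side by side. This is a sketch on made-up marks (`exam_demo` and `ca_demo` are illustrative names and values, already indexed by student ID), not the lesson's files.

```python
import pandas as pd

# Illustrative marks indexed by student ID (values are made up for this sketch).
exam_demo = pd.DataFrame({'Exam Mark': [52, 73, 81]}, index=[7634, 6789, 9532])
ca_demo = pd.DataFrame({'CA Mark': [62, 8, 34]}, index=[6789, 7410, 7634])

outer = pd.concat([exam_demo, ca_demo], axis=1)                # default join='outer'
inner = pd.concat([exam_demo, ca_demo], axis=1, join='inner')

# Outer join: every ID that appears in either dataframe.
print(sorted(outer.index))
# Inner join: only the IDs that appear in both dataframes.
print(sorted(inner.index))
```

The outer result contains NaN wherever a student is missing from one of the inputs, while the inner result has no missing values because such students are dropped.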
%% Cell type:code id: tags:
``` python
pd.concat([newexams,newca], axis=1, join='inner')
```
%%%% Output: execute_result
Exam Mark CA Mark
Student ID
7634 52 34
6789 73 62
9016 87 69
8318 43 65
%% Cell type:markdown id: tags:
Now we will update `studentnames` so that it includes only the missing row that we actually need.
%% Cell type:code id: tags:
``` python
newrow = pd.DataFrame({'ID Number': [8318],
'Firstname': ['Nina'],
'Lastname': ['Simone']})
studentnames = pd.concat([studentnames,newrow])
studentnames
```
%%%% Output: stream
ID Number Firstname Lastname
0 7634 James Brown
1 6789 Ella Fitzgerald
2 7410 Herbie Hancock
3 9016 Dolly Parton
4 9532 Keith Richards
5 2888 Thelonius Monk
0 8318 Nina Simone
%% Cell type:markdown id: tags:
We now need to add course codes to the dataframe. First we create a dataframe which has the course codes for each of our students.
%% Cell type:code id: tags:
``` python
coursecodes = pd.DataFrame([[9532,1],
[6789,1],
[2888,1],
[8318,2],
[7410,2],
[9016,3],
[7634,3]],
columns=['ID Number','Course Code'])
coursecodes
```
%%%% Output: execute_result
ID Number Course Code
0 9532 1
1 6789 1
2 2888 1
3 8318 2
4 7410 2
5 9016 3
6 7634 3
%% Cell type:markdown id: tags:
`pd.merge()` can be used to add the course codes to our student list. Note that `studentnames` and `coursecodes` share the column 'ID Number'. If we merge them without specifying a key, `pd.merge()` joins on this common column by default.
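To make the default behaviour concrete, the sketch below merges two small made-up dataframes (`names` and `codes` are illustrative) once without a key and once with `on='ID Number'`, and checks that the results agree.

```python
import pandas as pd

# Illustrative data only; the column names mirror the lesson's dataframes.
names = pd.DataFrame({'ID Number': [7634, 6789, 7410],
                      'Lastname': ['Brown', 'Fitzgerald', 'Hancock']})
codes = pd.DataFrame({'ID Number': [6789, 7410, 7634],
                      'Course Code': [1, 2, 3]})

# With no `on` argument, pd.merge joins on the columns the two frames share.
implicit = pd.merge(names, codes)

# Naming the key column explicitly gives the same result and documents the intent.
explicit = pd.merge(names, codes, on='ID Number')

print(implicit)
print(implicit.equals(explicit))
```

Passing `on` explicitly is a reasonable habit: if the two dataframes later gain another column with a shared name, the default would silently start joining on both columns.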
%% Cell type:code id: tags:
``` python
studentnames = pd.merge(studentnames, coursecodes)
studentnames
```
%%%% Output: execute_result
ID Number Firstname Lastname Course Code
0 7634 James Brown 3
1 6789 Ella Fitzgerald 1
2 7410 Herbie Hancock 2
3 9016 Dolly Parton 3
4 9532 Keith Richards 1
5 2888 Thelonius Monk 1
6 8318 Nina Simone 2
%% Cell type:markdown id: tags:
We now want to merge the exam marks and the CA marks, which share the column 'Student ID'. There are four ways we can do this, selected with the keyword `how`. A 'left' merge keeps every row of the left dataframe; where the right dataframe has no matching row, the missing entries are filled with NaN. A 'right' merge does the same, but keeps every row of the right dataframe. An 'inner' merge is based on the intersection of the two dataframes, i.e., the keys common to both. An 'outer' merge is based on the union of the two dataframes, i.e., keys that are in either or both dataframes. The default is the inner merge. The corresponding results for each merge of the marks are shown below.
Since the only column common to both dataframes is 'Student ID', the merges will be based on it. It doesn't need to be specified here, but we include it for demonstration. Here the left dataframe is `exams`, and the right dataframe is `ca`.
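As a rough check of the four options, the sketch below counts the rows each merge produces on made-up marks (`exam_demo` and `ca_demo` are illustrative, not the lesson's files). It also uses the `indicator=True` option of `pd.merge()`, which is not part of the lesson, to show which side each row of the outer merge came from.

```python
import pandas as pd

# Illustrative marks (made up); 9532 has no CA mark and 7410 has no exam mark.
exam_demo = pd.DataFrame({'Student ID': [7634, 6789, 9532],
                          'Exam Mark': [52, 73, 81]})
ca_demo = pd.DataFrame({'Student ID': [6789, 7410, 7634],
                        'CA Mark': [62, 8, 34]})

# Count the rows produced by each kind of merge.
sizes = {how: len(pd.merge(exam_demo, ca_demo, on='Student ID', how=how))
         for how in ['left', 'right', 'inner', 'outer']}
print(sizes)

# indicator=True adds a '_merge' column recording where each row came from.
outer = pd.merge(exam_demo, ca_demo, on='Student ID', how='outer', indicator=True)
print(outer[['Student ID', '_merge']])
```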
%% Cell type:code id: tags:
``` python
# left merge
pd.merge(exams,ca,on='Student ID',how='left')