Pandas — Merge Datasets#
In this lesson, we’re going to demonstrate how to merge two datasets with Pandas. We’re going to walk through a real-world example of how I merged two different datasets published by The Pudding to create the CSV file that we used in the previous lesson.
Dataset#
The Pudding’s Film Dialogue Data#
When Hannah Andersen and Matt Daniels published the data behind their Pudding essay about film dialogue, they published this data as a few different CSV files. But I wanted to combine them into a single CSV file.
Import Pandas
To use the Pandas library, we first need to import
it.
import pandas as pd
The above import
statement not only imports the Pandas library but also gives it an alias or nickname — pd
. This alias will save us from having to type out the entire words pandas
each time we need to use it. Many Python libraries have commonly used aliases like pd
.
Set Display Settings
By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.
pd.options.display.max_rows = 100
Read in CSV File
To read in a CSV file, we will use the function pd.read_csv()
and insert the name of our desired file path.
Hannah Andersen and Matt Daniels published one CSV file called “meta_data7.csv” that contains, among things, the title of each movie, the year of its release, and its box office gross.
metadata = pd.read_csv('../data/Pudding/meta_data7.csv', delimiter=',', encoding='utf-8')
metadata
script_id | imdb_id | title | year | gross | lines_data | |
---|---|---|---|---|---|---|
0 | 1534 | tt1022603 | (500) Days of Summer | 2009 | 37.0 | 7435445256774780000000000000000000000000000000... |
1 | 1512 | tt0147800 | 10 Things I Hate About You | 1999 | 65.0 | 1777752320274530000000000000000000000000000000... |
2 | 1514 | tt0417385 | 12 and Holding | 2005 | NaN | 5461357777754220000000000000000000000000000000... |
3 | 1517 | tt2024544 | 12 Years a Slave | 2013 | 60.0 | 4567334777777780000000000000000000000000000000... |
4 | 1520 | tt1542344 | 127 Hours | 2010 | 20.0 | 453513352345766000000000000000 |
... | ... | ... | ... | ... | ... | ... |
1995 | 3765 | tt0403702 | Youth in Revolt | 2009 | 17.0 | 7766777656545340000000000000000000000000000000... |
1996 | 3766 | tt1790885 | Zero Dark Thirty | 2012 | 104.0 | 5677677556654470000000000000000000000000000000... |
1997 | 8158 | tt0120906 | Zero Effect | 1998 | 3.0 | 4777774477777650000000000000000000000000000000... |
1998 | 3768 | tt0421090 | Zerophilia | 2005 | NaN | 7647774446763350000000000000000000000000000000... |
1999 | 6491 | tt0443706 | Zodiac | 2007 | 41.0 | 2447767766763570000000000000000000000000000000... |
2000 rows × 6 columns
We’re going to drop the column “lines_data”, which contains information about when during the film each character speaks.
metadata = metadata.drop(columns='lines_data')
They published another CSV file called “character_list5.csv” that contains, among other things, the name, gender, and age of each character as well as the number of words the character speaks.
characters = pd.read_csv('../data/Pudding/character_list5.csv', delimiter=',',encoding='utf-8')
characters
script_id | imdb_character_name | words | gender | age | |
---|---|---|---|---|---|
0 | 280 | betty | 311 | f | 35.0 |
1 | 280 | carolyn johnson | 873 | f | NaN |
2 | 280 | eleanor | 138 | f | NaN |
3 | 280 | francesca johns | 2251 | f | 46.0 |
4 | 280 | madge | 190 | f | 46.0 |
... | ... | ... | ... | ... | ... |
23043 | 9254 | lumiere | 1063 | m | 56.0 |
23044 | 9254 | maurice | 1107 | m | 71.0 |
23045 | 9254 | monsieur d'arqu | 114 | m | 58.0 |
23046 | 9254 | mrs. potts | 564 | f | 66.0 |
23047 | 9254 | wardrobe | 121 | f | 54.0 |
23048 rows × 5 columns
As you can see, the characters DataFrame doesn’t include the actual title of the movie in which the character appears or the movie’s release year or box office gross. And the metadata datafram doesn’t contain any information about the characters. We want that info all in one place. So how can we combine all of this data together?
Merge Datasets#
If you look closely, there’s one column that both datasets share in common: “script_id”. If two datasets share at least one column in common, we can merge them together based on this column.
We can use the pd.merge()
function and type in the name of the first dataframe, the name of the second dataframe, and the shared column to be merged on.
pd.merge(characters, metadata, on='script_id')
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | lines_data | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 280 | betty | 311 | f | 35.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 4332023434343450000000000000000000000000000000... |
1 | 280 | carolyn johnson | 873 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 4332023434343450000000000000000000000000000000... |
2 | 280 | eleanor | 138 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 4332023434343450000000000000000000000000000000... |
3 | 280 | francesca johns | 2251 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 4332023434343450000000000000000000000000000000... |
4 | 280 | madge | 190 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 4332023434343450000000000000000000000000000000... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23043 | 9254 | lumiere | 1063 | m | 56.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 3245753334377770000000000000000000000000000000... |
23044 | 9254 | maurice | 1107 | m | 71.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 3245753334377770000000000000000000000000000000... |
23045 | 9254 | monsieur d'arqu | 114 | m | 58.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 3245753334377770000000000000000000000000000000... |
23046 | 9254 | mrs. potts | 564 | f | 66.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 3245753334377770000000000000000000000000000000... |
23047 | 9254 | wardrobe | 121 | f | 54.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 3245753334377770000000000000000000000000000000... |
23048 rows × 10 columns
Now we have a combined DataFrame with character and film information.
merged_movie_character = pd.merge(characters, metadata, on='script_id')
Calculate Dialogue Proportions#
We’re going to add one more column to this dataset before the next lesson. We’re going to calculate the proportion of words spoken in each film by each character. To do so, we’re going to .groupby()
the movie’s title and calculate the sum total number of words spoken in each movie.
merged_movie_character.groupby('title')[['words']].sum()
words | |
---|---|
title | |
(500) Days of Summer | 18500 |
10 Things I Hate About You | 19680 |
12 Years a Slave | 19628 |
12 and Holding | 15968 |
127 Hours | 5145 |
... | ... |
Zero Effect | 13927 |
Zerophilia | 16686 |
Zodiac | 14656 |
eXistenZ | 9447 |
xXx | 8285 |
1994 rows × 1 columns
If we use the .transform()
, we can turn this groupby into a single column of data.
merged_movie_character.groupby(['title'])[['words']].transform(sum)
words | |
---|---|
0 | 6394 |
1 | 6394 |
2 | 6394 |
3 | 6394 |
4 | 6394 |
... | ... |
23043 | 10159 |
23044 | 10159 |
23045 | 10159 |
23046 | 10159 |
23047 | 10159 |
23048 rows × 1 columns
total_movie_words = merged_movie_character.groupby(['title'])[['words']].transform(sum)
Then we’re going to divide the total number of words spoken by each character by the total number of words spoken in each film.
total_character_words = merged_movie_character[['words']]
total_character_words / total_movie_words
words | |
---|---|
0 | 0.048639 |
1 | 0.136534 |
2 | 0.021583 |
3 | 0.352049 |
4 | 0.029715 |
... | ... |
23043 | 0.104636 |
23044 | 0.108967 |
23045 | 0.011222 |
23046 | 0.055517 |
23047 | 0.011911 |
23048 rows × 1 columns
dialogue_proportion = total_character_words / total_movie_words
Then we’re going to add it as a new column.
merged_movie_character['proportion_of_dialogue'] = dialogue_proportion
merged_movie_character
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 280 | betty | 311 | f | 35.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.048639 |
1 | 280 | carolyn johnson | 873 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.136534 |
2 | 280 | eleanor | 138 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.021583 |
3 | 280 | francesca johns | 2251 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.352049 |
4 | 280 | madge | 190 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.029715 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23043 | 9254 | lumiere | 1063 | m | 56.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.104636 |
23044 | 9254 | maurice | 1107 | m | 71.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.108967 |
23045 | 9254 | monsieur d'arqu | 114 | m | 58.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.011222 |
23046 | 9254 | mrs. potts | 564 | f | 66.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.055517 |
23047 | 9254 | wardrobe | 121 | f | 54.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.011911 |
23048 rows × 10 columns
Write to CSV File#
Finally, we’re going to output this merged and more comprehensive dataset to a CSV file by using the .to_csv()
method. We set the index
parameter to False
to remove the index column (the numbers in the left-most column).
merged_movie_character.to_csv('../data/Pudding/Merged-Pudding-Film-Dialogue.csv', encoding='utf-8', index=False)