Pandas — Merge Datasets#

In this lesson, we’re going to demonstrate how to merge two datasets with Pandas. We’re going to walk through a real-world example of how I merged two different datasets published by The Pudding to create the CSV file that we used in the previous lesson.


Dataset#

The Pudding’s Film Dialogue Data#

When Hannah Andersen and Matt Daniels published the data behind their Pudding essay about film dialogue, they published this data as a few different CSV files. But I wanted to combine them into a single CSV file.


Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

The above import statement not only imports the Pandas library but also gives it an alias or nickname — pd. This alias will save us from having to type out the entire words pandas each time we need to use it. Many Python libraries have commonly used aliases like pd.

Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 100

Read in CSV File

To read in a CSV file, we will use the function pd.read_csv() and insert the name of our desired file path.

Hannah Andersen and Matt Daniels published one CSV file called “meta_data7.csv” that contains, among things, the title of each movie, the year of its release, and its box office gross.

metadata = pd.read_csv('../data/Pudding/meta_data7.csv', delimiter=',', encoding='utf-8')
metadata
script_id imdb_id title year gross lines_data
0 1534 tt1022603 (500) Days of Summer 2009 37.0 7435445256774780000000000000000000000000000000...
1 1512 tt0147800 10 Things I Hate About You 1999 65.0 1777752320274530000000000000000000000000000000...
2 1514 tt0417385 12 and Holding 2005 NaN 5461357777754220000000000000000000000000000000...
3 1517 tt2024544 12 Years a Slave 2013 60.0 4567334777777780000000000000000000000000000000...
4 1520 tt1542344 127 Hours 2010 20.0 453513352345766000000000000000
... ... ... ... ... ... ...
1995 3765 tt0403702 Youth in Revolt 2009 17.0 7766777656545340000000000000000000000000000000...
1996 3766 tt1790885 Zero Dark Thirty 2012 104.0 5677677556654470000000000000000000000000000000...
1997 8158 tt0120906 Zero Effect 1998 3.0 4777774477777650000000000000000000000000000000...
1998 3768 tt0421090 Zerophilia 2005 NaN 7647774446763350000000000000000000000000000000...
1999 6491 tt0443706 Zodiac 2007 41.0 2447767766763570000000000000000000000000000000...

2000 rows × 6 columns

We’re going to drop the column “lines_data”, which contains information about when during the film each character speaks.

metadata = metadata.drop(columns='lines_data')

They published another CSV file called “character_list5.csv” that contains, among other things, the name, gender, and age of each character as well as the number of words the character speaks.

characters = pd.read_csv('../data/Pudding/character_list5.csv', delimiter=',',encoding='utf-8')
characters
script_id imdb_character_name words gender age
0 280 betty 311 f 35.0
1 280 carolyn johnson 873 f NaN
2 280 eleanor 138 f NaN
3 280 francesca johns 2251 f 46.0
4 280 madge 190 f 46.0
... ... ... ... ... ...
23043 9254 lumiere 1063 m 56.0
23044 9254 maurice 1107 m 71.0
23045 9254 monsieur d'arqu 114 m 58.0
23046 9254 mrs. potts 564 f 66.0
23047 9254 wardrobe 121 f 54.0

23048 rows × 5 columns

As you can see, the characters DataFrame doesn’t include the actual title of the movie in which the character appears or the movie’s release year or box office gross. And the metadata datafram doesn’t contain any information about the characters. We want that info all in one place. So how can we combine all of this data together?

Merge Datasets#

If you look closely, there’s one column that both datasets share in common: “script_id”. If two datasets share at least one column in common, we can merge them together based on this column.

We can use the pd.merge() function and type in the name of the first dataframe, the name of the second dataframe, and the shared column to be merged on.

pd.merge(characters, metadata, on='script_id')
script_id imdb_character_name words gender age imdb_id title year gross lines_data
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 4332023434343450000000000000000000000000000000...
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 4332023434343450000000000000000000000000000000...
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 4332023434343450000000000000000000000000000000...
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 4332023434343450000000000000000000000000000000...
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 4332023434343450000000000000000000000000000000...
... ... ... ... ... ... ... ... ... ... ...
23043 9254 lumiere 1063 m 56.0 tt0101414 Beauty and the Beast 1991 452.0 3245753334377770000000000000000000000000000000...
23044 9254 maurice 1107 m 71.0 tt0101414 Beauty and the Beast 1991 452.0 3245753334377770000000000000000000000000000000...
23045 9254 monsieur d'arqu 114 m 58.0 tt0101414 Beauty and the Beast 1991 452.0 3245753334377770000000000000000000000000000000...
23046 9254 mrs. potts 564 f 66.0 tt0101414 Beauty and the Beast 1991 452.0 3245753334377770000000000000000000000000000000...
23047 9254 wardrobe 121 f 54.0 tt0101414 Beauty and the Beast 1991 452.0 3245753334377770000000000000000000000000000000...

23048 rows × 10 columns

Now we have a combined DataFrame with character and film information.

merged_movie_character = pd.merge(characters, metadata, on='script_id')

Calculate Dialogue Proportions#

We’re going to add one more column to this dataset before the next lesson. We’re going to calculate the proportion of words spoken in each film by each character. To do so, we’re going to .groupby() the movie’s title and calculate the sum total number of words spoken in each movie.

merged_movie_character.groupby('title')[['words']].sum()
words
title
(500) Days of Summer 18500
10 Things I Hate About You 19680
12 Years a Slave 19628
12 and Holding 15968
127 Hours 5145
... ...
Zero Effect 13927
Zerophilia 16686
Zodiac 14656
eXistenZ 9447
xXx 8285

1994 rows × 1 columns

If we use the .transform(), we can turn this groupby into a single column of data.

merged_movie_character.groupby(['title'])[['words']].transform(sum)
words
0 6394
1 6394
2 6394
3 6394
4 6394
... ...
23043 10159
23044 10159
23045 10159
23046 10159
23047 10159

23048 rows × 1 columns

total_movie_words = merged_movie_character.groupby(['title'])[['words']].transform(sum)

Then we’re going to divide the total number of words spoken by each character by the total number of words spoken in each film.

total_character_words = merged_movie_character[['words']]
total_character_words / total_movie_words
words
0 0.048639
1 0.136534
2 0.021583
3 0.352049
4 0.029715
... ...
23043 0.104636
23044 0.108967
23045 0.011222
23046 0.055517
23047 0.011911

23048 rows × 1 columns

dialogue_proportion = total_character_words / total_movie_words

Then we’re going to add it as a new column.

merged_movie_character['proportion_of_dialogue'] = dialogue_proportion
merged_movie_character
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715
... ... ... ... ... ... ... ... ... ... ...
23043 9254 lumiere 1063 m 56.0 tt0101414 Beauty and the Beast 1991 452.0 0.104636
23044 9254 maurice 1107 m 71.0 tt0101414 Beauty and the Beast 1991 452.0 0.108967
23045 9254 monsieur d'arqu 114 m 58.0 tt0101414 Beauty and the Beast 1991 452.0 0.011222
23046 9254 mrs. potts 564 f 66.0 tt0101414 Beauty and the Beast 1991 452.0 0.055517
23047 9254 wardrobe 121 f 54.0 tt0101414 Beauty and the Beast 1991 452.0 0.011911

23048 rows × 10 columns

Write to CSV File#

Finally, we’re going to output this merged and more comprehensive dataset to a CSV file by using the .to_csv() method. We set the index parameter to False to remove the index column (the numbers in the left-most column).

merged_movie_character.to_csv('../data/Pudding/Merged-Pudding-Film-Dialogue.csv', encoding='utf-8', index=False)