Pandas Basics — Part 3

In this lesson, we’re going to introduce more fundamentals of Pandas, a powerful Python library for working with tabular data like CSV files.

We will review skills learned from the last two lessons and introduce how to:

  • Check for duplicate data

  • Clean and transform data

  • Manipulate string data

  • Apply functions

  • Reset index

  • Bonus! Create an interactive data viz


Dataset

The Pudding’s Film Dialogue Data

Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles. But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?

—Hannah Anderson and Matt Daniels, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”

The dataset that we’re working with in this lesson is taken from Hannah Andersen and Matt Daniels’s Pudding essay, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”. The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how “white men dominate movie roles.”

Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They claim, in fact, that one of the most frequently asked questions about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don’t. To determine character gender, they used actors’ IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”

As we work with this data, we want to be critical and cognizant of this approach to gender. How does such a binary understanding of gender, gleaned from the IMDB pages of actors, influence our later results and conclusions? What do we gain by using such an approach, and what do we lose? How else might we have encoded or determined gender for the same data?


Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

The above import statement not only imports the Pandas library but also gives it an alias or nickname — pd. This alias will save us from having to type out the entire words pandas each time we need to use it. Many Python libraries have commonly used aliases like pd.

Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 100

Read in CSV File

To read in a CSV file, we will use the function pd.read_csv() and insert the name of our desired file path.

film_df = pd.read_csv('../data/Pudding/Pudding-Film-Dialogue-Salty.csv', delimiter=",", encoding='utf-8')

When reading in the CSV file, we also specified the encoding and delimiter. The delimiter parameter specifies the character that separates or “delimits” the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for Comma Separated Values.) Sometimes, however, the delimiter of a CSV file might be a tab (/t) or, more rarely, another character.

Display Data

We can display a DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.

film_df
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715
... ... ... ... ... ... ... ... ... ... ...
23047 9254 maurice 1107 m 71.0 tt0101414 Beauty and the Beast 1991 452.0 0.108967
23048 9254 monsieur d'arqu 114 m 58.0 tt0101414 Beauty and the Beast 1991 452.0 0.011222
23049 9254 mrs. potts 564 f 66.0 tt0101414 Beauty and the Beast 1991 452.0 0.055517
23050 9254 wardrobe 121 f 54.0 tt0101414 Beauty and the Beast 1991 452.0 0.011911
23051 9254 mrs. potts 564 f 66.0 tt0101414 Beauty and the Beast 1991 452.0 0.055517

23052 rows × 10 columns

There are a few important things to note about the DataFrame displayed here:

  • Index

    • The bolded ascending numbers in the very left-hand column of the DataFrame is called the Pandas Index. You can select rows based on the Index.

    • By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

  • Truncation

    • The DataFrame is truncated, signaled by the ellipses in the middle ... of every column.

    • The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.

  • Rows x Columns

    • Pandas reports how many rows and columns are in this dataset at the bottom of the output (23,052 x 10 columns).

Display First n Rows

To look at the first n rows in a DataFrame, we can use a method called .head().

film_df.head(10)
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715
5 280 michael johnson 723 m 38.0 tt0112579 The Bridges of Madison County 1995 142.0 0.113075
6 280 robert kincaid 1908 m 65.0 tt0112579 The Bridges of Madison County 1995 142.0 0.298405
7 623 bobby korfin 328 m NaN tt0179626 15 Minutes 2001 37.0 0.036012
8 623 daphne handlova 409 f 28.0 tt0179626 15 Minutes 2001 37.0 0.044906
9 623 deputy chief fi 347 m NaN tt0179626 15 Minutes 2001 37.0 0.038098

Display Random Sample

To look at a random sample of rows, we can use the .sample() method.

film_df.sample(10)
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
5457 2011 teresa 198 f NaN tt0095765 Cinema Paradiso 1988 NaN 0.009360
17698 6518 father fonescu 313 m 53.0 tt0085780 The Keep 1983 NaN 0.080442
20073 7740 delta 294 m NaN tt3397884 Sicario 2015 48.0 0.042646
15873 5380 big brother 122 m NaN tt1365050 Beasts of No Nation 2015 NaN 0.018093
20804 8021 eugene moloney 174 m NaN tt1540133 The Guard 2011 5.0 0.017896
22244 8683 marlon 174 m 31.0 tt0375173 Alfie 2004 18.0 0.024227
18137 6798 mr. page 408 m 45.0 tt0091306 Jumpin' Jack Flash 1986 60.0 0.060435
13429 4422 aubrey daniel 254 m 50.0 tt1130080 The Informant! 2009 38.0 0.020721
6836 2368 charlie andrews 541 m 33.0 tt0083987 Gandhi 1982 NaN 0.038615
3754 1547 anchorman 164 m NaN tt0096754 The Abyss 1989 118.0 0.010511

Examine Data

Shape

To explicitly check for how many rows vs columns make up a dataset, we can use the .shape method.

film_df.shape
(23052, 10)

There are 23,052 rows and 10 columns.

Data Types

Just like Python has different data types, Pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the .dtypes method.

Pandas Data Type

Explanation

object

string

float64

float

int64

integer

datetime64

date time

film_df.dtypes
script_id                   int64
imdb_character_name        object
words                       int64
gender                     object
age                       float64
imdb_id                    object
title                      object
year                        int64
gross                     float64
proportion_of_dialogue    float64
dtype: object

It’s important to always check the data types in your DataFrame. For example, sometimes numeric values will accidentally be interpreted as a string object. To perform calculations on this data, you would need to first convert that column from a string to an integer.

Columns

We can also check the column names of the DataFrame with .columns

film_df.columns
Index(['script_id', 'imdb_character_name', 'words', 'gender', 'age', 'imdb_id',
       'title', 'year', 'gross', 'proportion_of_dialogue'],
      dtype='object')

Summary Statistics

film_df.describe(include='all')
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
count 23052.000000 23050 23052.000000 23052 18266.000000 23052 23052 23052.000000 19390.000000 23052.000000
unique NaN 17613 NaN 3 NaN 2000 1994 NaN NaN NaN
top NaN doctor NaN m NaN tt0116905 Lone Star NaN NaN NaN
freq NaN 36 NaN 16135 NaN 40 40 NaN NaN NaN
mean 4195.580904 NaN 907.870684 NaN 42.386839 NaN NaN 1998.133828 106.805570 0.086512
std 2473.518048 NaN 1399.538583 NaN 59.714730 NaN NaN 14.745342 145.992394 0.107740
min 280.000000 NaN 101.000000 NaN 3.000000 NaN NaN 1929.000000 0.000000 0.001537
25% 2095.000000 NaN 193.000000 NaN 30.000000 NaN NaN 1992.000000 22.000000 0.019771
50% 3694.000000 NaN 396.000000 NaN 39.000000 NaN NaN 2001.000000 56.000000 0.042421
75% 6234.000000 NaN 980.000000 NaN 50.000000 NaN NaN 2009.000000 136.000000 0.104166
max 9254.000000 NaN 28102.000000 NaN 2013.000000 NaN NaN 2015.000000 1798.000000 0.923422

Do you notice any outliers, anomalies, or potential problems here?

The maximum value in the “age” column is 2013! That seems like an error.

film_df[film_df['age'] == 2013]
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
11639 3737 lucas solomon 190 m 2013.0 tt0993846 The Wolf of Wall Street 2013 125.0 0.011077

Let’s drop this row from the dataset by using the .drop() method and the Index number of the row.

film_df = film_df.drop(11639)

Now if we look for it again, that row is gone.

film_df[film_df['age'] == 2013]
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue

Rename Columns

film_df = film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'})
film_df.head()
script_id character words gender age imdb_id title release_year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715

Drop Columns

film_df = film_df.drop(columns='imdb_id')

Missing Data

.isna() / .notna()

Pandas has special ways of dealing with missing data. As you may have already noticed, blank rows in a CSV file show up as NaN in a Pandas DataFrame.

To filter and count the number of missing/not missing values in a dataset, we can use the special .isna() and .notna() methods on a DataFrame or Series object.

The .isna() and .notna() methods return True/False pairs for each row, which we can use to filter the DataFrame for any rows that have information in a given column.

film_df[film_df['character'].isna()]
script_id character words gender age title release_year gross proportion_of_dialogue
4656 1807 NaN 146 f NaN Beloved 1998 42.0 0.005162
19448 7445 NaN 520 f NaN House of Games 1987 5.0 0.165184

This is important information for the sake of better understanding our dataset. But it’s also important because NaN values are treated as floats, not strings. If we tried to manipulate this column as text data, we would get an error. For this reason, we’re going to replace or “fill” these NaN values with the string “No Character Data” by using the .fillna() method.

film_df['character'] = film_df['character'].fillna('No Character Data')
film_df[film_df['character'].isna()]
script_id character words gender age title release_year gross proportion_of_dialogue

Check for Duplicates

Duplicates

We can check for duplicate rows by using the .duplicated() method and setting the parameter keep=False, which will display all the duplicated values in the dataset — rather than just the first duplicated value keep='first' or the last duplicated value keep='last'.

film_df.duplicated(keep=False)
0        False
1        False
2        False
3        False
4        False
         ...  
23047    False
23048    False
23049     True
23050    False
23051     True
Length: 23051, dtype: bool

The output above is reporting whether each row in the dataset is a duplicate. We can use the .duplicated() method inside a filter to isolate only the rows in the dataframe that are exact duplicates.

film_df[film_df.duplicated(keep=False)]
script_id character words gender age title release_year gross proportion_of_dialogue
21001 8099 c-3po 138 m 69.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.020702
21002 8099 c-3po 138 m 69.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.020702
21963 8560 kilgore trout 2673 m 63.0 Breakfast of Champions 1999 NaN 0.197211
21964 8560 kilgore trout 2673 m 63.0 Breakfast of Champions 1999 NaN 0.197211
22935 9222 groomsman #5 238 m NaN Wedding Crashers 2005 283.0 0.012032
22936 9222 groomsman #5 238 m NaN Wedding Crashers 2005 283.0 0.012032
23049 9254 mrs. potts 564 f 66.0 Beauty and the Beast 1991 452.0 0.055517
23051 9254 mrs. potts 564 f 66.0 Beauty and the Beast 1991 452.0 0.055517

We can drop duplicates from the DataFrame with the .drop_duplicates() method and choose to keep the first instance of the duplicate or the last instance.

film_df = film_df.drop_duplicates(keep='first')

Now if we check the data for duplicates again, they should be all gone.

film_df[film_df.duplicated(keep=False)]
script_id character words gender age title release_year gross proportion_of_dialogue

Clean and Transform Data

Pandas .str Methods

Remember all the special things that you can do with Python strings aka string methods?

Pandas has special Pandas string methods, too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add .str to the method chain.

Pandas String Method

Explanation

df[‘column_name’].str.lower()

makes the string in each row lowercase

df[‘column_name’].str.upper()

makes the string in each row uppercase

df[‘column_name’].str.title()

makes the string in each row titlecase

df[‘column_name’].str.replace('old string', 'new string')

replaces old string with new string for each row

df[‘column_name’].str.contains('some string')

tests whether string in each row contains “some string”

df[‘column_name’].str.split('delim')

returns a list of substrings separated by the given delimiter

df[‘column_name’].str.join(list)

opposite of split(), joins the elements in the given list together using the string

For example, to transform every character’s name in the “character” column from lowercase to uppercase, we can use .str.upper()

film_df['character'].str.upper()
0                  BETTY
1        CAROLYN JOHNSON
2                ELEANOR
3        FRANCESCA JOHNS
4                  MADGE
              ...       
23046            LUMIERE
23047            MAURICE
23048    MONSIEUR D'ARQU
23049         MRS. POTTS
23050           WARDROBE
Name: character, Length: 23047, dtype: object

To transform every character’s name in the “character” column to lowercase, we can use .str.lower()

film_df['character'].str.lower()
0                  betty
1        carolyn johnson
2                eleanor
3        francesca johns
4                  madge
              ...       
23046            lumiere
23047            maurice
23048    monsieur d'arqu
23049         mrs. potts
23050           wardrobe
Name: character, Length: 23047, dtype: object

If we want to replace the gender columns’s single letter abbreviation for “male” / “female” (sex) with “man” / “woman” (gender identity), we could use the .str.replace() method.

film_df['gender'] = film_df['gender'].str.replace('m', 'man')
film_df['gender'] = film_df['gender'].str.replace('f', 'woman')
film_df.sample(10)
script_id character words gender age title release_year gross proportion_of_dialogue
17657 6512 Megan 612 woman NaN Street Kings 2008 31.0 0.063459
15827 5362 Ginny 113 woman 36.0 Working Girl 1988 135.0 0.011978
3976 1634 Loni Packwood 107 man NaN American Outlaws 2001 20.0 0.009428
8193 2733 Natalie Calimer 139 woman 25.0 Larry Crowne 2011 39.0 0.010489
5046 1909 Wendell Matheso 284 man NaN The Box 2009 17.0 0.020325
10130 3285 Christine Hamil 184 woman 32.0 Scream 3 2000 143.0 0.011923
2593 1273 Mrs. Parks 143 woman 75.0 Repo Man 1984 NaN 0.022670
8384 2777 Clerk 208 man NaN Lincoln 2012 198.0 0.015018
18681 7130 Dottie Smith 1197 woman 22.0 Killer Joe 2011 2.0 0.132221
11727 3761 Bob Steegerson 780 man 39.0 You Can Count on Me 2000 14.0 0.034535

We can use the .str.contains() to search for particular words or phrases in a column, such as “Star Wars.”

film_df[film_df['title'].str.contains('Star Wars')]
script_id character words gender age title release_year gross proportion_of_dialogue
3017 1359 ADMIRAL ACKBAR 199 man 61.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.039096
3018 1359 BEN 'OBI-WAN' K 462 man 69.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.090766
3019 1359 C-3PO 881 man 37.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.173084
3020 1359 DARTH VADER 381 man 48.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.074853
3021 1359 HAN SOLO 835 man 41.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.164047
3022 1359 LANDO CALRISSIA 379 man 46.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.074460
3023 1359 LUKE SKYWALKER 915 man 32.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.179764
3024 1359 PRINCESS LEIA 359 woman 27.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.070530
3025 1359 THE EMPEROR 516 man 39.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.101375
3026 1359 YODA 163 man 19.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.032024
3027 1360 ANAKIN SKYWALKE 2061 man 21.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.274251
3028 1360 CLIEGG LARS 152 man 62.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.020226
3029 1360 COUNT DOOKU 321 man 80.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.042715
3030 1360 DEXTER JETTSTER 151 man 67.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.020093
3031 1360 JAR JAR BINKS 294 man 29.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.039122
3032 1360 LAMA SU 350 man 48.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.046574
3033 1360 MACE WINDU 365 man 54.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.048570
3034 1360 MADAME JOCASTA 331 woman 82.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.044045
3035 1360 OBI-WAN KENOBI 1477 man 31.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.196540
3036 1360 PADME 668 woman 21.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.088889
3037 1360 QUEEN JAMILLIA 135 woman 25.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.017964
3038 1360 SUPREME CHANCEL 576 man 58.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.076647
3039 1360 TAUN WE 139 woman 40.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.018496
3040 1360 WATTO 140 man 49.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.018629
3041 1360 YODA 355 man 58.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.047239
3058 1362 BEN OBI-WAN KEN 995 man 63.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.099084
3059 1362 C-3PO 1506 man 31.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.149970
3060 1362 CMDR. PRAJI (IM 117 man 42.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.011651
3061 1362 DARTH VADER 492 man 46.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.048994
3062 1362 GENERAL DODONNA 185 man 66.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.018423
3063 1362 GRAND MOFF TARK 411 man 64.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.040928
3064 1362 GREEDO 103 man 33.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.010257
3065 1362 HAN SOLO 1730 man 35.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.172276
3066 1362 JABBA 185 man NaN Star Wars: Episode IV - A New Hope 1977 1798.0 0.018423
3067 1362 LUKE SKYWALKER 2485 man 26.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.247461
3068 1362 PRINCESS LEIA O 636 woman 21.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.063334
3069 1362 RED LEADER 241 man 37.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.023999
3070 1362 RED THREE (BIGG 523 man 38.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.052081
3071 1362 STORMTROOPER 122 man 36.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.012149
3072 1362 UNCLE OWEN 311 man 61.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.030970
10591 3427 ANAKIN SKYWALKE 3768 man 24.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.271138
10592 3427 C-3PO 312 man 59.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.022451
10593 3427 CLONE JEDI KNIG 104 man NaN Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.007484
10594 3427 DARTH SLDIOUS 244 man NaN Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.017558
10595 3427 GENERAL GRIEVOU 317 man 33.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.022811
10596 3427 KI-ADI-MUNDI / 107 man 40.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.007700
10597 3427 MACE WINDU 407 man 57.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.029287
10598 3427 MON MOTHMA 127 woman 28.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.009139
10599 3427 OBI-WAN KENOBI 3181 man 34.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.228898
10600 3427 PADME 1561 woman 24.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.112326
10601 3427 SENATOR BAIL OR 479 man 50.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.034468
10602 3427 SUPREME CHANCEL 2367 man 61.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.170325
10603 3427 YODA 923 man 61.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.066417
10604 3433 ADMIRAL PIETT 232 man 43.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.025127
10605 3433 BEN (OBI-WAN) K 177 man 66.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.019170
10606 3433 C-3PO 1474 man 34.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.159645
10607 3433 CAPTAIN NEEDA 108 man 42.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.011697
10608 3433 DARTH VADER 853 man 49.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.092386
10609 3433 GENERAL VEERS 114 man 45.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.012347
10610 3433 HAN SOLO 2024 man 38.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.219214
10611 3433 LANDO CALRISSIA 798 man 43.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.086429
10612 3433 LUKE SKYWALKER 1282 man 29.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.138850
10613 3433 PRINCESS LEIA 1025 woman 24.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.111015
10614 3433 REBEL FORCE GEN 168 man 50.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.018196
10615 3433 SNOW CREATURE 227 man NaN Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.024586
10616 3433 YODA 751 man 36.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.081339
10617 3437 ANAKIN SKYWALKE 1697 man 10.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.158037
10618 3437 BOSS NASS 209 man 63.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.019464
10619 3437 CAPTAIN PANAKA 390 man 45.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.036320
10620 3437 CHANCELLOR VALO 127 man 61.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.011827
10621 3437 DARTH SIDIOUS 277 man NaN Star Wars: Episode I - The Phantom Menace 1999 813.0 0.025796
10622 3437 JAR JAR BINKS 632 man 26.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.058856
10623 3437 MACE WINDU 247 man 51.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.023002
10624 3437 NUTE GUNRAY 544 man 34.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.050661
10625 3437 OBI-WAN KENOBI 701 man 28.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.065282
10626 3437 PADME 718 woman 18.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.066865
10627 3437 QUEEN AMIDALA 656 woman 18.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.061091
10628 3437 QUI-GON JINN 2339 man 47.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.217825
10629 3437 RIC OLI� 132 man 42.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.012293
10630 3437 RUNE HAAKO 126 man NaN Star Wars: Episode I - The Phantom Menace 1999 813.0 0.011734
10631 3437 SENATOR PALPATI 551 man 55.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.051313
10632 3437 SHMI SKYWALKER 414 woman 41.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.038555
10633 3437 SIO BIBBLE 164 man 60.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.015273
10634 3437 WATTO 502 man 46.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.046750
10635 3437 YODA 312 man 55.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.029056
21000 8099 BALA-TIK 140 man NaN Star Wars: Episode VII - The Force Awakens 2015 927.0 0.021002
21001 8099 C-3PO 138 man 69.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.020702
21003 8099 FINN 1447 man 23.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.217072
21004 8099 GENERAL HUX 356 man 32.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.053405
21005 8099 HAN SOLO 1225 man 73.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.183768
21006 8099 KYLO REN 618 man 32.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.092709
21007 8099 MAZ KANATA 241 woman 32.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.036154
21008 8099 POE DAMERON 604 man 36.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.090609
21009 8099 PRINCESS LEIA 252 woman 59.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.037804
21010 8099 REY 1363 woman 23.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.204470
21011 8099 STORMTROOPER 114 man NaN Star Wars: Episode VII - The Force Awakens 2015 927.0 0.017102
21012 8099 SUPREME LEADER 168 man 51.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.025203

We can use the .str.contains() to search for particular words or phrases in a column, such as “Mean Girls.”

film_df[film_df['title'].str.contains('Mean Girls')]
script_id character words gender age title release_year gross proportion_of_dialogue
13938 4605 AARON SAMUELS 426 man 23.0 Mean Girls 2004 120.0 0.053890
13939 4605 CADY HERON 2798 woman 18.0 Mean Girls 2004 120.0 0.353953
13940 4605 DAMIAN 624 man 26.0 Mean Girls 2004 120.0 0.078937
13941 4605 GRETCHEN WIENER 609 woman 22.0 Mean Girls 2004 120.0 0.077040
13942 4605 JANIS IAN 907 woman 22.0 Mean Girls 2004 120.0 0.114738
13943 4605 KAREN SMITH 301 woman 19.0 Mean Girls 2004 120.0 0.038077
13944 4605 MR. DUVALL 365 man 43.0 Mean Girls 2004 120.0 0.046173
13945 4605 MRS. GEORGE 125 woman 33.0 Mean Girls 2004 120.0 0.015813
13946 4605 MS. NORBURY 720 woman 34.0 Mean Girls 2004 120.0 0.091082
13947 4605 REGINA GEORGE 1030 woman 26.0 Mean Girls 2004 120.0 0.130297

Applying Functions

With the .apply() method, we can run a function on every single row in a Pandas column or dataframe.

def make_text_title_case(text):
    title_case_text = text.title()
    return title_case_text
make_text_title_case("betty")
'Betty'
film_df['character'].apply(make_text_title_case)
0                  Betty
1        Carolyn Johnson
2                Eleanor
3        Francesca Johns
4                  Madge
              ...       
23047            Maurice
23048    Monsieur D'Arqu
23049         Mrs. Potts
23050           Wardrobe
23051         Mrs. Potts
Name: character, Length: 23052, dtype: object
film_df['character'] = film_df['character'].apply(make_text_title_case)
film_df.sample(10)
script_id character words gender age title release_year gross proportion_of_dialogue
15499 5179 Jack Baer 641 man 37.0 The Usual Suspects 1995 46.0 0.054539
14997 5006 Mac 604 man NaN Spacejacked 1997 NaN 0.077855
6991 2405 Young Woman On 104 woman NaN Ghostbusters II 1989 245.0 0.004559
3569 1502 Miss Bishop 130 woman 48.0 42 2013 101.0 0.007753
6922 2384 Nicki 152 woman 33.0 Get Shorty 1995 144.0 0.006989
20636 7942 Er Doctor #1 123 man NaN Stranger Than Fiction 2006 53.0 0.012025
14856 4960 Hannah 371 woman 26.0 Silverado 1985 81.0 0.046661
8405 2779 Cecil Dobbs 249 man 66.0 The Lincoln Lawyer 2011 63.0 0.012808
13970 4608 Nobu 574 man 49.0 Memoirs of a Geisha 2005 77.0 0.097453
20046 7733 Kyle Reese 2194 man 29.0 Terminator Genisys 2015 92.0 0.255413

Filter DataFrame

We can filter the DataFrames for only characters who are men or women.

men_film_df = film_df[film_df['gender'] == 'man']
women_film_df = film_df[film_df['gender'] == 'woman']

Groupby

We can use the .groupby() function to group all the men characters in each film and sum up their total dialogue.

By adding a Python string slice, we can identify the top 20 films with the greatest proportion of men speaking.

men_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]
proportion_of_dialogue
title
Breakfast of Champions 1.070828
Kagemusha 1.000000
The Men Who Stare at Goats 1.000000
The Wild Bunch 1.000000
Killing Them Softly 1.000000
Fury 1.000000
The Hunt for Red October 1.000000
Stalag 17 1.000000
Die Hard: With a Vengeance 1.000000
There Will Be Blood 1.000000
Jeremiah Johnson 1.000000
Saving Private Ryan 1.000000
Crimson Tide 1.000000
The Revenant 1.000000
Platoon 1.000000
Duel 1.000000
Conquest of the Planet of the Apes 1.000000
The Shawshank Redemption 1.000000
Mala Noche 1.000000
Schindler's List 1.000000

We can use the .groupby() function to group all the women characters in each film and sum up their total dialogue.

By adding a Python string slice, we can identify the top 20 films with the greatest proportion of women speaking.

women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]
proportion_of_dialogue
title
Now and Then 1.000000
The Descent 1.000000
Precious 0.993541
Martyrs 0.965550
The Hand That Rocks the Cradle 0.933750
Agnes of God 0.922482
Heavenly Creatures 0.919368
The Help 0.916947
3 Women 0.899530
The Watermelon Woman 0.894676
Grandma 0.894137
The Craft 0.891866
Easy A 0.888131
The Others 0.879972
Black Christmas 0.877467
Wadjda 0.870567
For Colored Girls 0.870247
The Roommate 0.869879
Mrs. Winterbourne 0.864730
Practical Magic 0.856693

Reset Index

We can transform a Groupby object into a DataFrame with a regular Index by tacking on .reset_index().

women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False).reset_index()
title proportion_of_dialogue
0 Now and Then 1.000000
1 The Descent 1.000000
2 Precious 0.993541
3 Martyrs 0.965550
4 The Hand That Rocks the Cradle 0.933750
... ... ...
1935 The Last Castle 0.011139
1936 The Damned United 0.010909
1937 Thirteen Days 0.010834
1938 Men in Black 3 0.007812
1939 Full Metal Jacket 0.007170

1940 rows × 2 columns

Bonus — Interactive Data Visualization

#Set up Bokeh to work in Jupyter notebook
output_notebook()
Loading BokehJS ...

Here’s an interactive data visualization of these Hollywood films by release year and percentage of women dialogue. This data viz was created with the Python library Bokeh, which we will discuss in some later lessons.

  • Scroll to zoom in

  • Hover to see more information about each point

To see the code that created this visualization, select “Click to show” below.

#Make groupby into a new DataFrame
dialogue_df = women_film_df.groupby(['title', 'release_year'])[['proportion_of_dialogue']].sum().sort_values(by='proportion_of_dialogue', ascending=False).reset_index()

#Import necessary Bokeh bodules
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, NumeralTickFormatter
from bokeh.io import output_notebook, show
from bokeh.palettes import RdBu
from bokeh.transform import linear_cmap, jitter

# Set up the source data that will suppply the x,y columns and the film title hover text
source = ColumnDataSource(dialogue_df)

# Set the hover tool tip to the film title, release year, and proportion of dialogue

TOOLTIPS = [("Title", "@title"),
            ("Year", "@release_year"),
           ("Women Dialogue", "@{proportion_of_dialogue}{%0.2f}")]

#Set up Bokeh plot with title, labels, 
bokeh_plot = figure(title="How Much Do Women Speak in Hollywood Films?", x_axis_label = 'Film Release Year',
                    y_axis_label = 'Amount of Dialogue Spoken By Women',x_range = [1930, 2018], y_range = [0, 1.01],
                 tooltips=TOOLTIPS, width=900, height=850, active_scroll='wheel_zoom')

# Create a red to blue color palette
color_mapper = linear_cmap(field_name='proportion_of_dialogue', palette=RdBu[4], low=1.1, high=0)

# Supply inidivudal points values
bokeh_plot.circle(y='proportion_of_dialogue', x=jitter('release_year', width=.2),
         size = 20,
        line_color='black',
        line_alpha=.4,
         source=source,
         color=color_mapper, alpha=.5)

bokeh_plot.title.text_font_size='20pt'

#Make Y axis percentages
bokeh_plot.yaxis.formatter = NumeralTickFormatter(format='0 %')

show(bokeh_plot)