Pandas Basics — Part 3#

Note: You can explore the associated workbook for this chapter in the cloud.

In this lesson, we’re going to introduce more fundamentals of Pandas, a powerful Python library for working with tabular data like CSV files.

We will review skills learned from the last two lessons and introduce how to:

  • Check for duplicate data

  • Clean and transform data

  • Manipulate string data

  • Apply functions

  • Reset index

  • Bonus! Create an interactive data viz


Dataset#

The Pudding’s Film Dialogue Data#

Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles.

But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?

-Hannah Anderson and Matt Daniels, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”

The dataset that we’re working with in this lesson is taken from Hannah Andersen and Matt Daniels’s Pudding essay, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”. The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how “white men dominate movie roles.”

Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They claim, in fact, that one of the most frequently asked questions about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don’t. To determine character gender, they used actors’ IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”

As we work with this data, we want to be critical and cognizant of this approach to gender. How does such a binary understanding of gender, gleaned from the IMDB pages of actors, influence our later results and conclusions? What do we gain by using such an approach, and what do we lose? How else might we have encoded or determined gender for the same data?


Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

The above import statement not only imports the Pandas library but also gives it an alias or nickname — pd. This alias will save us from having to type out the entire words pandas each time we need to use it. Many Python libraries have commonly used aliases like pd.

Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 100

Read in CSV File

To read in a CSV file, we will use the function pd.read_csv() and insert the name of our desired file path.

film_df = pd.read_csv('../data/Pudding/Pudding-Film-Dialogue-Salty.csv', delimiter=",", encoding='utf-8')

When reading in the CSV file, we also specified the encoding and delimiter. The delimiter parameter specifies the character that separates or “delimits” the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for Comma Separated Values.) Sometimes, however, the delimiter of a CSV file might be a tab (\t) or, more rarely, another character.

Display Data

We can display a DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.

Pandas Review

NaN is the Pandas value for any missing data. See “Working with missing data” for more information.

film_df
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715
... ... ... ... ... ... ... ... ... ... ...
23047 9254 maurice 1107 m 71.0 tt0101414 Beauty and the Beast 1991 452.0 0.108967
23048 9254 monsieur d'arqu 114 m 58.0 tt0101414 Beauty and the Beast 1991 452.0 0.011222
23049 9254 mrs. potts 564 f 66.0 tt0101414 Beauty and the Beast 1991 452.0 0.055517
23050 9254 wardrobe 121 f 54.0 tt0101414 Beauty and the Beast 1991 452.0 0.011911
23051 9254 mrs. potts 564 f 66.0 tt0101414 Beauty and the Beast 1991 452.0 0.055517

23052 rows × 10 columns

There are a few important things to note about the DataFrame displayed here:

  • Index

    • The bolded ascending numbers in the very left-hand column of the DataFrame is called the Pandas Index. You can select rows based on the Index.

    • By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

  • Truncation

    • The DataFrame is truncated, signaled by the ellipses in the middle ... of every column.

    • The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.

  • Rows x Columns

    • Pandas reports how many rows and columns are in this dataset at the bottom of the output (23,052 x 10 columns).

Display First n Rows

To look at the first n rows in a DataFrame, we can use a method called .head().

film_df.head(10)
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715
5 280 michael johnson 723 m 38.0 tt0112579 The Bridges of Madison County 1995 142.0 0.113075
6 280 robert kincaid 1908 m 65.0 tt0112579 The Bridges of Madison County 1995 142.0 0.298405
7 623 bobby korfin 328 m NaN tt0179626 15 Minutes 2001 37.0 0.036012
8 623 daphne handlova 409 f 28.0 tt0179626 15 Minutes 2001 37.0 0.044906
9 623 deputy chief fi 347 m NaN tt0179626 15 Minutes 2001 37.0 0.038098

Display Random Sample

To look at a random sample of rows, we can use the .sample() method.

film_df.sample(10)
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
3519 1483 auntie em 295 f 63.0 tt0032138 The Wizard of Oz 1939 839.0 0.031426
5395 1991 holden mcneil 9462 m 25.0 tt0118842 Chasing Amy 1997 22.0 0.300973
5238 1957 lightning mcque 1968 m 43.0 tt1216475 Cars 2 2011 209.0 0.155622
2992 1350 tariq asani 267 m 31.0 tt0360009 Spartan 2004 6.0 0.043337
18281 6858 mara 568 f NaN tt0114852 Village of the Damned 1995 18.0 0.099006
8333 2764 nurse #1 162 f 37.0 tt0264796 Life as a House 2001 23.0 0.007482
13345 4394 jonathan 1168 m 47.0 tt1691920 I Melt with You 2011 NaN 0.159041
16889 6053 bomber 392 m NaN tt1532958 Battle of the Year 2013 9.0 0.035617
15704 5297 mr. green 825 m 38.0 tt0088930 Clue 1985 7.0 0.087701
75 642 kurt zagon 914 m 60.0 tt0094602 Above the Law 1988 40.0 0.116522

Examine Data

Shape

To explicitly check for how many rows vs columns make up a dataset, we can use the .shape method.

film_df.shape
(23052, 10)

There are 23,052 rows and 10 columns.

Data Types

Just like Python has different data types, Pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the .dtypes method.

Pandas Data Type

Explanation

object

string

float64

float

int64

integer

datetime64

date time

film_df.dtypes
script_id                   int64
imdb_character_name        object
words                       int64
gender                     object
age                       float64
imdb_id                    object
title                      object
year                        int64
gross                     float64
proportion_of_dialogue    float64
dtype: object

It’s important to always check the data types in your DataFrame. For example, sometimes numeric values will accidentally be interpreted as a string object. To perform calculations on this data, you would need to first convert that column from a string to an integer.

Columns

We can also check the column names of the DataFrame with .columns

film_df.columns
Index(['script_id', 'imdb_character_name', 'words', 'gender', 'age', 'imdb_id',
       'title', 'year', 'gross', 'proportion_of_dialogue'],
      dtype='object')

Summary Statistics

film_df.describe(include='all')
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
count 23052.000000 23050 23052.000000 23052 18266.000000 23052 23052 23052.000000 19390.000000 23052.000000
unique NaN 17613 NaN 3 NaN 2000 1994 NaN NaN NaN
top NaN doctor NaN m NaN tt0116905 Lone Star NaN NaN NaN
freq NaN 36 NaN 16135 NaN 40 40 NaN NaN NaN
mean 4195.580904 NaN 907.870684 NaN 42.386839 NaN NaN 1998.133828 106.805570 0.086512
std 2473.518048 NaN 1399.538583 NaN 59.714730 NaN NaN 14.745342 145.992394 0.107740
min 280.000000 NaN 101.000000 NaN 3.000000 NaN NaN 1929.000000 0.000000 0.001537
25% 2095.000000 NaN 193.000000 NaN 30.000000 NaN NaN 1992.000000 22.000000 0.019771
50% 3694.000000 NaN 396.000000 NaN 39.000000 NaN NaN 2001.000000 56.000000 0.042421
75% 6234.000000 NaN 980.000000 NaN 50.000000 NaN NaN 2009.000000 136.000000 0.104166
max 9254.000000 NaN 28102.000000 NaN 2013.000000 NaN NaN 2015.000000 1798.000000 0.923422

Do you notice any outliers, anomalies, or potential problems here?

The maximum value in the “age” column is 2013! That seems like an error.

film_df[film_df['age'] == 2013]
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue
11639 3737 lucas solomon 190 m 2013.0 tt0993846 The Wolf of Wall Street 2013 125.0 0.011077

Let’s drop this row from the dataset by using the .drop() method and the Index number of the row.

film_df = film_df.drop(11639)

Now if we look for it again, that row is gone.

film_df[film_df['age'] == 2013]
script_id imdb_character_name words gender age imdb_id title year gross proportion_of_dialogue

Rename Columns

film_df = film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'})
film_df.head()
script_id character words gender age imdb_id title release_year gross proportion_of_dialogue
0 280 betty 311 f 35.0 tt0112579 The Bridges of Madison County 1995 142.0 0.048639
1 280 carolyn johnson 873 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.136534
2 280 eleanor 138 f NaN tt0112579 The Bridges of Madison County 1995 142.0 0.021583
3 280 francesca johns 2251 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.352049
4 280 madge 190 f 46.0 tt0112579 The Bridges of Madison County 1995 142.0 0.029715

Drop Columns

film_df = film_df.drop(columns='imdb_id')

Missing Data

.isna() / .notna()

Pandas has special ways of dealing with missing data. As you may have already noticed, blank rows in a CSV file show up as NaN in a Pandas DataFrame.

To filter and count the number of missing/not missing values in a dataset, we can use the special .isna() and .notna() methods on a DataFrame or Series object.

The .isna() and .notna() methods return True/False pairs for each row, which we can use to filter the DataFrame for any rows that have information in a given column.

film_df[film_df['character'].isna()]
script_id character words gender age title release_year gross proportion_of_dialogue
4656 1807 NaN 146 f NaN Beloved 1998 42.0 0.005162
19448 7445 NaN 520 f NaN House of Games 1987 5.0 0.165184

This is important information for the sake of better understanding our dataset. But it’s also important because NaN values are treated as floats, not strings. If we tried to manipulate this column as text data, we would get an error. For this reason, we’re going to replace or “fill” these NaN values with the string “No Character Data” by using the .fillna() method.

film_df['character'] = film_df['character'].fillna('No Character Data')
film_df[film_df['character'].isna()]
script_id character words gender age title release_year gross proportion_of_dialogue

Check for Duplicates#

Duplicates#

We can check for duplicate rows by using the .duplicated() method and setting the parameter keep=False, which will display all the duplicated values in the dataset — rather than just the first duplicated value keep='first' or the last duplicated value keep='last'.

film_df.duplicated(keep=False)
0        False
1        False
2        False
3        False
4        False
         ...  
23047    False
23048    False
23049     True
23050    False
23051     True
Length: 23051, dtype: bool

The output above is reporting whether each row in the dataset is a duplicate. We can use the .duplicated() method inside a filter to isolate only the rows in the dataframe that are exact duplicates.

film_df[film_df.duplicated(keep=False)]
script_id character words gender age title release_year gross proportion_of_dialogue
21001 8099 c-3po 138 m 69.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.020702
21002 8099 c-3po 138 m 69.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.020702
21963 8560 kilgore trout 2673 m 63.0 Breakfast of Champions 1999 NaN 0.197211
21964 8560 kilgore trout 2673 m 63.0 Breakfast of Champions 1999 NaN 0.197211
22935 9222 groomsman #5 238 m NaN Wedding Crashers 2005 283.0 0.012032
22936 9222 groomsman #5 238 m NaN Wedding Crashers 2005 283.0 0.012032
23049 9254 mrs. potts 564 f 66.0 Beauty and the Beast 1991 452.0 0.055517
23051 9254 mrs. potts 564 f 66.0 Beauty and the Beast 1991 452.0 0.055517

We can drop duplicates from the DataFrame with the .drop_duplicates() method and choose to keep the first instance of the duplicate or the last instance.

film_df = film_df.drop_duplicates(keep='first')

Now if we check the data for duplicates again, they should be all gone.

film_df[film_df.duplicated(keep=False)]
script_id character words gender age title release_year gross proportion_of_dialogue

Clean and Transform Data#

Pandas .str Methods#

Remember all the special things that you can do with Python strings aka string methods?

Pandas has special Pandas string methods, too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add .str to the method chain.

Pandas String Method

Explanation

df[‘column_name’].str.lower()

makes the string in each row lowercase

df[‘column_name’].str.upper()

makes the string in each row uppercase

df[‘column_name’].str.title()

makes the string in each row titlecase

df[‘column_name’].str.replace('old string', 'new string')

replaces old string with new string for each row

df[‘column_name’].str.contains('some string')

tests whether string in each row contains “some string”

df[‘column_name’].str.split('delim')

returns a list of substrings separated by the given delimiter

df[‘column_name’].str.join(list)

opposite of split(), joins the elements in the given list together using the string

For example, to transform every character’s name in the “character” column from lowercase to uppercase, we can use .str.upper()

film_df['character'].str.upper()
0                  BETTY
1        CAROLYN JOHNSON
2                ELEANOR
3        FRANCESCA JOHNS
4                  MADGE
              ...       
23046            LUMIERE
23047            MAURICE
23048    MONSIEUR D'ARQU
23049         MRS. POTTS
23050           WARDROBE
Name: character, Length: 23047, dtype: object

To transform every character’s name in the “character” column to lowercase, we can use .str.lower()

film_df['character'].str.lower()
0                  betty
1        carolyn johnson
2                eleanor
3        francesca johns
4                  madge
              ...       
23046            lumiere
23047            maurice
23048    monsieur d'arqu
23049         mrs. potts
23050           wardrobe
Name: character, Length: 23047, dtype: object

If we want to replace the gender columns’s single letter abbreviation for “male” / “female” (sex) with “man” / “woman” (gender identity), we could use the .str.replace() method.

film_df['gender'] = film_df['gender'].str.replace('m', 'man')
film_df['gender'] = film_df['gender'].str.replace('f', 'woman')
film_df.sample(10)
script_id character words gender age title release_year gross proportion_of_dialogue
18526 7014 julie 256 woman 22.0 Showgirls 1995 40.0 0.023103
10160 3289 charlie mccorry 584 man 40.0 The Searchers 1956 NaN 0.036878
15578 5214 guy 163 man NaN The Wedding Singer 1998 148.0 0.011877
649 793 ellis 223 man NaN Cellular 2004 44.0 0.023130
892 851 peterson 207 man NaN Deception 2008 5.0 0.024247
1190 915 clear rivers 472 woman 24.0 Final Destination 2000 86.0 0.163946
3032 1360 lama su 350 man 48.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.046574
10153 3288 somerset 8838 man 58.0 Se7en 1995 200.0 0.428281
10421 3352 anna crowe 916 woman 31.0 The Sixth Sense 1999 503.0 0.080549
7132 2442 noah 160 man 32.0 The Grapes of Wrath 1940 NaN 0.006841

We can use the .str.contains() to search for particular words or phrases in a column, such as “Star Wars.”

film_df[film_df['title'].str.contains('Star Wars')]
script_id character words gender age title release_year gross proportion_of_dialogue
3017 1359 admiral ackbar 199 man 61.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.039096
3018 1359 ben 'obi-wan' k 462 man 69.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.090766
3019 1359 c-3po 881 man 37.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.173084
3020 1359 darth vader 381 man 48.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.074853
3021 1359 han solo 835 man 41.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.164047
3022 1359 lando calrissia 379 man 46.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.074460
3023 1359 luke skywalker 915 man 32.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.179764
3024 1359 princess leia 359 woman 27.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.070530
3025 1359 the emperor 516 man 39.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.101375
3026 1359 yoda 163 man 19.0 Star Wars: Episode VI - Return of the Jedi 1983 853.0 0.032024
3027 1360 anakin skywalke 2061 man 21.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.274251
3028 1360 cliegg lars 152 man 62.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.020226
3029 1360 count dooku 321 man 80.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.042715
3030 1360 dexter jettster 151 man 67.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.020093
3031 1360 jar jar binks 294 man 29.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.039122
3032 1360 lama su 350 man 48.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.046574
3033 1360 mace windu 365 man 54.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.048570
3034 1360 madame jocasta 331 woman 82.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.044045
3035 1360 obi-wan kenobi 1477 man 31.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.196540
3036 1360 padme 668 woman 21.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.088889
3037 1360 queen jamillia 135 woman 25.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.017964
3038 1360 supreme chancel 576 man 58.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.076647
3039 1360 taun we 139 woman 40.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.018496
3040 1360 watto 140 man 49.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.018629
3041 1360 yoda 355 man 58.0 Star Wars: Episode II - Attack of the Clones 2002 465.0 0.047239
3058 1362 ben obi-wan ken 995 man 63.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.099084
3059 1362 c-3po 1506 man 31.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.149970
3060 1362 cmdr. praji (im 117 man 42.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.011651
3061 1362 darth vader 492 man 46.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.048994
3062 1362 general dodonna 185 man 66.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.018423
3063 1362 grand moff tark 411 man 64.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.040928
3064 1362 greedo 103 man 33.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.010257
3065 1362 han solo 1730 man 35.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.172276
3066 1362 jabba 185 man NaN Star Wars: Episode IV - A New Hope 1977 1798.0 0.018423
3067 1362 luke skywalker 2485 man 26.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.247461
3068 1362 princess leia o 636 woman 21.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.063334
3069 1362 red leader 241 man 37.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.023999
3070 1362 red three (bigg 523 man 38.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.052081
3071 1362 stormtrooper 122 man 36.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.012149
3072 1362 uncle owen 311 man 61.0 Star Wars: Episode IV - A New Hope 1977 1798.0 0.030970
10591 3427 anakin skywalke 3768 man 24.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.271138
10592 3427 c-3po 312 man 59.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.022451
10593 3427 Clone Jedi Knig 104 man NaN Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.007484
10594 3427 darth sldious 244 man NaN Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.017558
10595 3427 general grievou 317 man 33.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.022811
10596 3427 Ki-Adi-Mundi / 107 man 40.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.007700
10597 3427 Mace Windu 407 man 57.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.029287
10598 3427 mon mothma 127 woman 28.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.009139
10599 3427 obi-wan kenobi 3181 man 34.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.228898
10600 3427 Padme 1561 woman 24.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.112326
10601 3427 senator bail or 479 man 50.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.034468
10602 3427 Supreme Chancel 2367 man 61.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.170325
10603 3427 yoda 923 man 61.0 Star Wars: Episode III - Revenge of the Sith 2005 515.0 0.066417
10604 3433 admiral piett 232 man 43.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.025127
10605 3433 ben (obi-wan) k 177 man 66.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.019170
10606 3433 C-3PO 1474 man 34.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.159645
10607 3433 captain needa 108 man 42.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.011697
10608 3433 darth vader 853 man 49.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.092386
10609 3433 general veers 114 man 45.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.012347
10610 3433 han solo 2024 man 38.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.219214
10611 3433 lando calrissia 798 man 43.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.086429
10612 3433 luke skywalker 1282 man 29.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.138850
10613 3433 princess leia 1025 woman 24.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.111015
10614 3433 rebel force gen 168 man 50.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.018196
10615 3433 snow creature 227 man NaN Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.024586
10616 3433 yoda 751 man 36.0 Star Wars: Episode V - The Empire Strikes Back 1980 937.0 0.081339
10617 3437 anakin skywalke 1697 man 10.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.158037
10618 3437 boss nass 209 man 63.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.019464
10619 3437 Captain Panaka 390 man 45.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.036320
10620 3437 chancellor valo 127 man 61.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.011827
10621 3437 darth sidious 277 man NaN Star Wars: Episode I - The Phantom Menace 1999 813.0 0.025796
10622 3437 jar jar binks 632 man 26.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.058856
10623 3437 mace windu 247 man 51.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.023002
10624 3437 nute gunray 544 man 34.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.050661
10625 3437 obi-wan kenobi 701 man 28.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.065282
10626 3437 padme 718 woman 18.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.066865
10627 3437 queen amidala 656 woman 18.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.061091
10628 3437 qui-gon jinn 2339 man 47.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.217825
10629 3437 ric oli� 132 man 42.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.012293
10630 3437 rune haako 126 man NaN Star Wars: Episode I - The Phantom Menace 1999 813.0 0.011734
10631 3437 Senator Palpati 551 man 55.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.051313
10632 3437 shmi skywalker 414 woman 41.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.038555
10633 3437 sio bibble 164 man 60.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.015273
10634 3437 watto 502 man 46.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.046750
10635 3437 yoda 312 man 55.0 Star Wars: Episode I - The Phantom Menace 1999 813.0 0.029056
21000 8099 bala-tik 140 man NaN Star Wars: Episode VII - The Force Awakens 2015 927.0 0.021002
21001 8099 c-3po 138 man 69.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.020702
21003 8099 finn 1447 man 23.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.217072
21004 8099 general hux 356 man 32.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.053405
21005 8099 han solo 1225 man 73.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.183768
21006 8099 kylo ren 618 man 32.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.092709
21007 8099 maz kanata 241 woman 32.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.036154
21008 8099 poe dameron 604 man 36.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.090609
21009 8099 princess leia 252 woman 59.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.037804
21010 8099 rey 1363 woman 23.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.204470
21011 8099 stormtrooper 114 man NaN Star Wars: Episode VII - The Force Awakens 2015 927.0 0.017102
21012 8099 supreme leader 168 man 51.0 Star Wars: Episode VII - The Force Awakens 2015 927.0 0.025203

We can use the .str.contains() to search for particular words or phrases in a column, such as “Mean Girls.”

film_df[film_df['title'].str.contains('Mean Girls')]
script_id character words gender age title release_year gross proportion_of_dialogue
13938 4605 aaron samuels 426 man 23.0 Mean Girls 2004 120.0 0.053890
13939 4605 cady heron 2798 woman 18.0 Mean Girls 2004 120.0 0.353953
13940 4605 damian 624 man 26.0 Mean Girls 2004 120.0 0.078937
13941 4605 gretchen wiener 609 woman 22.0 Mean Girls 2004 120.0 0.077040
13942 4605 janis ian 907 woman 22.0 Mean Girls 2004 120.0 0.114738
13943 4605 karen smith 301 woman 19.0 Mean Girls 2004 120.0 0.038077
13944 4605 mr. duvall 365 man 43.0 Mean Girls 2004 120.0 0.046173
13945 4605 mrs. george 125 woman 33.0 Mean Girls 2004 120.0 0.015813
13946 4605 ms. norbury 720 woman 34.0 Mean Girls 2004 120.0 0.091082
13947 4605 regina george 1030 woman 26.0 Mean Girls 2004 120.0 0.130297

Applying Functions#

With the .apply() method, we can run a function on every single row in a Pandas column or dataframe.

def make_text_title_case(text):
    title_case_text = text.title()
    return title_case_text
make_text_title_case("betty")
'Betty'
film_df['character'].apply(make_text_title_case)
0                  Betty
1        Carolyn Johnson
2                Eleanor
3        Francesca Johns
4                  Madge
              ...       
23046            Lumiere
23047            Maurice
23048    Monsieur D'Arqu
23049         Mrs. Potts
23050           Wardrobe
Name: character, Length: 23047, dtype: object
film_df['character'] = film_df['character'].apply(make_text_title_case)
film_df.sample(10)
script_id character words gender age title release_year gross proportion_of_dialogue
21944 8559 Joey Pinero 288 man 40.0 Primal Fear 1996 110.0 0.021510
1698 1044 Anna Mcdoogles 1346 woman 37.0 The Invention of Lying 2009 21.0 0.160105
11675 3746 Randy 'The Ram' 5220 man 56.0 The Wrestler 2008 31.0 0.496009
801 821 Peter 804 man NaN Contact 1997 191.0 0.068449
15636 5236 Michael Jackson 151 man NaN Willard 1971 101.0 0.016402
1059 885 Lord Howard 154 man 65.0 Elizabeth: The Golden Age 2007 20.0 0.021050
9011 2949 Cosmo Castorini 1196 man 67.0 Moonstruck 1987 179.0 0.063957
473 745 Deputy Da Ellis 462 man 37.0 The Black Dahlia 2006 29.0 0.040477
17745 6535 Rick Cowley 150 man 46.0 The Dark Half 1993 20.0 0.014219
2086 1140 Dr. Arthur Neum 267 man 50.0 The Mask 1994 249.0 0.042995

Filter DataFrame

We can filter the DataFrames for only characters who are men or women.

men_film_df = film_df[film_df['gender'] == 'man']
women_film_df = film_df[film_df['gender'] == 'woman']

Groupby

We can use the .groupby() function to group all the men characters in each film and sum up their total dialogue.

By adding a Python string slice, we can identify the top 20 films with the greatest proportion of men speaking.

men_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]
proportion_of_dialogue
title
The Men Who Stare at Goats 1.0
Kagemusha 1.0
The Wild Bunch 1.0
Killing Them Softly 1.0
Stalag 17 1.0
There Will Be Blood 1.0
Fury 1.0
The Revenant 1.0
Saving Private Ryan 1.0
Crimson Tide 1.0
Die Hard: With a Vengeance 1.0
Jeremiah Johnson 1.0
Conquest of the Planet of the Apes 1.0
Duel 1.0
The Hunt for Red October 1.0
Platoon 1.0
Reservoir Dogs 1.0
The Battle of Algiers 1.0
Armored 1.0
Space Jam 1.0

We can use the .groupby() function to group all the women characters in each film and sum up their total dialogue.

By adding a Python string slice, we can identify the top 20 films with the greatest proportion of women speaking.

women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]
proportion_of_dialogue
title
The Descent 1.000000
Now and Then 1.000000
Precious 0.993541
Martyrs 0.965550
The Hand That Rocks the Cradle 0.933750
Agnes of God 0.922482
Heavenly Creatures 0.919368
The Help 0.916947
3 Women 0.899530
The Watermelon Woman 0.894676
Grandma 0.894137
The Craft 0.891866
Easy A 0.888131
The Others 0.879972
Black Christmas 0.877467
Wadjda 0.870567
For Colored Girls 0.870247
The Roommate 0.869879
Mrs. Winterbourne 0.864730
Practical Magic 0.856693

Reset Index#

We can transform a Groupby object into a DataFrame with a regular Index by tacking on .reset_index().

women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False).reset_index()
title proportion_of_dialogue
0 The Descent 1.000000
1 Now and Then 1.000000
2 Precious 0.993541
3 Martyrs 0.965550
4 The Hand That Rocks the Cradle 0.933750
... ... ...
1935 The Last Castle 0.011139
1936 The Damned United 0.010909
1937 Thirteen Days 0.010834
1938 Men in Black 3 0.007812
1939 Full Metal Jacket 0.007170

1940 rows × 2 columns

Bonus — Interactive Data Visualization#

#Import necessary Bokeh modules
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, NumeralTickFormatter
from bokeh.io import output_notebook, show
from bokeh.palettes import RdBu
from bokeh.transform import linear_cmap, jitter
#Set up Bokeh to work in Jupyter notebook
output_notebook()
Loading BokehJS ...

Here’s an interactive data visualization of these Hollywood films by release year and percentage of women dialogue. This data viz was created with the Python library Bokeh, which we will discuss in some later lessons.

  • Scroll to zoom in

  • Hover to see more information about each point

To see the code that created this visualization, select “Click to show” below.

Hide code cell source
#Make groupby into a new DataFrame
dialogue_df = women_film_df.groupby(['title', 'release_year'])[['proportion_of_dialogue']].sum()\
.sort_values(by='proportion_of_dialogue', ascending=False).reset_index()

# Set up the source data that will suppply the x,y columns and the film title hover text
source = ColumnDataSource(dialogue_df)

# Set the hover tool tip to the film title, release year, and proportion of dialogue

TOOLTIPS = [("Title", "@title"),
            ("Year", "@release_year"),
           ("Women Dialogue", "@{proportion_of_dialogue}{%0.2f}")]

#Set up Bokeh plot with title, labels, 
bokeh_plot = figure(title="How Much Do Women Speak in Hollywood Films?", x_axis_label = 'Film Release Year',
                    y_axis_label = 'Amount of Dialogue Spoken By Women',x_range = [1930, 2018], y_range = [0, 1.01],
                 tooltips=TOOLTIPS, width=800, height=550, active_scroll='wheel_zoom')

# Create a red to blue color palette
color_mapper = linear_cmap(field_name='proportion_of_dialogue', palette=RdBu[4], low=1.1, high=0)

# Supply inidivudal points values
bokeh_plot.circle(y='proportion_of_dialogue', x=jitter('release_year', width=.2),
         size = 10,
        line_color='black',
        line_alpha=.4,
         source=source,
         color=color_mapper, alpha=.5)

bokeh_plot.title.text_font_size='20pt'

#Make Y axis percentages
bokeh_plot.yaxis.formatter = NumeralTickFormatter(format='0 %')

show(bokeh_plot)