Pandas Basics — Part 3#
Note: You can explore the associated workbook for this chapter in the cloud.
In this lesson, we’re going to introduce more fundamentals of Pandas, a powerful Python library for working with tabular data like CSV files.
We will review skills learned from the last two lessons and introduce how to:
Check for duplicate data
Clean and transform data
Manipulate string data
Apply functions
Reset index
Bonus! Create an interactive data viz
Dataset#
The Pudding’s Film Dialogue Data#
Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles.
But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?
-Hannah Anderson and Matt Daniels, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”
The dataset that we’re working with in this lesson is taken from Hannah Andersen and Matt Daniels’s Pudding essay, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”. The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how “white men dominate movie roles.”
Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They claim, in fact, that one of the most frequently asked questions about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don’t. To determine character gender, they used actors’ IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”
As we work with this data, we want to be critical and cognizant of this approach to gender. How does such a binary understanding of gender, gleaned from the IMDB pages of actors, influence our later results and conclusions? What do we gain by using such an approach, and what do we lose? How else might we have encoded or determined gender for the same data?
Import Pandas
To use the Pandas library, we first need to import
it.
import pandas as pd
The above import
statement not only imports the Pandas library but also gives it an alias or nickname — pd
. This alias will save us from having to type out the entire words pandas
each time we need to use it. Many Python libraries have commonly used aliases like pd
.
Set Display Settings
By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.
pd.options.display.max_rows = 100
Read in CSV File
To read in a CSV file, we will use the function pd.read_csv()
and insert the name of our desired file path.
film_df = pd.read_csv('../data/Pudding/Pudding-Film-Dialogue-Salty.csv', delimiter=",", encoding='utf-8')
When reading in the CSV file, we also specified the encoding
and delimiter
. The delimiter
parameter specifies the character that separates or “delimits” the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for Comma Separated Values.) Sometimes, however, the delimiter of a CSV file might be a tab (\t
) or, more rarely, another character.
Display Data
We can display a DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.
Pandas Review
NaN
is the Pandas value for any missing data. See “Working with missing data” for more information.
film_df
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 280 | betty | 311 | f | 35.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.048639 |
1 | 280 | carolyn johnson | 873 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.136534 |
2 | 280 | eleanor | 138 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.021583 |
3 | 280 | francesca johns | 2251 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.352049 |
4 | 280 | madge | 190 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.029715 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23047 | 9254 | maurice | 1107 | m | 71.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.108967 |
23048 | 9254 | monsieur d'arqu | 114 | m | 58.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.011222 |
23049 | 9254 | mrs. potts | 564 | f | 66.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.055517 |
23050 | 9254 | wardrobe | 121 | f | 54.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.011911 |
23051 | 9254 | mrs. potts | 564 | f | 66.0 | tt0101414 | Beauty and the Beast | 1991 | 452.0 | 0.055517 |
23052 rows × 10 columns
There are a few important things to note about the DataFrame displayed here:
Index
The bolded ascending numbers in the very left-hand column of the DataFrame is called the Pandas Index. You can select rows based on the Index.
By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.
Truncation
The DataFrame is truncated, signaled by the ellipses in the middle
...
of every column.The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.
Rows x Columns
Pandas reports how many rows and columns are in this dataset at the bottom of the output (23,052 x 10 columns).
Display First n Rows
To look at the first n rows in a DataFrame, we can use a method called .head()
.
film_df.head(10)
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 280 | betty | 311 | f | 35.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.048639 |
1 | 280 | carolyn johnson | 873 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.136534 |
2 | 280 | eleanor | 138 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.021583 |
3 | 280 | francesca johns | 2251 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.352049 |
4 | 280 | madge | 190 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.029715 |
5 | 280 | michael johnson | 723 | m | 38.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.113075 |
6 | 280 | robert kincaid | 1908 | m | 65.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.298405 |
7 | 623 | bobby korfin | 328 | m | NaN | tt0179626 | 15 Minutes | 2001 | 37.0 | 0.036012 |
8 | 623 | daphne handlova | 409 | f | 28.0 | tt0179626 | 15 Minutes | 2001 | 37.0 | 0.044906 |
9 | 623 | deputy chief fi | 347 | m | NaN | tt0179626 | 15 Minutes | 2001 | 37.0 | 0.038098 |
Display Random Sample
To look at a random sample of rows, we can use the .sample()
method.
film_df.sample(10)
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
3519 | 1483 | auntie em | 295 | f | 63.0 | tt0032138 | The Wizard of Oz | 1939 | 839.0 | 0.031426 |
5395 | 1991 | holden mcneil | 9462 | m | 25.0 | tt0118842 | Chasing Amy | 1997 | 22.0 | 0.300973 |
5238 | 1957 | lightning mcque | 1968 | m | 43.0 | tt1216475 | Cars 2 | 2011 | 209.0 | 0.155622 |
2992 | 1350 | tariq asani | 267 | m | 31.0 | tt0360009 | Spartan | 2004 | 6.0 | 0.043337 |
18281 | 6858 | mara | 568 | f | NaN | tt0114852 | Village of the Damned | 1995 | 18.0 | 0.099006 |
8333 | 2764 | nurse #1 | 162 | f | 37.0 | tt0264796 | Life as a House | 2001 | 23.0 | 0.007482 |
13345 | 4394 | jonathan | 1168 | m | 47.0 | tt1691920 | I Melt with You | 2011 | NaN | 0.159041 |
16889 | 6053 | bomber | 392 | m | NaN | tt1532958 | Battle of the Year | 2013 | 9.0 | 0.035617 |
15704 | 5297 | mr. green | 825 | m | 38.0 | tt0088930 | Clue | 1985 | 7.0 | 0.087701 |
75 | 642 | kurt zagon | 914 | m | 60.0 | tt0094602 | Above the Law | 1988 | 40.0 | 0.116522 |
Examine Data
Shape
To explicitly check for how many rows vs columns make up a dataset, we can use the .shape
method.
film_df.shape
(23052, 10)
There are 23,052 rows and 10 columns.
Data Types
Just like Python has different data types, Pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the .dtypes
method.
Pandas Data Type |
Explanation |
---|---|
|
string |
|
float |
|
integer |
|
date time |
film_df.dtypes
script_id int64
imdb_character_name object
words int64
gender object
age float64
imdb_id object
title object
year int64
gross float64
proportion_of_dialogue float64
dtype: object
It’s important to always check the data types in your DataFrame. For example, sometimes numeric values will accidentally be interpreted as a string object. To perform calculations on this data, you would need to first convert that column from a string to an integer.
Columns
We can also check the column names of the DataFrame with .columns
film_df.columns
Index(['script_id', 'imdb_character_name', 'words', 'gender', 'age', 'imdb_id',
'title', 'year', 'gross', 'proportion_of_dialogue'],
dtype='object')
Summary Statistics
film_df.describe(include='all')
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
count | 23052.000000 | 23050 | 23052.000000 | 23052 | 18266.000000 | 23052 | 23052 | 23052.000000 | 19390.000000 | 23052.000000 |
unique | NaN | 17613 | NaN | 3 | NaN | 2000 | 1994 | NaN | NaN | NaN |
top | NaN | doctor | NaN | m | NaN | tt0116905 | Lone Star | NaN | NaN | NaN |
freq | NaN | 36 | NaN | 16135 | NaN | 40 | 40 | NaN | NaN | NaN |
mean | 4195.580904 | NaN | 907.870684 | NaN | 42.386839 | NaN | NaN | 1998.133828 | 106.805570 | 0.086512 |
std | 2473.518048 | NaN | 1399.538583 | NaN | 59.714730 | NaN | NaN | 14.745342 | 145.992394 | 0.107740 |
min | 280.000000 | NaN | 101.000000 | NaN | 3.000000 | NaN | NaN | 1929.000000 | 0.000000 | 0.001537 |
25% | 2095.000000 | NaN | 193.000000 | NaN | 30.000000 | NaN | NaN | 1992.000000 | 22.000000 | 0.019771 |
50% | 3694.000000 | NaN | 396.000000 | NaN | 39.000000 | NaN | NaN | 2001.000000 | 56.000000 | 0.042421 |
75% | 6234.000000 | NaN | 980.000000 | NaN | 50.000000 | NaN | NaN | 2009.000000 | 136.000000 | 0.104166 |
max | 9254.000000 | NaN | 28102.000000 | NaN | 2013.000000 | NaN | NaN | 2015.000000 | 1798.000000 | 0.923422 |
Do you notice any outliers, anomalies, or potential problems here?
The maximum value in the “age” column is 2013! That seems like an error.
film_df[film_df['age'] == 2013]
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
11639 | 3737 | lucas solomon | 190 | m | 2013.0 | tt0993846 | The Wolf of Wall Street | 2013 | 125.0 | 0.011077 |
Let’s drop this row from the dataset by using the .drop()
method and the Index number of the row.
film_df = film_df.drop(11639)
Now if we look for it again, that row is gone.
film_df[film_df['age'] == 2013]
script_id | imdb_character_name | words | gender | age | imdb_id | title | year | gross | proportion_of_dialogue |
---|
Rename Columns
film_df = film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'})
film_df.head()
script_id | character | words | gender | age | imdb_id | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 280 | betty | 311 | f | 35.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.048639 |
1 | 280 | carolyn johnson | 873 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.136534 |
2 | 280 | eleanor | 138 | f | NaN | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.021583 |
3 | 280 | francesca johns | 2251 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.352049 |
4 | 280 | madge | 190 | f | 46.0 | tt0112579 | The Bridges of Madison County | 1995 | 142.0 | 0.029715 |
Drop Columns
film_df = film_df.drop(columns='imdb_id')
Missing Data
.isna() / .notna()
Pandas has special ways of dealing with missing data. As you may have already noticed, blank rows in a CSV file show up as NaN
in a Pandas DataFrame.
To filter and count the number of missing/not missing values in a dataset, we can use the special .isna()
and .notna()
methods on a DataFrame or Series object.
The .isna()
and .notna()
methods return True/False pairs for each row, which we can use to filter the DataFrame for any rows that have information in a given column.
film_df[film_df['character'].isna()]
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|
4656 | 1807 | NaN | 146 | f | NaN | Beloved | 1998 | 42.0 | 0.005162 |
19448 | 7445 | NaN | 520 | f | NaN | House of Games | 1987 | 5.0 | 0.165184 |
This is important information for the sake of better understanding our dataset. But it’s also important because NaN
values are treated as floats, not strings. If we tried to manipulate this column as text data, we would get an error. For this reason, we’re going to replace or “fill” these NaN
values with the string “No Character Data” by using the .fillna()
method.
film_df['character'] = film_df['character'].fillna('No Character Data')
film_df[film_df['character'].isna()]
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue |
---|
Check for Duplicates#
Duplicates#
We can check for duplicate rows by using the .duplicated()
method and setting the parameter keep=False
, which will display all the duplicated values in the dataset — rather than just the first duplicated value keep='first'
or the last duplicated value keep='last'
.
film_df.duplicated(keep=False)
0 False
1 False
2 False
3 False
4 False
...
23047 False
23048 False
23049 True
23050 False
23051 True
Length: 23051, dtype: bool
The output above is reporting whether each row in the dataset is a duplicate. We can use the .duplicated()
method inside a filter to isolate only the rows in the dataframe that are exact duplicates.
film_df[film_df.duplicated(keep=False)]
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|
21001 | 8099 | c-3po | 138 | m | 69.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.020702 |
21002 | 8099 | c-3po | 138 | m | 69.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.020702 |
21963 | 8560 | kilgore trout | 2673 | m | 63.0 | Breakfast of Champions | 1999 | NaN | 0.197211 |
21964 | 8560 | kilgore trout | 2673 | m | 63.0 | Breakfast of Champions | 1999 | NaN | 0.197211 |
22935 | 9222 | groomsman #5 | 238 | m | NaN | Wedding Crashers | 2005 | 283.0 | 0.012032 |
22936 | 9222 | groomsman #5 | 238 | m | NaN | Wedding Crashers | 2005 | 283.0 | 0.012032 |
23049 | 9254 | mrs. potts | 564 | f | 66.0 | Beauty and the Beast | 1991 | 452.0 | 0.055517 |
23051 | 9254 | mrs. potts | 564 | f | 66.0 | Beauty and the Beast | 1991 | 452.0 | 0.055517 |
We can drop duplicates from the DataFrame with the .drop_duplicates()
method and choose to keep the first instance of the duplicate or the last instance.
film_df = film_df.drop_duplicates(keep='first')
Now if we check the data for duplicates again, they should be all gone.
film_df[film_df.duplicated(keep=False)]
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue |
---|
Clean and Transform Data#
Pandas .str
Methods#
Remember all the special things that you can do with Python strings aka string methods?
Pandas has special Pandas string methods, too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add .str
to the method chain.
Pandas String Method |
Explanation |
---|---|
df[‘column_name’] |
makes the string in each row lowercase |
df[‘column_name’] |
makes the string in each row uppercase |
df[‘column_name’] |
makes the string in each row titlecase |
df[‘column_name’] |
replaces |
df[‘column_name’] |
tests whether string in each row contains “some string” |
df[‘column_name’] |
returns a list of substrings separated by the given delimiter |
df[‘column_name’] |
opposite of split(), joins the elements in the given list together using the string |
For example, to transform every character’s name in the “character” column from lowercase to uppercase, we can use .str.upper()
film_df['character'].str.upper()
0 BETTY
1 CAROLYN JOHNSON
2 ELEANOR
3 FRANCESCA JOHNS
4 MADGE
...
23046 LUMIERE
23047 MAURICE
23048 MONSIEUR D'ARQU
23049 MRS. POTTS
23050 WARDROBE
Name: character, Length: 23047, dtype: object
To transform every character’s name in the “character” column to lowercase, we can use .str.lower()
film_df['character'].str.lower()
0 betty
1 carolyn johnson
2 eleanor
3 francesca johns
4 madge
...
23046 lumiere
23047 maurice
23048 monsieur d'arqu
23049 mrs. potts
23050 wardrobe
Name: character, Length: 23047, dtype: object
If we want to replace the gender columns’s single letter abbreviation for “male” / “female” (sex) with “man” / “woman” (gender identity), we could use the .str.replace()
method.
film_df['gender'] = film_df['gender'].str.replace('m', 'man')
film_df['gender'] = film_df['gender'].str.replace('f', 'woman')
film_df.sample(10)
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|
18526 | 7014 | julie | 256 | woman | 22.0 | Showgirls | 1995 | 40.0 | 0.023103 |
10160 | 3289 | charlie mccorry | 584 | man | 40.0 | The Searchers | 1956 | NaN | 0.036878 |
15578 | 5214 | guy | 163 | man | NaN | The Wedding Singer | 1998 | 148.0 | 0.011877 |
649 | 793 | ellis | 223 | man | NaN | Cellular | 2004 | 44.0 | 0.023130 |
892 | 851 | peterson | 207 | man | NaN | Deception | 2008 | 5.0 | 0.024247 |
1190 | 915 | clear rivers | 472 | woman | 24.0 | Final Destination | 2000 | 86.0 | 0.163946 |
3032 | 1360 | lama su | 350 | man | 48.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.046574 |
10153 | 3288 | somerset | 8838 | man | 58.0 | Se7en | 1995 | 200.0 | 0.428281 |
10421 | 3352 | anna crowe | 916 | woman | 31.0 | The Sixth Sense | 1999 | 503.0 | 0.080549 |
7132 | 2442 | noah | 160 | man | 32.0 | The Grapes of Wrath | 1940 | NaN | 0.006841 |
We can use the .str.contains()
to search for particular words or phrases in a column, such as “Star Wars.”
film_df[film_df['title'].str.contains('Star Wars')]
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|
3017 | 1359 | admiral ackbar | 199 | man | 61.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.039096 |
3018 | 1359 | ben 'obi-wan' k | 462 | man | 69.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.090766 |
3019 | 1359 | c-3po | 881 | man | 37.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.173084 |
3020 | 1359 | darth vader | 381 | man | 48.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.074853 |
3021 | 1359 | han solo | 835 | man | 41.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.164047 |
3022 | 1359 | lando calrissia | 379 | man | 46.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.074460 |
3023 | 1359 | luke skywalker | 915 | man | 32.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.179764 |
3024 | 1359 | princess leia | 359 | woman | 27.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.070530 |
3025 | 1359 | the emperor | 516 | man | 39.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.101375 |
3026 | 1359 | yoda | 163 | man | 19.0 | Star Wars: Episode VI - Return of the Jedi | 1983 | 853.0 | 0.032024 |
3027 | 1360 | anakin skywalke | 2061 | man | 21.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.274251 |
3028 | 1360 | cliegg lars | 152 | man | 62.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.020226 |
3029 | 1360 | count dooku | 321 | man | 80.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.042715 |
3030 | 1360 | dexter jettster | 151 | man | 67.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.020093 |
3031 | 1360 | jar jar binks | 294 | man | 29.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.039122 |
3032 | 1360 | lama su | 350 | man | 48.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.046574 |
3033 | 1360 | mace windu | 365 | man | 54.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.048570 |
3034 | 1360 | madame jocasta | 331 | woman | 82.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.044045 |
3035 | 1360 | obi-wan kenobi | 1477 | man | 31.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.196540 |
3036 | 1360 | padme | 668 | woman | 21.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.088889 |
3037 | 1360 | queen jamillia | 135 | woman | 25.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.017964 |
3038 | 1360 | supreme chancel | 576 | man | 58.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.076647 |
3039 | 1360 | taun we | 139 | woman | 40.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.018496 |
3040 | 1360 | watto | 140 | man | 49.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.018629 |
3041 | 1360 | yoda | 355 | man | 58.0 | Star Wars: Episode II - Attack of the Clones | 2002 | 465.0 | 0.047239 |
3058 | 1362 | ben obi-wan ken | 995 | man | 63.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.099084 |
3059 | 1362 | c-3po | 1506 | man | 31.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.149970 |
3060 | 1362 | cmdr. praji (im | 117 | man | 42.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.011651 |
3061 | 1362 | darth vader | 492 | man | 46.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.048994 |
3062 | 1362 | general dodonna | 185 | man | 66.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.018423 |
3063 | 1362 | grand moff tark | 411 | man | 64.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.040928 |
3064 | 1362 | greedo | 103 | man | 33.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.010257 |
3065 | 1362 | han solo | 1730 | man | 35.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.172276 |
3066 | 1362 | jabba | 185 | man | NaN | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.018423 |
3067 | 1362 | luke skywalker | 2485 | man | 26.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.247461 |
3068 | 1362 | princess leia o | 636 | woman | 21.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.063334 |
3069 | 1362 | red leader | 241 | man | 37.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.023999 |
3070 | 1362 | red three (bigg | 523 | man | 38.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.052081 |
3071 | 1362 | stormtrooper | 122 | man | 36.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.012149 |
3072 | 1362 | uncle owen | 311 | man | 61.0 | Star Wars: Episode IV - A New Hope | 1977 | 1798.0 | 0.030970 |
10591 | 3427 | anakin skywalke | 3768 | man | 24.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.271138 |
10592 | 3427 | c-3po | 312 | man | 59.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.022451 |
10593 | 3427 | Clone Jedi Knig | 104 | man | NaN | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.007484 |
10594 | 3427 | darth sldious | 244 | man | NaN | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.017558 |
10595 | 3427 | general grievou | 317 | man | 33.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.022811 |
10596 | 3427 | Ki-Adi-Mundi / | 107 | man | 40.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.007700 |
10597 | 3427 | Mace Windu | 407 | man | 57.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.029287 |
10598 | 3427 | mon mothma | 127 | woman | 28.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.009139 |
10599 | 3427 | obi-wan kenobi | 3181 | man | 34.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.228898 |
10600 | 3427 | Padme | 1561 | woman | 24.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.112326 |
10601 | 3427 | senator bail or | 479 | man | 50.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.034468 |
10602 | 3427 | Supreme Chancel | 2367 | man | 61.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.170325 |
10603 | 3427 | yoda | 923 | man | 61.0 | Star Wars: Episode III - Revenge of the Sith | 2005 | 515.0 | 0.066417 |
10604 | 3433 | admiral piett | 232 | man | 43.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.025127 |
10605 | 3433 | ben (obi-wan) k | 177 | man | 66.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.019170 |
10606 | 3433 | C-3PO | 1474 | man | 34.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.159645 |
10607 | 3433 | captain needa | 108 | man | 42.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.011697 |
10608 | 3433 | darth vader | 853 | man | 49.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.092386 |
10609 | 3433 | general veers | 114 | man | 45.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.012347 |
10610 | 3433 | han solo | 2024 | man | 38.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.219214 |
10611 | 3433 | lando calrissia | 798 | man | 43.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.086429 |
10612 | 3433 | luke skywalker | 1282 | man | 29.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.138850 |
10613 | 3433 | princess leia | 1025 | woman | 24.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.111015 |
10614 | 3433 | rebel force gen | 168 | man | 50.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.018196 |
10615 | 3433 | snow creature | 227 | man | NaN | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.024586 |
10616 | 3433 | yoda | 751 | man | 36.0 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 937.0 | 0.081339 |
10617 | 3437 | anakin skywalke | 1697 | man | 10.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.158037 |
10618 | 3437 | boss nass | 209 | man | 63.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.019464 |
10619 | 3437 | Captain Panaka | 390 | man | 45.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.036320 |
10620 | 3437 | chancellor valo | 127 | man | 61.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.011827 |
10621 | 3437 | darth sidious | 277 | man | NaN | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.025796 |
10622 | 3437 | jar jar binks | 632 | man | 26.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.058856 |
10623 | 3437 | mace windu | 247 | man | 51.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.023002 |
10624 | 3437 | nute gunray | 544 | man | 34.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.050661 |
10625 | 3437 | obi-wan kenobi | 701 | man | 28.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.065282 |
10626 | 3437 | padme | 718 | woman | 18.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.066865 |
10627 | 3437 | queen amidala | 656 | woman | 18.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.061091 |
10628 | 3437 | qui-gon jinn | 2339 | man | 47.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.217825 |
10629 | 3437 | ric oli� | 132 | man | 42.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.012293 |
10630 | 3437 | rune haako | 126 | man | NaN | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.011734 |
10631 | 3437 | Senator Palpati | 551 | man | 55.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.051313 |
10632 | 3437 | shmi skywalker | 414 | woman | 41.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.038555 |
10633 | 3437 | sio bibble | 164 | man | 60.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.015273 |
10634 | 3437 | watto | 502 | man | 46.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.046750 |
10635 | 3437 | yoda | 312 | man | 55.0 | Star Wars: Episode I - The Phantom Menace | 1999 | 813.0 | 0.029056 |
21000 | 8099 | bala-tik | 140 | man | NaN | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.021002 |
21001 | 8099 | c-3po | 138 | man | 69.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.020702 |
21003 | 8099 | finn | 1447 | man | 23.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.217072 |
21004 | 8099 | general hux | 356 | man | 32.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.053405 |
21005 | 8099 | han solo | 1225 | man | 73.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.183768 |
21006 | 8099 | kylo ren | 618 | man | 32.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.092709 |
21007 | 8099 | maz kanata | 241 | woman | 32.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.036154 |
21008 | 8099 | poe dameron | 604 | man | 36.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.090609 |
21009 | 8099 | princess leia | 252 | woman | 59.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.037804 |
21010 | 8099 | rey | 1363 | woman | 23.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.204470 |
21011 | 8099 | stormtrooper | 114 | man | NaN | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.017102 |
21012 | 8099 | supreme leader | 168 | man | 51.0 | Star Wars: Episode VII - The Force Awakens | 2015 | 927.0 | 0.025203 |
We can use the .str.contains()
to search for particular words or phrases in a column, such as “Mean Girls.”
film_df[film_df['title'].str.contains('Mean Girls')]
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|
13938 | 4605 | aaron samuels | 426 | man | 23.0 | Mean Girls | 2004 | 120.0 | 0.053890 |
13939 | 4605 | cady heron | 2798 | woman | 18.0 | Mean Girls | 2004 | 120.0 | 0.353953 |
13940 | 4605 | damian | 624 | man | 26.0 | Mean Girls | 2004 | 120.0 | 0.078937 |
13941 | 4605 | gretchen wiener | 609 | woman | 22.0 | Mean Girls | 2004 | 120.0 | 0.077040 |
13942 | 4605 | janis ian | 907 | woman | 22.0 | Mean Girls | 2004 | 120.0 | 0.114738 |
13943 | 4605 | karen smith | 301 | woman | 19.0 | Mean Girls | 2004 | 120.0 | 0.038077 |
13944 | 4605 | mr. duvall | 365 | man | 43.0 | Mean Girls | 2004 | 120.0 | 0.046173 |
13945 | 4605 | mrs. george | 125 | woman | 33.0 | Mean Girls | 2004 | 120.0 | 0.015813 |
13946 | 4605 | ms. norbury | 720 | woman | 34.0 | Mean Girls | 2004 | 120.0 | 0.091082 |
13947 | 4605 | regina george | 1030 | woman | 26.0 | Mean Girls | 2004 | 120.0 | 0.130297 |
Applying Functions#
With the .apply()
method, we can run a function on every single row in a Pandas column or dataframe.
def make_text_title_case(text):
title_case_text = text.title()
return title_case_text
make_text_title_case("betty")
'Betty'
film_df['character'].apply(make_text_title_case)
0 Betty
1 Carolyn Johnson
2 Eleanor
3 Francesca Johns
4 Madge
...
23046 Lumiere
23047 Maurice
23048 Monsieur D'Arqu
23049 Mrs. Potts
23050 Wardrobe
Name: character, Length: 23047, dtype: object
film_df['character'] = film_df['character'].apply(make_text_title_case)
film_df.sample(10)
script_id | character | words | gender | age | title | release_year | gross | proportion_of_dialogue | |
---|---|---|---|---|---|---|---|---|---|
21944 | 8559 | Joey Pinero | 288 | man | 40.0 | Primal Fear | 1996 | 110.0 | 0.021510 |
1698 | 1044 | Anna Mcdoogles | 1346 | woman | 37.0 | The Invention of Lying | 2009 | 21.0 | 0.160105 |
11675 | 3746 | Randy 'The Ram' | 5220 | man | 56.0 | The Wrestler | 2008 | 31.0 | 0.496009 |
801 | 821 | Peter | 804 | man | NaN | Contact | 1997 | 191.0 | 0.068449 |
15636 | 5236 | Michael Jackson | 151 | man | NaN | Willard | 1971 | 101.0 | 0.016402 |
1059 | 885 | Lord Howard | 154 | man | 65.0 | Elizabeth: The Golden Age | 2007 | 20.0 | 0.021050 |
9011 | 2949 | Cosmo Castorini | 1196 | man | 67.0 | Moonstruck | 1987 | 179.0 | 0.063957 |
473 | 745 | Deputy Da Ellis | 462 | man | 37.0 | The Black Dahlia | 2006 | 29.0 | 0.040477 |
17745 | 6535 | Rick Cowley | 150 | man | 46.0 | The Dark Half | 1993 | 20.0 | 0.014219 |
2086 | 1140 | Dr. Arthur Neum | 267 | man | 50.0 | The Mask | 1994 | 249.0 | 0.042995 |
Filter DataFrame
We can filter the DataFrames for only characters who are men or women.
men_film_df = film_df[film_df['gender'] == 'man']
women_film_df = film_df[film_df['gender'] == 'woman']
Groupby
We can use the .groupby()
function to group all the men characters in each film and sum up their total dialogue.
By adding a Python string slice, we can identify the top 20 films with the greatest proportion of men speaking.
men_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]
proportion_of_dialogue | |
---|---|
title | |
The Men Who Stare at Goats | 1.0 |
Kagemusha | 1.0 |
The Wild Bunch | 1.0 |
Killing Them Softly | 1.0 |
Stalag 17 | 1.0 |
There Will Be Blood | 1.0 |
Fury | 1.0 |
The Revenant | 1.0 |
Saving Private Ryan | 1.0 |
Crimson Tide | 1.0 |
Die Hard: With a Vengeance | 1.0 |
Jeremiah Johnson | 1.0 |
Conquest of the Planet of the Apes | 1.0 |
Duel | 1.0 |
The Hunt for Red October | 1.0 |
Platoon | 1.0 |
Reservoir Dogs | 1.0 |
The Battle of Algiers | 1.0 |
Armored | 1.0 |
Space Jam | 1.0 |
We can use the .groupby()
function to group all the women characters in each film and sum up their total dialogue.
By adding a Python string slice, we can identify the top 20 films with the greatest proportion of women speaking.
women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]
proportion_of_dialogue | |
---|---|
title | |
The Descent | 1.000000 |
Now and Then | 1.000000 |
Precious | 0.993541 |
Martyrs | 0.965550 |
The Hand That Rocks the Cradle | 0.933750 |
Agnes of God | 0.922482 |
Heavenly Creatures | 0.919368 |
The Help | 0.916947 |
3 Women | 0.899530 |
The Watermelon Woman | 0.894676 |
Grandma | 0.894137 |
The Craft | 0.891866 |
Easy A | 0.888131 |
The Others | 0.879972 |
Black Christmas | 0.877467 |
Wadjda | 0.870567 |
For Colored Girls | 0.870247 |
The Roommate | 0.869879 |
Mrs. Winterbourne | 0.864730 |
Practical Magic | 0.856693 |
Reset Index#
We can transform a Groupby object into a DataFrame with a regular Index by tacking on .reset_index()
.
women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False).reset_index()
title | proportion_of_dialogue | |
---|---|---|
0 | The Descent | 1.000000 |
1 | Now and Then | 1.000000 |
2 | Precious | 0.993541 |
3 | Martyrs | 0.965550 |
4 | The Hand That Rocks the Cradle | 0.933750 |
... | ... | ... |
1935 | The Last Castle | 0.011139 |
1936 | The Damned United | 0.010909 |
1937 | Thirteen Days | 0.010834 |
1938 | Men in Black 3 | 0.007812 |
1939 | Full Metal Jacket | 0.007170 |
1940 rows × 2 columns
Bonus — Interactive Data Visualization#
#Import necessary Bokeh modules
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, NumeralTickFormatter
from bokeh.io import output_notebook, show
from bokeh.palettes import RdBu
from bokeh.transform import linear_cmap, jitter
Here’s an interactive data visualization of these Hollywood films by release year and percentage of women dialogue. This data viz was created with the Python library Bokeh, which we will discuss in some later lessons.
Scroll to zoom in
Hover to see more information about each point
To see the code that created this visualization, select “Click to show” below.
Show code cell source
#Make groupby into a new DataFrame
dialogue_df = women_film_df.groupby(['title', 'release_year'])[['proportion_of_dialogue']].sum()\
.sort_values(by='proportion_of_dialogue', ascending=False).reset_index()
# Set up the source data that will suppply the x,y columns and the film title hover text
source = ColumnDataSource(dialogue_df)
# Set the hover tool tip to the film title, release year, and proportion of dialogue
TOOLTIPS = [("Title", "@title"),
("Year", "@release_year"),
("Women Dialogue", "@{proportion_of_dialogue}{%0.2f}")]
#Set up Bokeh plot with title, labels,
bokeh_plot = figure(title="How Much Do Women Speak in Hollywood Films?", x_axis_label = 'Film Release Year',
y_axis_label = 'Amount of Dialogue Spoken By Women',x_range = [1930, 2018], y_range = [0, 1.01],
tooltips=TOOLTIPS, width=800, height=550, active_scroll='wheel_zoom')
# Create a red to blue color palette
color_mapper = linear_cmap(field_name='proportion_of_dialogue', palette=RdBu[4], low=1.1, high=0)
# Supply inidivudal points values
bokeh_plot.circle(y='proportion_of_dialogue', x=jitter('release_year', width=.2),
size = 10,
line_color='black',
line_alpha=.4,
source=source,
color=color_mapper, alpha=.5)
bokeh_plot.title.text_font_size='20pt'
#Make Y axis percentages
bokeh_plot.yaxis.formatter = NumeralTickFormatter(format='0 %')
show(bokeh_plot)