Pandas Basics — Part 2#

Note: You can explore the associated workbook for this chapter in the cloud.

In this lesson, we’re going to introduce some more fundamentals of Pandas, a powerful Python library for working with tabular data like CSV files.

We will review skills learned from the last lesson and introduce how to:

  • Broadly examine data

  • Work with missing data

  • Rename, drop, and add new columns

  • Perform mathematical calculations

  • Aggregate subsets of data

  • Make a simple time series


Dataset#

The Trans-Atlantic Slave Trade Database#

[D]isplaying data alone could not and did not offer the atonement descendants of slaves sought or capture the inhumanity of this archive’s formation.

-Jessica Marie Johnson, “Markup Bodies”

The dataset that we’re going to be working with in this lesson is taken from The Trans-Atlantic Slave Trade Database, part of the Slave Voyages project. The larger database includes information about 35,000 slave-trading voyages from 1514-1866. The dataset we’re working with here was filtered to include the 20,000 voyages that landed in the Americas. The data was filtered to also include the percentage of enslaved men, women, and children on the voyages.

We’re working with this data for a number of reasons. The Slave Voyages project is a major data-driven contribution to the history of slavery and to the field of the digital humanities. Before the Trans-Atlantic Slave Trade Database, as DH scholar Jessica Johnson writes, “historians assumed enslaved women and children played a negligible role in the slave trade.” But evidence from the Trans-Atlantic Slave Trade Database suggested otherwise. “The existence of the Trans-Atlantic Slave Trade Database immediately reshaped debates about numbers of women and children exported from the continent,” Johnson says, “influencing work on women in the slave trade on the African coast, slavery in African societies, and women in the slave trade to the Americas.”

Though the Trans-Atlantic Slave Trade Database helped shed new light on the roles of enslaved women and children, Johnson makes clear that it was not computation or data alone that shed this light:

[D]isplaying data alone could not and did not offer the atonement descendants of slaves sought or capture the inhumanity of this archive’s formation. Culling the lives of women and children from the data set required approaching the data with intention. It required a methodology attuned to black life and to dismantling the methods used to create the manifests in the first place, then designing and launching an interface responsive to the desire of descendants of slaves for reparation and redress.

In this spirit, we want to think about how responsible data analysis requires more than just data and technical tools like Pandas. It requires approaching data with intention and developing methodologies geared toward justice. This is especially necessary when dealing with data that records and perpetrates violence like the Trans-Atlantic Slave Trade Database.


Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

The above import statement not only imports the Pandas library but also gives it an alias or nickname — pd. This alias will save us from having to type out the entire words pandas each time we need to use it. Many Python libraries have commonly used aliases like pd.

Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 100

Read in CSV File

Tip

If you use the help() function, you can see the documentation for almost any bit of code. If we run it on pd.read_csv(), we can see all the possible parameters that can be used with pd.read_csv(). help(pd.read_csv)

To read in a CSV file, we will use the function pd.read_csv() and insert the name of our desired file path.

slave_voyages_df = pd.read_csv('../data/Trans-Atlantic-Slave-Trade_Americas.csv', delimiter=",", encoding='utf-8')

This creates a Pandas DataFrame object — often abbreviated as df, e.g., slave_voyages_df. A DataFrame looks and acts a lot like a spreadsheet. But it has special powers and functions that we will discuss in the next few lessons.

When reading in the CSV file, we also specified the encoding and delimiter. The delimiter specifies the character that separates or “delimits” the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for Comma Separated Values.) Sometimes, however, the delimiter of a CSV file might be a tab (\t) or, more rarely, another character.

Display Data

We can display a DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.

Pandas Review

NaN is the Pandas value for any missing data. See “Working with missing data” for more information.

slave_voyages_df
year_of_arrival flag place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
0 1520 NaN Portuguese Guinea San Juan NaN NaN NaN 324.0 259.0 NaN NaN 42987 [u'AGI,Patronato 175, r.9<><p><em>AG!</em> (Se...
1 1525 Portugal / Brazil Sao Tome Hispaniola, unspecified NaN NaN NaN 359.0 287.0 NaN S Maria de Bogoña Monteiro, Pero 46473 [u'ANTT,CC,Parte II, maco 131, doc 54<><i>Inst...
2 1526 Spain / Uruguay Cape Verde Islands Cuba, port unspecified NaN NaN NaN 359.0 287.0 NaN Carega, Esteban (?) 11297 [u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...
3 1526 Spain / Uruguay Cape Verde Islands Cuba, port unspecified NaN NaN NaN 359.0 287.0 NaN Carega, Esteban (?) 11298 [u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...
4 1526 NaN Cape Verde Islands Caribbean (colony unspecified) NaN NaN NaN 359.0 287.0 NaN S Anton Leon, Juan de 42631 [u'Chaunus, 3: 162-63<><p>Chaunus, <em>xxxxxx<...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20736 1864 Spain / Uruguay Africa., port unspecified Cuba, port unspecified NaN NaN NaN 488.0 465.0 NaN Polaca NaN 46554 [u'AHNM, Ultramar, Leg. 3551, 6<><i>Archivo Hi...
20737 1865 Spain / Uruguay Africa., port unspecified Isla de Pinas NaN NaN NaN 152.0 145.0 Slave insurrection Gato NaN 4394 [u'IUP,ST,50/B/137<>Great Britain, <i>Irish Un...
20738 1865 NaN Africa., port unspecified Mariel NaN NaN NaN 780.0 650.0 NaN NaN 4395 [u'IUP,ST,50/B/144<>Great Britain, <i>Irish Un...
20739 1865 NaN Congo River Cuba, port unspecified NaN NaN NaN 1265.0 1004.0 NaN Cicerón Mesquita 5052 [u'IUP,ST,50/A/23-4<>Great Britain, <i>Irish U...
20740 1866 NaN Africa., port unspecified Cuba, port unspecified NaN NaN NaN 851.0 700.0 NaN NaN 4998 [u'IUP,ST,50/B/220<>Great Britain, <i>Irish Un...

20741 rows × 14 columns

There are a few important things to note about the DataFrame displayed here:

  • Index

    • The bolded ascending numbers in the very left-hand column of the DataFrame is called the Pandas Index. You can select rows based on the Index.

    • By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

  • Truncation

    • The DataFrame is truncated, signaled by the ellipses in the middle ... of every column.

    • The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.

  • Rows x Columns

    • Pandas reports how many rows and columns are in this dataset at the bottom of the output (20,741 x 14 columns).

Display First n Rows

To look at the first n rows in a DataFrame, we can use a method called .head().

slave_voyages_df.head(10)
year_of_arrival flag place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
0 1520 NaN Portuguese Guinea San Juan NaN NaN NaN 324.0 259.0 NaN NaN 42987 [u'AGI,Patronato 175, r.9<><p><em>AG!</em> (Se...
1 1525 Portugal / Brazil Sao Tome Hispaniola, unspecified NaN NaN NaN 359.0 287.0 NaN S Maria de Bogoña Monteiro, Pero 46473 [u'ANTT,CC,Parte II, maco 131, doc 54<><i>Inst...
2 1526 Spain / Uruguay Cape Verde Islands Cuba, port unspecified NaN NaN NaN 359.0 287.0 NaN Carega, Esteban (?) 11297 [u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...
3 1526 Spain / Uruguay Cape Verde Islands Cuba, port unspecified NaN NaN NaN 359.0 287.0 NaN Carega, Esteban (?) 11298 [u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...
4 1526 NaN Cape Verde Islands Caribbean (colony unspecified) NaN NaN NaN 359.0 287.0 NaN S Anton Leon, Juan de 42631 [u'Chaunus, 3: 162-63<><p>Chaunus, <em>xxxxxx<...
5 1526 NaN Cape Verde Islands San Domingo (a) Santo Domingo NaN NaN NaN 359.0 287.0 NaN Santa Maria de Guadalupe Pabon, Francisco 42679 [u'Chaunus, 3: 162-63<><p>Chaunus, <em>xxxxxx<...
6 1526 Portugal / Brazil Sao Tome Spanish Caribbean, unspecified NaN NaN NaN 359.0 287.0 NaN NaN 46474 [u'ANTT,CC,Parte II, maco 131, doc 54<><i>Inst...
7 1527 Spain / Uruguay Cape Verde Islands Puerto Rico, port unspecified NaN NaN NaN 325.0 260.0 NaN Concepción Díaz, Alonso 99027 [u'SuedBadillo,57,75,76<><p>SuedBadillo, <em>x...
8 1532 Portugal / Brazil Sao Tome Spanish Caribbean, unspecified NaN NaN NaN 359.0 287.0 NaN S Antônio Afonso, Martim 11293 [u'Ryder,66<>Ryder, A. F. C., <i>Benin</i><i> ...
9 1532 NaN Cape Verde Islands San Juan NaN NaN NaN 25.0 20.0 NaN de Illanes, Manuel 28994 [u'Tanodi, 321-22<>Tanodi, Aurelio, <i>Documen...

Examine Data#

Shape#

To explicitly check for how many rows vs columns make up a dataset, we can use the .shape method.

slave_voyages_df.shape
(20741, 14)

There are 20,741 rows and 14 columns.

Data Types#

Just like Python has different data types, Pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the .dtypes method.

Pandas Data Type

Explanation

object

string

float64

float

int64

integer

datetime64

date time

slave_voyages_df.dtypes
year_of_arrival        int64
flag                  object
place_of_purchase     object
place_of_landing      object
percent_women        float64
percent_children     float64
percent_men          float64
total_embarked       float64
total_disembarked    float64
resistance_label      object
vessel_name           object
captain's_name        object
voyage_id              int64
sources               object
dtype: object

It’s important to always check the data types in your DataFrame. For example, sometimes numeric values will accidentally be interpreted as a string object. To perform calculations on this data, you would need to first convert that column from a string to an integer.

Columns#

We can also check the column names of the DataFrame with .columns

slave_voyages_df.columns
Index(['year_of_arrival', 'flag', 'place_of_purchase', 'place_of_landing',
       'percent_women', 'percent_children', 'percent_men', 'total_embarked',
       'total_disembarked', 'resistance_label', 'vessel_name',
       'captain's_name', 'voyage_id', 'sources'],
      dtype='object')

Summary Statistics#

slave_voyages_df.describe(include='all')
year_of_arrival flag place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
count 20741.000000 19583 20663 20741 2894.000000 2927.000000 2894.000000 20722.000000 20719.000000 372 20741 19396 20741.000000 20741
unique NaN 8 156 187 NaN NaN NaN NaN NaN 6 5849 12233 NaN 13754
top NaN Great Britain Africa., port unspecified Barbados, port unspecified NaN NaN NaN NaN NaN Slave insurrection Smith, John NaN [u'mettas,I<>Mettas, Jean, <i>R\xe9pertoire d...
freq NaN 10536 5999 2038 NaN NaN NaN NaN NaN 330 712 36 NaN 1134
mean 1752.014850 NaN NaN NaN 0.274198 0.231582 0.496648 295.050381 251.573966 NaN NaN NaN 42783.741671 NaN
std 59.702189 NaN NaN NaN 0.116513 0.149508 0.140324 147.997690 128.050439 NaN NaN NaN 32401.785320 NaN
min 1520.000000 NaN NaN NaN 0.000000 0.000000 0.000000 1.000000 1.000000 NaN NaN NaN 112.000000 NaN
25% 1724.000000 NaN NaN NaN 0.195265 0.115380 0.407460 194.000000 163.000000 NaN NaN NaN 17862.000000 NaN
50% 1765.000000 NaN NaN NaN 0.264110 0.215100 0.497890 282.000000 241.000000 NaN NaN NaN 31916.000000 NaN
75% 1792.000000 NaN NaN NaN 0.346150 0.321900 0.586765 368.000000 313.000000 NaN NaN NaN 78283.000000 NaN
max 1866.000000 NaN NaN NaN 1.000000 1.000000 1.000000 2024.000000 1700.000000 NaN NaN NaN 900206.000000 NaN

Missing Data#

The conceit of the archive is that it is the repository of answers, of knowable conclusions, of the data needed to explain or understand the past.

The reality, however, is that the archive is the troubled genesis of our always-failed effort to unravel the effects of the past on the present; rather than verifiable truths, the archive — and its silences — house the very questions that unsettle us.

—Jennifer Morgan, “Accounting for ‘The Most Excruciating Torment’”

Responsible data analysis requires understanding missing data. The Trans-Atlantic Slave Trade Database, as historian Jennifer Morgan writes, contains innumerable “silences” and “gaps.” These silences include the thoughts, feelings, and experiences of the enslaved African people on board the voyages — silences that cannot be found in the database itself.

There are other kinds of silences and gaps that can be detected in the database itself, however. For example, while some of the voyages in the the Trans-Atlantic Slave Trade Database recorded information about how many enslaved women and children were aboard, most did not. Yet focusing on the data that is there and analyzing trends in the missing data can help shed light on the history of gender and enslavement. The fact that most ship captains did not record gender information, Morgan argues, helps tell us about their “priorities”: “[W]e can assume that had it been financially significant to have more men than women that data would have been more scrupulously recorded.”

.isna() / .notna()#

Pandas has special ways of dealing with missing data. As you may have already noticed, blank rows in a CSV file show up as NaN in a Pandas DataFrame.

To filter and count the number of missing/not missing values in a dataset, we can use the special .isna() and .notna() methods on a DataFrame or Series object.

slave_voyages_df['percent_women'].notna()
0        False
1        False
2        False
3        False
4        False
         ...  
20736    False
20737    False
20738    False
20739    False
20740    False
Name: percent_women, Length: 20741, dtype: bool

The .isna() and .notna() methods return True/False pairs for each row, which we can use to filter the DataFrame for any rows that have information in a given column. For example, we can filter the DataFrame for only rows that have information about the percentage of enslaved women aboard the voyage.

slave_voyages_df[slave_voyages_df['percent_women'].notna()]
year_of_arrival flag place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
938 1613 Portugal / Brazil Luanda Santo Tomas 0.30556 0.20588 0.69444 362.0 290.0 NaN NS de Nazareth Gómez, Juan 47352 [u'AGI-Esc 38B, pieza 2, folios 427r-427v<><p>...
1044 1619 Portugal / Brazil Luanda Veracruz 0.21127 0.21596 0.57277 349.0 279.0 NaN S Antônio Acosta, Jacome de 29248 [u'Vila Vilar,Cuadro3<><p>Vila Vilar, Enriquet...
1115 1620 Portugal / Brazil Luanda Buenos Aires 0.13043 0.29193 0.57764 381.0 304.0 NaN NS de Consolación Acosta, Gonçalo 29561 [u'AGI, Indiferente General, 2795<><p><em>AG!<...
1117 1620 NaN Luanda Cumana 0.29570 0.33571 0.70430 421.0 337.0 NaN NS de Rocha Sosa, Nicolás de<br/> Estéves, Domingo<br/> Ro... 29941 [u'AGI, Contratacion, 2881<><p><em>AG!</em> (S...
1334 1628 Portugal / Brazil West Central Africa and St. Helena, port unspe... Spanish Circum-Caribbean,unspecified 0.16908 0.58454 0.24638 303.0 242.0 NaN S Pedro Silva, Jacinto da 29568 [u'AGI, Indiferente General, 2796<><p><em>AG!<...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20295 1841 Portugal / Brazil Rio Pongo Cuba, port unspecified 0.20548 0.21233 0.58219 324.0 292.0 NaN Segunda Rosália Peirano, Francisco 2078 [u'PP,1845,XLIX:593-633<>Great Britain, <i>Par...
20321 1841 Spain / Uruguay Africa., port unspecified Bahamas, port unspecified 0.15758 0.31548 0.53939 215.0 193.0 NaN Trovadore Velasea, de Bonita 5503 [u'Dalleo,24<>Dalleo, Peter D.,"Africans in th...
20429 1850 NaN Benguela British Caribbean, colony unspecified 0.00000 1.00000 0.00000 94.0 74.0 NaN Amélia Oliveira, José 4674 [u'IUP,ST,38/A/208<>Great Britain, <i>Irish Un...
20498 1854 U.S.A. Whydah Bahia Honda 0.45455 0.08333 0.54545 600.0 584.0 NaN Grey Eagle Darnaud 4190 [u'FO84/965,Crawford,55.02.07,enc<><p><em>BNA<...
20555 1857 NaN Cabinda Kingston 0.08943 0.00000 0.91057 500.0 362.0 NaN Zeldina NaN 4229 [u'IUP,ST,44/A/44,161<>Great Britain, <i>Irish...

2894 rows × 14 columns

The data is now filtered to only include the 2,894 rows with information about how many women were aboard the voyage.

To explicitly count the number of blank rows, we can use the .value_counts() method.

slave_voyages_df['percent_women'].isna().value_counts()
True     17847
False     2894
Name: percent_women, dtype: int64

There are 17,874 that do not contain information about the number of enslaved women on the voyage (isna = True) and 2,894 rows that do contain this information (isna = False).

To quickly transform these numbers into percentages, we can set the normalize= parameter to True.

slave_voyages_df['percent_women'].isna().value_counts(normalize=True)
True     0.86047
False    0.13953
Name: percent_women, dtype: float64

About 14% of rows in this dataset have information about the number of enslaved women on the voyage while 86% do not.

.count()#

Because the .count() method always excludes NaN values, we can also count the number of values in each column and divide by the total number of rows in each column (len()) to find the percentage of not blank data in every column.

slave_voyages_df.count() / len(slave_voyages_df)
year_of_arrival      1.000000
flag                 0.944169
place_of_purchase    0.996239
place_of_landing     1.000000
percent_women        0.139530
percent_children     0.141121
percent_men          0.139530
total_embarked       0.999084
total_disembarked    0.998939
resistance_label     0.017935
vessel_name          1.000000
captain's_name       0.935153
voyage_id            1.000000
sources              1.000000
dtype: float64

For example, 100% of the rows in the columns “year_of_arrival” contain information, while 2% of the rows in the column “resistance_label” contain information. The “resistance_label” indicates whether there is a record of the enslaved Africans aboard the voyage staging some form of resistance.

.fillna()#

If we wanted, we could fill the NaN values in the DataFrame with a different value by using the .fillna() method.

slave_voyages_df['percent_women'].fillna('no gender information recorded')
0        no gender information recorded
1        no gender information recorded
2        no gender information recorded
3        no gender information recorded
4        no gender information recorded
                      ...              
20736    no gender information recorded
20737    no gender information recorded
20738    no gender information recorded
20739    no gender information recorded
20740    no gender information recorded
Name: percent_women, Length: 20741, dtype: object

Rename Columns#

We can rename columns with the .rename() method and the columns= parameter. For example, we can rename the “flag” column “national_affiliation.”

slave_voyages_df.rename(columns={'flag': 'national_affiliation'})
year_of_arrival national_affiliation place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
0 1520 NaN Portuguese Guinea San Juan NaN NaN NaN 324.0 259.0 NaN NaN 42987 [u'AGI,Patronato 175, r.9<><p><em>AG!</em> (Se...
1 1525 Portugal / Brazil Sao Tome Hispaniola, unspecified NaN NaN NaN 359.0 287.0 NaN S Maria de Bogoña Monteiro, Pero 46473 [u'ANTT,CC,Parte II, maco 131, doc 54<><i>Inst...
2 1526 Spain / Uruguay Cape Verde Islands Cuba, port unspecified NaN NaN NaN 359.0 287.0 NaN Carega, Esteban (?) 11297 [u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...
3 1526 Spain / Uruguay Cape Verde Islands Cuba, port unspecified NaN NaN NaN 359.0 287.0 NaN Carega, Esteban (?) 11298 [u'Pike,60-1,172<>Pike, Ruth, <i>Enterprise</i...
4 1526 NaN Cape Verde Islands Caribbean (colony unspecified) NaN NaN NaN 359.0 287.0 NaN S Anton Leon, Juan de 42631 [u'Chaunus, 3: 162-63<><p>Chaunus, <em>xxxxxx<...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20736 1864 Spain / Uruguay Africa., port unspecified Cuba, port unspecified NaN NaN NaN 488.0 465.0 NaN Polaca NaN 46554 [u'AHNM, Ultramar, Leg. 3551, 6<><i>Archivo Hi...
20737 1865 Spain / Uruguay Africa., port unspecified Isla de Pinas NaN NaN NaN 152.0 145.0 Slave insurrection Gato NaN 4394 [u'IUP,ST,50/B/137<>Great Britain, <i>Irish Un...
20738 1865 NaN Africa., port unspecified Mariel NaN NaN NaN 780.0 650.0 NaN NaN 4395 [u'IUP,ST,50/B/144<>Great Britain, <i>Irish Un...
20739 1865 NaN Congo River Cuba, port unspecified NaN NaN NaN 1265.0 1004.0 NaN Cicerón Mesquita 5052 [u'IUP,ST,50/A/23-4<>Great Britain, <i>Irish U...
20740 1866 NaN Africa., port unspecified Cuba, port unspecified NaN NaN NaN 851.0 700.0 NaN NaN 4998 [u'IUP,ST,50/B/220<>Great Britain, <i>Irish Un...

20741 rows × 14 columns

Renaming the “flag” column as above will only momentarily change that column’s name, however. If we display our DataFrame, we will see that the column name has not changed permamently.

slave_voyages_df.head(1)
year_of_arrival flag place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
0 1520 NaN Portuguese Guinea San Juan NaN NaN NaN 324.0 259.0 NaN NaN 42987 [u'AGI,Patronato 175, r.9<><p><em>AG!</em> (Se...

To save changes in the DataFrame, we need to reassign the DataFrame to the same variable.

slave_voyages_df = slave_voyages_df.rename(columns={'flag': 'national_affiliation'})
slave_voyages_df.head(1)
year_of_arrival national_affiliation place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id sources
0 1520 NaN Portuguese Guinea San Juan NaN NaN NaN 324.0 259.0 NaN NaN 42987 [u'AGI,Patronato 175, r.9<><p><em>AG!</em> (Se...

Drop Columns#

We can remove a column from the DataFrame with the .drop() method and the column name.

slave_voyages_df = slave_voyages_df.drop(columns="sources")
slave_voyages_df.columns
Index(['year_of_arrival', 'flag', 'place_of_purchase', 'place_of_landing',
       'percent_women', 'percent_children', 'percent_men', 'total_embarked',
       'total_disembarked', 'resistance_label', 'vessel_name',
       'captain's_name', 'voyage_id'],
      dtype='object')

Add Columns#

To add a column, we simply put a new column name in square brackets and set it equal to whatever we want the new column to be.

For example, if we wanted to create new columns for the total women and men aboard each voyage, we could set them equal to the product of the “total_disembarked” column * the “percent_women” / “percent_men” columns.

slave_voyages_df['total_women'] = slave_voyages_df['total_embarked'] * slave_voyages_df['percent_women']
slave_voyages_df['total_men'] = slave_voyages_df['total_embarked'] * slave_voyages_df['percent_men']

If we scroll all the way to the right side of the DataFrame, we can see that these columns have been added.

slave_voyages_df.head(1)
year_of_arrival national_affiliation place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id total_women total_men
0 1520 NaN Portuguese Guinea San Juan NaN NaN NaN 324.0 259.0 NaN NaN 42987 NaN NaN

Sort Columns#

We can sort a DataFrame with the .sort_values() method, inside of which we include the parameter by= and indicate the name of the column we want to sort by (written in quotation marks).

For example, we can sort the DataFrame by the voyages that had the largest proportion of enslaved women aboard.

slave_voyages_df.sort_values(by='percent_women', ascending=False)
year_of_arrival national_affiliation place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id total_women total_men
19042 1819 Spain / Uruguay Luanda Havana 1.00000 0.00000 0.00000 82.0 73.0 NaN S José Diligente Estepa 14378 82.00000 0.00000
4269 1715 France Senegambia and offshore Atlantic, port unspeci... Cap Francais 0.74747 0.16162 0.09091 120.0 118.0 NaN Elisabeth Beliveau 31883 89.69640 10.90920
1741 1659 Netherlands Bonny Jamaica, port unspecified 0.72500 0.05882 0.27500 195.0 85.0 NaN S Jan Veer, Adriaen Blaes van der 21566 141.37500 53.62500
19047 1819 Spain / Uruguay Africa., port unspecified Havana 0.71429 0.00000 0.28571 178.0 161.0 NaN Descubridor Benoit<br/> Valdes 14868 127.14362 50.85638
18950 1818 Spain / Uruguay Africa., port unspecified Havana 0.69355 0.30645 0.00000 69.0 62.0 NaN Vencedor Iturrondo 14857 47.85495 0.00000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20736 1864 Spain / Uruguay Africa., port unspecified Cuba, port unspecified NaN NaN NaN 488.0 465.0 NaN Polaca NaN 46554 NaN NaN
20737 1865 Spain / Uruguay Africa., port unspecified Isla de Pinas NaN NaN NaN 152.0 145.0 Slave insurrection Gato NaN 4394 NaN NaN
20738 1865 NaN Africa., port unspecified Mariel NaN NaN NaN 780.0 650.0 NaN NaN 4395 NaN NaN
20739 1865 NaN Congo River Cuba, port unspecified NaN NaN NaN 1265.0 1004.0 NaN Cicerón Mesquita 5052 NaN NaN
20740 1866 NaN Africa., port unspecified Cuba, port unspecified NaN NaN NaN 851.0 700.0 NaN NaN 4998 NaN NaN

20741 rows × 15 columns

By default, Pandas will sort in “ascending” order, from the smallest value to the largest value. If we want to sort the largest values first, we need to include another parameter ascending=False.

Because the DataFrame is truncated when it has more than 100 rows, we can use a Python list slice to view the top 30 (or any number less than 100) voyages with enslaved women aboard.

slave_voyages_df.sort_values(by='percent_women', ascending=False)[:30]
year_of_arrival national_affiliation place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id total_women total_men
19042 1819 Spain / Uruguay Luanda Havana 1.00000 0.00000 0.00000 82.0 73.0 NaN S José Diligente Estepa 14378 82.00000 0.00000
4269 1715 France Senegambia and offshore Atlantic, port unspeci... Cap Francais 0.74747 0.16162 0.09091 120.0 118.0 NaN Elisabeth Beliveau 31883 89.69640 10.90920
1741 1659 Netherlands Bonny Jamaica, port unspecified 0.72500 0.05882 0.27500 195.0 85.0 NaN S Jan Veer, Adriaen Blaes van der 21566 141.37500 53.62500
19047 1819 Spain / Uruguay Africa., port unspecified Havana 0.71429 0.00000 0.28571 178.0 161.0 NaN Descubridor Benoit<br/> Valdes 14868 127.14362 50.85638
18950 1818 Spain / Uruguay Africa., port unspecified Havana 0.69355 0.30645 0.00000 69.0 62.0 NaN Vencedor Iturrondo 14857 47.85495 0.00000
19037 1819 Spain / Uruguay Ambriz Havana 0.68456 0.00000 0.31544 164.0 149.0 NaN Supremo Habanero Blanco, Pedro 14372 112.26784 51.73216
4894 1721 France Benin Martinique, port unspecified 0.67797 0.15254 0.16949 69.0 59.0 NaN Saint Jean Triomphant Gaulart, J 33087 46.77993 11.69481
4150 1714 Great Britain Calabar Saint John (Antigua) 0.66667 0.16667 0.33333 360.0 240.0 NaN Florida Payne, Samuel 75489 240.00120 119.99880
6259 1734 Netherlands Jacquin Suriname 0.66480 0.09078 0.24441 716.0 374.0 NaN Rusthof Bovenmans, Anthony<br/> Theunissen, Willem 10278 475.99680 174.99756
1823 1664 Great Britain New Calabar Barbados, port unspecified 0.65217 0.08696 0.26087 242.0 168.0 NaN Elizabeth Francis, William 9558 157.82514 63.13054
1799 1663 Great Britain Ardra Barbados, port unspecified 0.64667 0.00000 0.35333 373.0 150.0 NaN Blackamore Bowles 9552 241.20791 131.79209
15471 1792 Great Britain Calabar Grenada, port unspecified 0.64646 0.06965 0.29798 200.0 135.0 NaN Fame Jenkins, William 18138 129.29200 59.59600
1824 1664 Great Britain Bight of Biafra and Gulf of Guinea Islands, po... Jamaica, port unspecified 0.64324 0.12973 0.22703 267.0 185.0 NaN Allepine Thwaites, William 9559 171.74508 60.61701
5049 1723 Great Britain Cabinda Jamaica, port unspecified 0.64141 0.26768 0.09091 201.0 198.0 NaN Royal African Packet Cornwell, Henry 76460 128.92341 18.27291
2151 1677 Great Britain Calabar Barbados, port unspecified 0.63816 0.03289 0.32895 219.0 152.0 NaN Lenox Wilkins, Amos 9980 139.75704 72.04005
17618 1804 Spain / Uruguay Africa., port unspecified Buenos Aires 0.63333 0.05000 0.31667 160.0 81.0 NaN S Ana Amaro, Rodriguez de Oliveira 96099 101.33280 50.66720
19040 1819 Spain / Uruguay Luanda Havana 0.62201 0.00000 0.37799 219.0 209.0 NaN Clarita Carballo, Alberto 14375 136.22019 82.77981
2318 1681 Great Britain Gold Coast, port unspecified Jamaica, port unspecified 0.62048 0.07831 0.30120 400.0 170.0 NaN African Mingham, John 9918 248.19200 120.48000
4364 1716 France Goree Leogane 0.62025 0.20736 0.37975 411.0 402.0 NaN Duc d'Orléans Villers, Chevalier de 31886 254.92275 156.07725
5259 1725 France Senegambia and offshore Atlantic, port unspeci... Cap Francais 0.61353 0.05314 0.33333 210.0 207.0 NaN Prothée Hantier, J 32873 128.84130 69.99930
4773 1720 France Formosa Martinique, port unspecified 0.61207 0.25862 0.12931 215.0 176.0 Slave insurrection Don de Dieu Masson Du Bocage, Guillaume 33119 131.59505 27.80165
2591 1686 Great Britain New Calabar Newcastle (Nevis) 0.60656 0.14754 0.24590 267.0 111.0 NaN Francis Wimple, Ralph 9838 161.95152 65.65530
2182 1678 Great Britain Calabar Jamaica, port unspecified 0.60526 0.04605 0.34868 219.0 152.0 NaN Vine Coleman, John 9987 132.55194 76.36092
5543 1727 Great Britain Africa., port unspecified Rappahannock 0.60440 0.35000 0.39560 161.0 140.0 NaN John and Betty Denton, William 92352 97.30840 63.69160
2540 1685 Great Britain Cape Coast Castle Jamaica, port unspecified 0.59917 0.00000 0.40083 310.0 242.0 NaN Jefferie Daile, Benjamin 9669 185.74270 124.25730
4736 1719 France Africa., port unspecified Cayenne 0.59524 0.16667 0.23810 52.0 42.0 NaN Fidèle Bonneau, Jean 30111 30.95248 12.38120
2327 1681 Great Britain New Calabar Barbados, port unspecified 0.59231 0.06923 0.33846 232.0 130.0 NaN Hope Grible, John 9927 137.41592 78.52272
2377 1682 Great Britain Calabar Jamaica, port unspecified 0.58879 0.08411 0.32710 164.0 114.0 NaN Arthur Penny, George 9898 96.56156 53.64440
2376 1682 Great Britain Calabar Newcastle (Nevis) 0.58696 0.11957 0.29348 146.0 119.0 NaN Eaglet Waugh, John 9896 85.69616 42.84808
1833 1664 Great Britain Calabar Barbados, port unspecified 0.58549 0.15544 0.25907 278.0 193.0 NaN Success Duke, Abraham 9584 162.76622 72.02146

If we want to sort a Series object, we don’t need to use the by= paramter.

slave_voyages_df['total_women'].sort_values(ascending=False)
6259     475.996800
18822    316.640164
2264     311.833480
6163     303.000180
3129     301.764000
            ...    
20736           NaN
20737           NaN
20738           NaN
20739           NaN
20740           NaN
Name: total_women, Length: 20741, dtype: float64

Calculate Columns#

We can do different calculations on columns with built-in Pandas functions. These calculations will ignore NaN values.

Pandas calculations

Explanation

.count()

Number of observations

.sum()

Sum of values

.mean()

Mean of values

.median()

Median of values

.min()

Minimum

.max()

Maximum

.mode()

Mode

.std()

Unbiased standard deviation

For example, to find the average proprotion of enslaved women aboard the voyages (for voyages that have this information), we can use the .mean() method.

slave_voyages_df['percent_women'].mean()
0.27419760653973735

There were on average 27% enslaved women aboard the voyages for voyages that recorded this information.

slave_voyages_df['percent_women'].max()
1.0

The highest percentage of women aboard the slave voyages was 100%. We can use this calculation as a filter to identify the voyage(s) with this maximum value.

slave_voyages_df[slave_voyages_df['percent_women'] == slave_voyages_df['percent_women'].max()]
year_of_arrival national_affiliation place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id total_women total_men
19042 1819 Spain / Uruguay Luanda Havana 1.0 0.0 0.0 82.0 73.0 NaN S José Diligente Estepa 14378 82.0 0.0

According to the Trans-Atlantic Slave Trade Database, the 1819 voyage of the S José Diligente had 100% enslaved women aboard.

As demonstrated previously, we can also perform calculations with columns themselves.

(slave_voyages_df['total_embarked'] * slave_voyages_df['percent_women']).max()
475.99679999999995

Groupby Columns#

The Pandas function.groupby() allows us to group data and perform calculations on the groups.

For example, Jennifer Morgan writes about how some nations recorded more information about the gender of the enslaved people aboard their voyages than other nations did. To see the breakdown of gender information by nation, we can use a .groupby() function.

The first step to using groupby is to type the name of the DataFrame followed by .groupby() with the column we’d like to aggregate based on, such as “national_affiliation.”

slave_voyages_df.groupby('national_affiliation')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdd2d0c2150>

This action will created a GroupBy object. We can perform calculations on this grouped data, such as counting the number of non-blank values in each column for each nation.

slave_voyages_df.groupby('national_affiliation').count()
year_of_arrival place_of_purchase place_of_landing percent_women percent_children percent_men total_embarked total_disembarked resistance_label vessel_name captain's_name voyage_id total_women total_men
national_affiliation
Denmark / Baltic 290 290 290 119 119 119 290 290 8 290 163 290 119 119
France 3381 3377 3381 708 709 708 3381 3381 121 3381 3289 3381 708 708
Great Britain 10536 10530 10536 1123 1151 1123 10526 10525 152 10536 10226 10536 1123 1123
Netherlands 1389 1341 1389 200 201 200 1387 1387 51 1389 1316 1389 200 200
Other 4 4 4 0 0 0 4 4 0 4 2 4 0 0
Portugal / Brazil 1009 1009 1009 42 48 42 1009 1009 0 1009 874 1009 42 42
Spain / Uruguay 1528 1524 1528 468 465 468 1528 1528 4 1528 1323 1528 468 468
U.S.A. 1446 1436 1446 223 223 223 1443 1442 31 1446 1283 1446 223 223

We can also isolate only the “percent_women” column.

slave_voyages_df.groupby('national_affiliation').count()['percent_women']
national_affiliation
Denmark / Baltic      119
France                708
Great Britain        1123
Netherlands           200
Other                   0
Portugal / Brazil      42
Spain / Uruguay       468
U.S.A.                223
Name: percent_women, dtype: int64
slave_voyages_df.groupby('national_affiliation')['percent_women'].count().sort_values(ascending=False)
national_affiliation
Great Britain        1123
France                708
Spain / Uruguay       468
U.S.A.                223
Netherlands           200
Denmark / Baltic      119
Portugal / Brazil      42
Other                   0
Name: percent_women, dtype: int64
slave_voyages_df.groupby('national_affiliation')['percent_women'].count()\
.sort_values(ascending=False).plot(kind='bar', title='Trans-Atlantic Slave Trade (Americas): \n Slave Voyages with Recorded Gender Information')
<matplotlib.axes._subplots.AxesSubplot at 0x7fdd2bda9750>
../_images/a596e6ca1c03e258e362a46f105c6639938d9157617fe8a79d01458cc77319f1.png

Make Time Series with Groupby#

To make a time series, we would typically want to convert our date column into datetime values rather than integers.

slave_voyages_df['year_of_arrival'].dtype
dtype('int64')

Datetime values allow us to do special things that we can’t do with regular integers and floats, such as extract just the year, month, week, day, or second from any date or aggregate based on any of the above.

However, we can also make some simple time series plots just by grouping by the year column and performing calculations on those year groupings, such as calculating the average percentage of enslaved women aboard the voyages over time.

slave_voyages_df.groupby('year_of_arrival')['total_women'].sum()
year_of_arrival
1520    0.0
1525    0.0
1526    0.0
1527    0.0
1532    0.0
       ... 
1862    0.0
1863    0.0
1864    0.0
1865    0.0
1866    0.0
Name: total_women, Length: 330, dtype: float64
total_women_by_year = slave_voyages_df.groupby('year_of_arrival')['total_women'].sum()
total_women_by_year.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fbc383adb50>
../_images/b4b4f8fd3dd0b890d69002de86ecec4b863eb5519f5a97d1380d6888703b1c25.png
total_women_by_year.plot(kind='line', title="Trans-Atlantic Slave Trade (Americas):\nTotal Number of Enslaved Women on Voyages")
<matplotlib.axes._subplots.AxesSubplot at 0x7fbc39cfa110>
../_images/d8ecb65a7ef1c8268aca1e578b934bc2a3fe87c9cd44ccdd75a04da0fbfeed48.png

We can put different plots on the same axes by assigning one of the plots to the variable ax, short for axes, and then using ax=ax in the other plot to explicitly put it on the same axes.

total_men_by_year = slave_voyages_df.groupby('year_of_arrival')['total_men'].sum()
ax = total_women_by_year.plot(kind='line', legend= True,
                              title="Trans-Atlantic Slave Trade (Americas):\nTotal Number of Enslaved Women on Voyages")
total_men_by_year.plot(ax=ax, legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbc39ec9c90>
../_images/64d37a6b07a773173c7607e82fc216f21b17b15488fc1df3753c46dc668205fb.png

We can change the labels in a legend by using the label= parameter.

ax = total_women_by_year.plot(kind='line', label="Total Women", legend= True, title="Trans-Atlantic Slave Trade (Americas):\nTotal Number of Enslaved Women on Voyages")
total_men_by_year.plot(ax=ax, label="Total Men", legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbc3a004710>
../_images/d6747e13a8d8527748578dee762037ed0d16b00629a644cf520f1655ff0a9a8b.png

Finally, we can also add in the total number of enslaved people who embarked on the voyages, offering a perspective of how much gender information we have about the voyages compared to the total number of voyages.

total_embarked_by_year = slave_voyages_df.groupby('year_of_arrival')['total_embarked'].sum()
ax = total_women_by_year.plot(kind='line', label="Total Women", legend= True, title="Trans-Atlantic Slave Trade (Americas):\nTotal Number of Enslaved Women on Voyages")
total_men_by_year.plot(ax=ax, label="Total Men", legend=True)
total_embarked_by_year.plot(ax=ax, label="Total Embarked", legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbc3a0f8f10>
../_images/75a3243fa227a435e058dadb40a7575ee06189f7193c86009092f04499096c2e.png

Save Plots#

To save a plot as an image file or PDF file, we can again assign the plot to a variable called ax, short for axes.

Then we can use ax.figure.savefig('FILE-NAME.png') or ax.figure.savefig('FILE-NAME.pdf').

ax = total_women_by_year.plot(kind='line', label="Total Women", legend= True, title="Trans-Atlantic Slave Trade (Americas):\nTotal Number of Enslaved Women on Voyages")
total_men_by_year.plot(ax=ax, label="Total Men", legend=True)
total_embarked_by_year.plot(ax=ax, label="Total Embarked", legend=True)

ax.figure.savefig('Trans-Atlantic-Slave-Trade_Gender-Info.png')

Prevent Labels From Getting Cut Off#

If labels are getting cut off in your image, you can explicitly import matplotlib.pyplot (the data viz Python library that Pandas .plot()s are built on) and use the tight_layout() function:

import matplotlib.pyplot as plt

ax = total_women_by_year.plot(kind='line', label="Total Women", legend= True, title="Trans-Atlantic Slave Trade (Americas):\nTotal Number of Enslaved Women on Voyages")
total_men_by_year.plot(ax=ax, label="Total Men", legend=True)
total_embarked_by_year.plot(ax=ax, label="Total Embarked", legend=True)

plt.tight_layout()
ax.figure.savefig('Trans-Atlantic-Slave-Trade_Gender-Info.png')