Visualize Author Gender By Year (Solution)

Categories: seaborn, plotly, interactive, line-plot, bar-plot, groupby, python, solution

Published: February 26, 2024

Import pandas

Code
import pandas as pd

Read in CSV

Code
aa_df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/aa-periodical-poetry/AAPADA-Periodical-Poetry_1900-1928.csv")

View first 5 rows

Code
aa_df.head()
  | title | author (first last) | author (last name) | text | month | year | venue | edited by | form (if known) | gender (if known) | ... | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Unnamed: 29 | Unnamed: 30 | Unnamed: 31
0 | Neath the Crown and Maple Leaf | A.R. Abbott | Abbott | A SIGH is breathed from million heart.\nFrom S... | March | 1901.0 | Colored American | Walter W. Wallace | Elegy, Common Measure | male | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 | The Friend I Love | Ada M. Wright | Wright | I LOVE a friend whose cheering voice\nCan soot... | February | 1901.0 | Colored American | Walter W. Wallace | NaN | female | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
2 | Black Madonna | Albert Rice | Rice | Not as the white nations\n know thee\n ... | October | 1926.0 | Palms | Countee Cullen | NaN | male | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3 | The Baby | Alexander Seymour | Seymour | Sweetest little t'ing on earth,\nNo one knows ... | March | 1927.0 | The Messenger | A. Philip Randolph, Chandler Owen, George S. S... | NaN | male | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
4 | April Is On The Way | Alice Dunbar Nelson | Dunbar-Nelson | April is on the way! \n \nI saw the scarlet f... | December | 1927.0 | Ebony and Topaz | Charles S. Johnson | NaN | female | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN

5 rows × 32 columns

Convert year to datetime value

Code
# Parse the numeric year column into datetime values (January 1 of each year)
aa_df['year'] = pd.to_datetime(aa_df['year'], format="%Y")

#aa_df.set_index('year').reindex(pd.date_range('1900-01-01', '1930-12-31', freq='Y'))
      
# Sort years and return unique years to check for missing data
aa_df['year'].sort_values().unique()
<DatetimeArray>
['1900-01-01 00:00:00', '1901-01-01 00:00:00', '1902-01-01 00:00:00',
 '1904-01-01 00:00:00', '1905-01-01 00:00:00', '1906-01-01 00:00:00',
 '1907-01-01 00:00:00', '1908-01-01 00:00:00', '1911-01-01 00:00:00',
 '1912-01-01 00:00:00', '1913-01-01 00:00:00', '1914-01-01 00:00:00',
 '1915-01-01 00:00:00', '1916-01-01 00:00:00', '1917-01-01 00:00:00',
 '1918-01-01 00:00:00', '1919-01-01 00:00:00', '1920-01-01 00:00:00',
 '1921-01-01 00:00:00', '1922-01-01 00:00:00', '1923-01-01 00:00:00',
 '1924-01-01 00:00:00', '1925-01-01 00:00:00', '1926-01-01 00:00:00',
 '1927-01-01 00:00:00', '1928-01-01 00:00:00',                 'NaT']
Length: 27, dtype: datetime64[ns]

The years 1903, 1909, and 1910 do not appear at all, and at least one row has no year recorded (NaT). We will add the missing years back in shortly.
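
Rather than scanning the array by eye, the gaps can be listed by comparing the years present against the full expected range. The following is a minimal sketch, not part of the original solution: `year_col` is a hypothetical stand-in for `aa_df['year']` that hard-codes a few of the years shown above, plus one missing value.

```python
import pandas as pd

# Hypothetical stand-in for aa_df['year']: a slice of the years seen above,
# with None playing the role of a row that has no year recorded
year_col = pd.to_datetime(
    pd.Series(["1902", "1904", "1905", "1906", "1907", "1908", "1911", None]),
    format="%Y",
)

# Collect the calendar years that are actually present (NaT rows dropped)
present = set(year_col.dropna().dt.year.tolist())

# Compare against every year in the span we expect to be covered
missing_years = [y for y in range(1902, 1912) if y not in present]
print(missing_years)
```

With the real column, the range would run over the full span of the dataset instead of this shortened one.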

Group by year, count instances of author by gender

Code
aa_df.groupby('year')['gender (if known)'].value_counts()
year        gender (if known)
1900-01-01  male                  5
            female                1
1901-01-01  male                 31
            female                3
1902-01-01  male                  3
1904-01-01  male                  8
            Male                  2
1905-01-01  male                  3
1906-01-01  male                  6
            female                2
1907-01-01  male                  3
1908-01-01  male                 10
            female                1
1911-01-01  male                  5
1912-01-01  male                  3
            female                2
1913-01-01  male                  7
1914-01-01  male                  6
            female                1
1915-01-01  male                  8
            female                3
1916-01-01  female               11
            male                  8
1917-01-01  male                 11
            female               11
1918-01-01  male                  8
            female                4
1919-01-01  male                 14
            female                7
1920-01-01  male                  8
            female                7
1921-01-01  male                 30
            female                6
1922-01-01  female                8
            male                  5
1923-01-01  male                 23
            female                8
1924-01-01  female               32
            male                 26
1925-01-01  male                 57
            female               27
1926-01-01  male                 47
            female               30
1927-01-01  male                 94
            female               58
1928-01-01  male                 66
            female               33
Name: count, dtype: int64
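
Note that 1904 lists both "male" and "Male", so one category is split in two by capitalization. One possible cleanup, sketched here on toy rows that mirror the 1904 counts (the `toy` dataframe is an assumption for illustration, not the real dataset), is to lowercase the labels before counting:

```python
import pandas as pd

# Toy rows mirroring the 1904 entries above: 8 "male" and 2 "Male"
toy = pd.DataFrame({
    "year": [1904] * 10,
    "gender (if known)": ["male"] * 8 + ["Male"] * 2,
})

# Strip stray whitespace and lowercase so both spellings fall into one group
toy["gender (if known)"] = toy["gender (if known)"].str.strip().str.lower()

counts = toy.groupby("year")["gender (if known)"].value_counts()
print(counts)
```

Applying the same two string methods to `aa_df['gender (if known)']` before grouping should merge the two 1904 rows into a single count; missing values stay missing.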

Make this grouping into a dataframe

Code
aa_gender_by_year = aa_df.groupby('year')['gender (if known)'].value_counts().reset_index()

#aa_gender_by_year['year'] = aa_gender_by_year['year'].astype(int)

aa_gender_by_year
year gender (if known) count
0 1900-01-01 male 5
1 1900-01-01 female 1
2 1901-01-01 male 31
3 1901-01-01 female 3
4 1902-01-01 male 3
5 1904-01-01 male 8
6 1904-01-01 Male 2
7 1905-01-01 male 3
8 1906-01-01 male 6
9 1906-01-01 female 2
10 1907-01-01 male 3
11 1908-01-01 male 10
12 1908-01-01 female 1
13 1911-01-01 male 5
14 1912-01-01 male 3
15 1912-01-01 female 2
16 1913-01-01 male 7
17 1914-01-01 male 6
18 1914-01-01 female 1
19 1915-01-01 male 8
20 1915-01-01 female 3
21 1916-01-01 female 11
22 1916-01-01 male 8
23 1917-01-01 male 11
24 1917-01-01 female 11
25 1918-01-01 male 8
26 1918-01-01 female 4
27 1919-01-01 male 14
28 1919-01-01 female 7
29 1920-01-01 male 8
30 1920-01-01 female 7
31 1921-01-01 male 30
32 1921-01-01 female 6
33 1922-01-01 female 8
34 1922-01-01 male 5
35 1923-01-01 male 23
36 1923-01-01 female 8
37 1924-01-01 female 32
38 1924-01-01 male 26
39 1925-01-01 male 57
40 1925-01-01 female 27
41 1926-01-01 male 47
42 1926-01-01 female 30
43 1927-01-01 male 94
44 1927-01-01 female 58
45 1928-01-01 male 66
46 1928-01-01 female 33

Add missing years

Code
# Create a dataframe with a complete range of years from 1900 to 1930
years = pd.DataFrame({'year': pd.date_range(start='1900', end='1930', freq='YS')})

# Merge with existing dataframe
aa_gender_by_year = pd.merge(years, aa_gender_by_year, on='year', how='left')

# Fill NaN values
#aa_gender_by_year['gender (if known)'] = aa_gender_by_year['gender (if known)'].fillna("no author recorded")

#aa_gender_by_year['count'] = aa_gender_by_year['count'].fillna(0)


# Sort years and return unique years to check for missing data
aa_gender_by_year['year'].sort_values().unique()

aa_gender_by_year
year gender (if known) count
0 1900-01-01 male 5.0
1 1900-01-01 female 1.0
2 1901-01-01 male 31.0
3 1901-01-01 female 3.0
4 1902-01-01 male 3.0
5 1903-01-01 NaN NaN
6 1904-01-01 male 8.0
7 1904-01-01 Male 2.0
8 1905-01-01 male 3.0
9 1906-01-01 male 6.0
10 1906-01-01 female 2.0
11 1907-01-01 male 3.0
12 1908-01-01 male 10.0
13 1908-01-01 female 1.0
14 1909-01-01 NaN NaN
15 1910-01-01 NaN NaN
16 1911-01-01 male 5.0
17 1912-01-01 male 3.0
18 1912-01-01 female 2.0
19 1913-01-01 male 7.0
20 1914-01-01 male 6.0
21 1914-01-01 female 1.0
22 1915-01-01 male 8.0
23 1915-01-01 female 3.0
24 1916-01-01 female 11.0
25 1916-01-01 male 8.0
26 1917-01-01 male 11.0
27 1917-01-01 female 11.0
28 1918-01-01 male 8.0
29 1918-01-01 female 4.0
30 1919-01-01 male 14.0
31 1919-01-01 female 7.0
32 1920-01-01 male 8.0
33 1920-01-01 female 7.0
34 1921-01-01 male 30.0
35 1921-01-01 female 6.0
36 1922-01-01 female 8.0
37 1922-01-01 male 5.0
38 1923-01-01 male 23.0
39 1923-01-01 female 8.0
40 1924-01-01 female 32.0
41 1924-01-01 male 26.0
42 1925-01-01 male 57.0
43 1925-01-01 female 27.0
44 1926-01-01 male 47.0
45 1926-01-01 female 30.0
46 1927-01-01 male 94.0
47 1927-01-01 female 58.0
48 1928-01-01 male 66.0
49 1928-01-01 female 33.0
50 1929-01-01 NaN NaN
51 1930-01-01 NaN NaN
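
The merge leaves NaN in both columns for years with no poems. If explicit zeros are preferred, one alternative (a sketch under assumptions, not the approach used in this solution) is to pivot the counts to one column per gender and reindex over the full range of years. The `counts` dataframe below hard-codes the 1906 to 1908 rows from the table above, with years as plain integers and a shortened column name.

```python
import pandas as pd

# Counts for 1906-1908 copied from the table above (years as plain integers)
counts = pd.DataFrame({
    "year": [1906, 1906, 1907, 1908, 1908],
    "gender": ["male", "female", "male", "male", "female"],
    "count": [6, 2, 3, 10, 1],
})

wide = (
    counts.pivot(index="year", columns="gender", values="count")
    .reindex(range(1906, 1911))  # add rows for 1909 and 1910, which have no data
    .fillna(0)                   # treat a missing year/gender combination as zero
    .astype(int)
)
print(wide)
```

Whether zero is the right fill value is a judgment call: a zero says no poems were published that year, while NaN says only that nothing was recorded.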

Visualize with seaborn

Code
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Create the line plot
ax = sns.lineplot(data=aa_gender_by_year, x="year", y="count", hue="gender (if known)")

# Set x-axis major ticks to every 5 years
ax.xaxis.set_major_locator(mdates.YearLocator(5))

# Format the x-axis labels to show the year only
#ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

Code
plt.figure(figsize=(13.7, 8.27))

# Create the bar plot
ax = sns.barplot(
    data=aa_gender_by_year,
    # pull out just the year so each group of bars is labeled with a plain year
    x=aa_gender_by_year["year"].dt.year,
    y="count",
    hue="gender (if known)",
)

# seaborn already places one tick per year and labels it with that year,
# so rotate the existing labels instead of overwriting them with set_xticklabels
# (which raises a UserWarning when the tick positions are not fixed)
ax.tick_params(axis="x", labelrotation=45)

plt.show()
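
Matplotlib expects tick positions to be fixed with `set_xticks` before custom labels are assigned. A minimal sketch of labeling only every 5th year that way, using plain matplotlib with placeholder bar heights (the `heights` values are made up for illustration, and the `Agg` backend is chosen only so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

years = list(range(1900, 1931))
heights = [(i % 7) + 1 for i in range(len(years))]  # placeholder values, not real counts

fig, ax = plt.subplots(figsize=(13.7, 8.27))
ax.bar(range(len(years)), heights)

# Fix the tick positions first (every 5th bar), then label those positions
positions = list(range(0, len(years), 5))
ax.set_xticks(positions)
ax.set_xticklabels([str(years[i]) for i in positions], rotation=45)

labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```

Because the positions are set explicitly, each label is tied to a known bar and no warning is raised.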

Make interactive visualizations with plotly

Code
import plotly.express as px

# Create the interactive line plot with Plotly
fig = px.line(aa_gender_by_year, 
              x='year', 
              y='count', 
              color='gender (if known)',
              title='Author Gender by Year')
fig.show()

Code
# Create the interactive bar plot with Plotly
fig = px.bar(aa_gender_by_year, 
             x='year', 
             y='count', 
             color='gender (if known)',
             title='Author Gender Count by Year')
fig.show()