February 26, 2024
| | title | author (first last) | author (last name) | text | month | year | venue | edited by | form (if known) | gender (if known) | ... | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Unnamed: 29 | Unnamed: 30 | Unnamed: 31 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Neath the Crown and Maple Leaf | A.R. Abbott | Abbott | A SIGH is breathed from million heart.\nFrom S... | March | 1901.0 | Colored American | Walter W. Wallace | Elegy, Common Measure | male | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | The Friend I Love | Ada M. Wright | Wright | I LOVE a friend whose cheering voice\nCan soot... | February | 1901.0 | Colored American | Walter W. Wallace | NaN | female | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Black Madonna | Albert Rice | Rice | Not as the white nations\n know thee\n ... | October | 1926.0 | Palms | Countee Cullen | NaN | male | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | The Baby | Alexander Seymour | Seymour | Sweetest little t'ing on earth,\nNo one knows ... | March | 1927.0 | The Messenger | A. Philip Randolph, Chandler Owen, George S. S... | NaN | male | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | April Is On The Way | Alice Dunbar Nelson | Dunbar-Nelson | April is on the way! \n \nI saw the scarlet f... | December | 1927.0 | Ebony and Topaz | Charles S. Johnson | NaN | female | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 32 columns
<DatetimeArray>
['1900-01-01 00:00:00', '1901-01-01 00:00:00', '1902-01-01 00:00:00',
'1904-01-01 00:00:00', '1905-01-01 00:00:00', '1906-01-01 00:00:00',
'1907-01-01 00:00:00', '1908-01-01 00:00:00', '1911-01-01 00:00:00',
'1912-01-01 00:00:00', '1913-01-01 00:00:00', '1914-01-01 00:00:00',
'1915-01-01 00:00:00', '1916-01-01 00:00:00', '1917-01-01 00:00:00',
'1918-01-01 00:00:00', '1919-01-01 00:00:00', '1920-01-01 00:00:00',
'1921-01-01 00:00:00', '1922-01-01 00:00:00', '1923-01-01 00:00:00',
'1924-01-01 00:00:00', '1925-01-01 00:00:00', '1926-01-01 00:00:00',
'1927-01-01 00:00:00', '1928-01-01 00:00:00', 'NaT']
Length: 27, dtype: datetime64[ns]
The array skips 1903, 1909, and 1910, and the trailing `NaT` shows that some rows have no year recorded. We will add the missing years back in shortly.
year gender (if known)
1900-01-01 male 5
female 1
1901-01-01 male 31
female 3
1902-01-01 male 3
1904-01-01 male 8
Male 2
1905-01-01 male 3
1906-01-01 male 6
female 2
1907-01-01 male 3
1908-01-01 male 10
female 1
1911-01-01 male 5
1912-01-01 male 3
female 2
1913-01-01 male 7
1914-01-01 male 6
female 1
1915-01-01 male 8
female 3
1916-01-01 female 11
male 8
1917-01-01 male 11
female 11
1918-01-01 male 8
female 4
1919-01-01 male 14
female 7
1920-01-01 male 8
female 7
1921-01-01 male 30
female 6
1922-01-01 female 8
male 5
1923-01-01 male 23
female 8
1924-01-01 female 32
male 26
1925-01-01 male 57
female 27
1926-01-01 male 47
female 30
1927-01-01 male 94
female 58
1928-01-01 male 66
female 33
Name: count, dtype: int64
| | year | gender (if known) | count |
---|---|---|---|
0 | 1900-01-01 | male | 5 |
1 | 1900-01-01 | female | 1 |
2 | 1901-01-01 | male | 31 |
3 | 1901-01-01 | female | 3 |
4 | 1902-01-01 | male | 3 |
5 | 1904-01-01 | male | 8 |
6 | 1904-01-01 | Male | 2 |
7 | 1905-01-01 | male | 3 |
8 | 1906-01-01 | male | 6 |
9 | 1906-01-01 | female | 2 |
10 | 1907-01-01 | male | 3 |
11 | 1908-01-01 | male | 10 |
12 | 1908-01-01 | female | 1 |
13 | 1911-01-01 | male | 5 |
14 | 1912-01-01 | male | 3 |
15 | 1912-01-01 | female | 2 |
16 | 1913-01-01 | male | 7 |
17 | 1914-01-01 | male | 6 |
18 | 1914-01-01 | female | 1 |
19 | 1915-01-01 | male | 8 |
20 | 1915-01-01 | female | 3 |
21 | 1916-01-01 | female | 11 |
22 | 1916-01-01 | male | 8 |
23 | 1917-01-01 | male | 11 |
24 | 1917-01-01 | female | 11 |
25 | 1918-01-01 | male | 8 |
26 | 1918-01-01 | female | 4 |
27 | 1919-01-01 | male | 14 |
28 | 1919-01-01 | female | 7 |
29 | 1920-01-01 | male | 8 |
30 | 1920-01-01 | female | 7 |
31 | 1921-01-01 | male | 30 |
32 | 1921-01-01 | female | 6 |
33 | 1922-01-01 | female | 8 |
34 | 1922-01-01 | male | 5 |
35 | 1923-01-01 | male | 23 |
36 | 1923-01-01 | female | 8 |
37 | 1924-01-01 | female | 32 |
38 | 1924-01-01 | male | 26 |
39 | 1925-01-01 | male | 57 |
40 | 1925-01-01 | female | 27 |
41 | 1926-01-01 | male | 47 |
42 | 1926-01-01 | female | 30 |
43 | 1927-01-01 | male | 94 |
44 | 1927-01-01 | female | 58 |
45 | 1928-01-01 | male | 66 |
46 | 1928-01-01 | female | 33 |
| | year | gender (if known) | count |
---|---|---|---|
0 | 1900-01-01 | male | 5.0 |
1 | 1900-01-01 | female | 1.0 |
2 | 1901-01-01 | male | 31.0 |
3 | 1901-01-01 | female | 3.0 |
4 | 1902-01-01 | male | 3.0 |
5 | 1903-01-01 | NaN | NaN |
6 | 1904-01-01 | male | 8.0 |
7 | 1904-01-01 | Male | 2.0 |
8 | 1905-01-01 | male | 3.0 |
9 | 1906-01-01 | male | 6.0 |
10 | 1906-01-01 | female | 2.0 |
11 | 1907-01-01 | male | 3.0 |
12 | 1908-01-01 | male | 10.0 |
13 | 1908-01-01 | female | 1.0 |
14 | 1909-01-01 | NaN | NaN |
15 | 1910-01-01 | NaN | NaN |
16 | 1911-01-01 | male | 5.0 |
17 | 1912-01-01 | male | 3.0 |
18 | 1912-01-01 | female | 2.0 |
19 | 1913-01-01 | male | 7.0 |
20 | 1914-01-01 | male | 6.0 |
21 | 1914-01-01 | female | 1.0 |
22 | 1915-01-01 | male | 8.0 |
23 | 1915-01-01 | female | 3.0 |
24 | 1916-01-01 | female | 11.0 |
25 | 1916-01-01 | male | 8.0 |
26 | 1917-01-01 | male | 11.0 |
27 | 1917-01-01 | female | 11.0 |
28 | 1918-01-01 | male | 8.0 |
29 | 1918-01-01 | female | 4.0 |
30 | 1919-01-01 | male | 14.0 |
31 | 1919-01-01 | female | 7.0 |
32 | 1920-01-01 | male | 8.0 |
33 | 1920-01-01 | female | 7.0 |
34 | 1921-01-01 | male | 30.0 |
35 | 1921-01-01 | female | 6.0 |
36 | 1922-01-01 | female | 8.0 |
37 | 1922-01-01 | male | 5.0 |
38 | 1923-01-01 | male | 23.0 |
39 | 1923-01-01 | female | 8.0 |
40 | 1924-01-01 | female | 32.0 |
41 | 1924-01-01 | male | 26.0 |
42 | 1925-01-01 | male | 57.0 |
43 | 1925-01-01 | female | 27.0 |
44 | 1926-01-01 | male | 47.0 |
45 | 1926-01-01 | female | 30.0 |
46 | 1927-01-01 | male | 94.0 |
47 | 1927-01-01 | female | 58.0 |
48 | 1928-01-01 | male | 66.0 |
49 | 1928-01-01 | female | 33.0 |
50 | 1929-01-01 | NaN | NaN |
51 | 1930-01-01 | NaN | NaN |
---
title: "Visualize Author Gender By Year (Solution)"
date: "2024-02-26"
categories: [seaborn, plotly, interactive, line-plot, bar-plot, groupby, python, solution]
toc: true
format:
  html: default
  ipynb: default
code-overflow: wrap
code-fold: show
editor: visual
df-print: kable
R.options:
  warn: false
code-tools: true
execute:
  eval: true
---
# Import pandas
```{python}
import pandas as pd
```
# Read in CSV
```{python}
aa_df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/aa-periodical-poetry/AAPADA-Periodical-Poetry_1900-1928.csv")
```
# View first 5 rows
```{python}
aa_df.head()
```
# Convert year to datetime value
```{python}
aa_df['year'] = pd.to_datetime(aa_df['year'], format = "%Y")
#aa_df.set_index('year').reindex(pd.date_range('1900-01-01', '1930-12-31', freq='Y'))
# Sort years and return unique years to check for missing data
aa_df['year'].sort_values().unique()
```
The output skips 1903, 1909, and 1910, and the trailing `NaT` shows that some rows have no year recorded. We will add the missing years back in shortly.
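Rather than scanning the array by eye, the gaps can be computed. The sketch below is one possible approach, not part of the original notebook: `missing_years` is a hypothetical helper, and the list of observed years is typed in by hand from the output above instead of being read from `aa_df`.

```python
import pandas as pd

# Years that appear in the unique-years output above, typed in by hand (assumption:
# in the notebook these would come from aa_df['year'].dt.year instead)
observed = [1900, 1901, 1902, 1904, 1905, 1906, 1907, 1908] + list(range(1911, 1929))
observed_years = pd.Series(observed)


def missing_years(years, start, end):
    """Hypothetical helper: return the years in [start, end] that never occur in `years`."""
    full_range = pd.Index(range(start, end + 1))
    seen = pd.Index(years.dropna().astype(int))
    return full_range.difference(seen).tolist()


missing = missing_years(observed_years, 1900, 1928)
print(missing)
```

The `dropna()` call means rows with no recorded year are ignored instead of being counted as a gap.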
# Group by year, count instances of author by gender
```{python}
aa_df.groupby('year')['gender (if known)'].value_counts()
```
# Make this grouping into a dataframe
```{python}
aa_gender_by_year = aa_df.groupby('year')['gender (if known)'].value_counts().reset_index()
#aa_gender_by_year['year'] = aa_gender_by_year['year'].astype(int)
aa_gender_by_year
```
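The 1904 counts list `Male` separately from `male`, so the same category is split across two rows. A minimal sketch of one way to merge such labels before counting is shown here; it runs on a small made-up sample, not the real dataset, and assumes that differences in capitalization and stray whitespace are the only inconsistencies.

```python
import pandas as pd

# Made-up sample that mimics the inconsistent capitalization seen for 1904
sample = pd.DataFrame({
    "year": [1904, 1904, 1904, 1906],
    "gender (if known)": ["male", "Male", "male", "female"],
})

# Trim whitespace and lowercase the labels so "Male" and "male" count as one category
sample["gender (if known)"] = sample["gender (if known)"].str.strip().str.lower()

# Same grouping as in the notebook; name="count" keeps the column name explicit
counts = (
    sample.groupby("year")["gender (if known)"]
    .value_counts()
    .reset_index(name="count")
)
print(counts)
```

In the notebook, the same normalization line could be applied to `aa_df` before the `groupby` call, though that changes the counts shown in the output.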
# Add missing years
```{python}
# Create a dataframe with a complete range of years from 1900 to 1930
years = pd.DataFrame({'year': pd.date_range(start='1900', end='1930', freq='YS')})
# Merge with existing dataframe
aa_gender_by_year = pd.merge(years, aa_gender_by_year, on='year', how='left')
# Fill NaN values
#aa_gender_by_year['gender (if known)'] = aa_gender_by_year['gender (if known)'].fillna("no author recorded")
#aa_gender_by_year['count'] = aa_gender_by_year['count'].fillna(0)
# Sort years and return unique years to check for missing data
aa_gender_by_year['year'].sort_values().unique()
aa_gender_by_year
```
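The left merge leaves `NaN` in both `gender (if known)` and `count` for years with no poems. If explicit zeros are preferred, one alternative (a sketch on made-up counts, not the notebook's method) is to reindex against every year and gender combination. The gender list and the three-year range here are assumptions chosen to keep the example small.

```python
import pandas as pd

# Complete range of years for this small example (the notebook uses 1900 to 1930)
all_years = pd.date_range(start="1902", end="1904", freq="YS")

# Made-up counts with 1903 missing entirely and no female row for 1902
counts = pd.DataFrame({
    "year": all_years[[0, 2, 2]],
    "gender (if known)": ["male", "male", "female"],
    "count": [3, 10, 1],
})

# Every (year, gender) pair that should appear in the result
genders = ["male", "female"]
grid = pd.MultiIndex.from_product(
    [all_years, genders], names=["year", "gender (if known)"]
)

# Reindex onto the full grid; pairs with no data get a count of 0 instead of NaN
complete = (
    counts.set_index(["year", "gender (if known)"])
    .reindex(grid, fill_value=0)
    .reset_index()
)
print(complete)
```

With zeros in place, a line plot drops to the axis in empty years instead of bridging the gap, which may or may not be the desired reading of the data.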
# Visualize with seaborn
```{python}
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# Create the line plot
ax = sns.lineplot(data=aa_gender_by_year, x="year", y="count", hue="gender (if known)")
# Set x-axis major ticks to every 5 years
ax.xaxis.set_major_locator(mdates.YearLocator(5))
# Format the x-axis labels to show the year only
#ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
```
```{python}
plt.figure(figsize=(13.7, 8.27))
# Create the bar plot
ax = sns.barplot(data=aa_gender_by_year,
                 # pull out just the year
                 x=aa_gender_by_year["year"].dt.year, y="count", hue="gender (if known)")
# seaborn treats the years as categories and already labels each tick with its year,
# so we only rotate the labels; relabelling with set_xticklabels() on a non-fixed
# locator raises a UserWarning and risks misaligned labels
ax.tick_params(axis="x", labelrotation=45)
plt.show()
```
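If the year axis gets crowded, one option is to label only every fifth year and leave the other ticks blank. The sketch below only builds the label list; it assumes the plot has one tick per year from 1900 to 1930, in order, which should be checked against the actual axis before applying it.

```python
# One entry per year on the axis (assumption: ticks run 1900..1930 in order)
years = list(range(1900, 1931))

# Keep the label for years divisible by 5, blank out the rest
labels = [str(year) if year % 5 == 0 else "" for year in years]

print(labels[:6])
```

The labels could then be attached together with fixed tick positions, for example `ax.set_xticks(range(len(labels)), labels=labels)` in recent matplotlib versions, so that positions and labels stay paired.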
# Make interactive visualizations with plotly
``` {python}
import plotly.express as px
# Create the interactive line plot with Plotly
fig = px.line(aa_gender_by_year,
              x='year',
              y='count',
              color='gender (if known)',
              title='Author Gender by Year')
fig.show()
```
``` {python}
# Create the interactive bar plot with Plotly
fig = px.bar(aa_gender_by_year,
             x='year',
             y='count',
             color='gender (if known)',
             title='Gender Count by Year')
fig.show()
```