Twitter Data Collection & Analysis#
In this lesson, we’re going to learn how to analyze and explore Twitter data with the Python/command line tool twarc. We’re specifically going to work with twarc2, which is designed for version 2 of the Twitter API (released in 2020) and the Academic Research track of the Twitter API (released in 2021), which enables researchers to collect tweets from the entire Twitter archive for free.
Twarc was developed by a project called Documenting the Now. The DocNow team develops tools and ethical frameworks for social media research.
Dataset#
[David Foster Wallace]…has become lit-bro shorthand…Make a passing reference to the “David Foster Wallace fanboy” and you can assume the reader knows whom you’re talking about.
—Molly Fischer, “David Foster Wallace, Beloved Author of Bros”
Source: Giovanni Giovanetti, NYT
The Twitter conversation that we’re going to explore in this lesson is related to “Wallace bros” — fans of the author David Foster Wallace who are often described as “bros” or, more pointedly, “David Foster Wallace bros.”
For example, in Slate in 2015, Molly Fischer argued that David Foster Wallace’s writing — most famously his novel Infinite Jest — tended to attract a fan base of chauvinistic and misogynistic young men. But other people have defended Wallace’s fans and the author against such charges. What is a “David Foster Wallace bro”? Was DFW himself a “bro”? Who is using this phrase, how often are they using it, and why? We’re going to track this phrase and explore the varied viewpoints in this cultural conversation by analyzing tweets that mention “David Foster Wallace bro.”
Search Queries & Privacy Concerns#
To collect tweets from the Twitter API, we need to make queries, or requests for specific kinds of tweets (e.g., `twarc2 search` followed by the query). The simplest kind of query is a keyword search, such as the phrase “David Foster Wallace bro,” which should return any tweet that contains all of these words in any order: `twarc2 search "David Foster Wallace bro"`.
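Below is a minimal sketch of what these searches might look like in practice; the output filenames are arbitrary placeholders. Note that an exact phrase has to be wrapped in escaped quotes inside the query so that Twitter treats it as a single phrase rather than as separate keywords.

```
# Keyword search: returns tweets containing all of these words, in any order
# (by default twarc2 search only covers roughly the last week of tweets;
# with Academic Research access, the --archive flag searches the full archive)
!twarc2 search "David Foster Wallace bro" > twitter-data/dfw-bro-keywords.jsonl

# Exact phrase search: the escaped inner quotes make Twitter match the phrase as a unit
!twarc2 search "\"David Foster Wallace bro\"" > twitter-data/dfw-bro-phrase.jsonl
```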
There are many other operators that we can add to a query, which would allow us to collect tweets only from specific Twitter users or locations, or to only collect tweets that meet certain conditions, such as containing an image or being authored by a verified Twitter user. Here’s an excerpted table of search operators taken from Twitter’s documentation about how to build a search query. There are many other operators beyond those included in this table, and I recommend reading through Twitter’s entire web page on this subject.
| Search Operator | Explanation |
|---|---|
| keyword | Matches a keyword within the body of a Tweet. |
| “exact phrase match” | Matches the exact phrase within the body of a Tweet. |
| - | Does NOT match a keyword or operator. |
| # | Matches any Tweet containing a recognized hashtag. |
| from:, to: | Matches any Tweet from or to a specific user. |
| place: | Matches Tweets tagged with the specified location or Twitter place ID. |
| is:reply, is:quote | Returns only replies or quote tweets. |
| is:verified | Returns only Tweets whose authors are verified by Twitter. |
| has:media | Matches Tweets that contain a media object, such as a photo, GIF, or video, as determined by Twitter. |
| has:images, has:videos | Matches Tweets that contain a recognized URL to an image (has:images) or a native Twitter video (has:videos). |
| has:geo | Matches Tweets that have Tweet-specific geolocation data provided by the Twitter user. |
In this lesson, we will only be collecting tweets that were tweeted by verified users: `"David Foster Wallace bro is:verified"`.
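To give a sense of how these operators combine, here are a few illustrative query strings. The lines beginning with # are annotations rather than parts of the queries, and the username in the last example is a placeholder, not a real account we'll be collecting from.

```
# All the keywords, restricted to verified users (the query used in this lesson)
David Foster Wallace bro is:verified

# The exact phrase "David Foster Wallace", excluding any tweet that also contains "bro"
"David Foster Wallace" -bro

# All the keywords, but only in tweets that contain a photo, GIF, or video
David Foster Wallace bro has:media

# All the keywords, but only in tweets sent by a particular (placeholder) account
David Foster Wallace bro from:SomeUserName
```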
As I discussed in “Users’ Data: Legal & Ethical Considerations,” collecting publicly available tweets is legal, but it still raises a lot of privacy concerns and ethical quandaries, particularly when you re-publish users’ data, as I am in this lesson. To reduce potential harm to Twitter users when re-publishing or citing tweets, it can be helpful to ask for explicit permission from the authors or to focus on tweets that have already been reasonably exposed to the public (e.g., tweets with many retweets or tweets from verified users), such that re-publishing the content will not unduly increase risk to the user.
Install and Import Libraries#
Because twarc relies on Twitter’s API, we need to apply for a Twitter developer account and create a Twitter application before we can use it. You can find instructions for the application process in “Twitter API Set Up.”
If you haven’t done so already, you need to install twarc and configure twarc with your bearer token and/or API keys.
#!pip install twarc
#!twarc2 configure
To make an interactive plot, we’re also going to install the package plotly.
!pip install plotly
Then we’re going to import plotly as well as pandas, and adjust a couple of pandas display settings:
import plotly.express as px
import pandas as pd
pd.options.display.max_colwidth = 400
pd.options.display.max_columns = 90
Get Tweet Counts#
The first thing we’re going to do is retrieve “tweet counts” — that is, retrieve the number of tweets that included the phrase “David Foster Wallace bro” each day in Twitter’s history.
The tweet counts API endpoint is a convenient feature of the v2 API (first introduced in 2021) that allows us to get a sense of how many tweets a given query will return before we actually collect them. We won’t get the text of the tweets, the users who tweeted them, or any other metadata; we will simply get the number of tweets that match the query. This is helpful because we might discover, for example, that a broad query like “Wallace” matches far too many tweets, which would encourage us to narrow our search by modifying the query.
The tweet counts API endpoint is perhaps even more useful for research projects that are primarily interested in tracking the volume of a Twitter conversation over time. In this case, tweet counts enable a researcher to retrieve this information in a way that’s faster and easier than retrieving all tweets and relevant metadata.
To get tweet counts from Twitter’s entire history with twarc2, we will use `twarc2 counts` followed by a search query. We will also use the flag `--csv` because we want to output the data as a CSV, the flag `--archive` because we’re working with the Academic Research track of the Twitter API and want access to the full archive, and the flag `--granularity day` to get tweet counts per day (other options include `hour` and `minute`; you can see more in twarc’s documentation). Finally, we write the data to a CSV file.
!twarc2 counts "David Foster Wallace bro is:verified" --csv --archive --granularity day > twitter-data/tweet-counts.csv
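If you would rather stay entirely in Python, the twarc library also exposes this endpoint through its Twarc2 client. The sketch below is one possible way to call it, assuming you have your bearer token handy; the exact method names and response fields may differ slightly between twarc versions, so treat it as a starting point rather than a recipe.

```python
from twarc.client2 import Twarc2

# Substitute your own credentials (the value below is a placeholder)
client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

# Request per-day tweet counts for the same query from the full archive
for page in client.counts_all("David Foster Wallace bro is:verified", granularity="day"):
    # Each response page should include a "data" list with one entry per day
    for count in page["data"]:
        print(count["start"], count["tweet_count"])
```

For the rest of this lesson, though, we’ll stick with the CSV file produced by the `twarc2 counts` command above.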
We can read in this CSV file with pandas, parse the date columns, and sort from earliest to latest. The code below is largely borrowed from Ed Summers. Thanks, Ed!
Pandas Review
Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!
# Code borrowed from Ed Summers
# https://github.com/edsu/notebooks/blob/master/Black%20Lives%20Matter%20Counts.ipynb
# Read in CSV as DataFrame
tweet_counts_df = pd.read_csv('twitter-data/tweet-counts.csv', parse_dates=['start', 'end'])
# Sort values by earliest date
tweet_counts_df = tweet_counts_df.sort_values('start')
tweet_counts_df
| | start | end | day_count |
|---|---|---|---|
| 5735 | 2006-03-21 00:00:00+00:00 | 2006-03-22 00:00:00+00:00 | 0 |
| 5736 | 2006-03-22 00:00:00+00:00 | 2006-03-23 00:00:00+00:00 | 0 |
| 5737 | 2006-03-23 00:00:00+00:00 | 2006-03-24 00:00:00+00:00 | 0 |
| 5738 | 2006-03-24 00:00:00+00:00 | 2006-03-25 00:00:00+00:00 | 0 |
| 5739 | 2006-03-25 00:00:00+00:00 | 2006-03-26 00:00:00+00:00 | 0 |
| ... | ... | ... | ... |
| 26 | 2021-12-19 00:00:00+00:00 | 2021-12-20 00:00:00+00:00 | 0 |
| 27 | 2021-12-20 00:00:00+00:00 | 2021-12-21 00:00:00+00:00 | 0 |
| 28 | 2021-12-21 00:00:00+00:00 | 2021-12-22 00:00:00+00:00 | 0 |
| 29 | 2021-12-22 00:00:00+00:00 | 2021-12-23 00:00:00+00:00 | 0 |
| 30 | 2021-12-23 00:00:00+00:00 | 2021-12-23 14:32:00+00:00 | 0 |
5757 rows × 3 columns
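Before plotting, it can be helpful to pull a couple of quick summary numbers out of this DataFrame, such as the total number of matching tweets and the single busiest day. A minimal sketch with pandas:

```python
# Total number of matching tweets across the whole archive
total_tweets = tweet_counts_df['day_count'].sum()
print(f"Total tweets: {total_tweets}")

# The row for the day with the most matching tweets
busiest_day = tweet_counts_df.loc[tweet_counts_df['day_count'].idxmax()]
print(f"Busiest day: {busiest_day['start'].date()} ({busiest_day['day_count']} tweets)")
```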
Then we can make a quick plot of tweets per day with plotly:
# Code borrowed from Ed Summers
# https://github.com/edsu/notebooks/blob/master/Black%20Lives%20Matter%20Counts.ipynb
# Make a line plot from the DataFrame and specify x and y axes, axes titles, and plot title
figure = px.line(tweet_counts_df, x='start', y='day_count',
                 labels={'start': 'Time', 'day_count': 'Tweets per Day'},
                 title='DFW Bro Tweets')
figure.show()
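If you want to share the interactive plot outside of the notebook, plotly can also save it as a standalone HTML file that opens in any web browser (the filename here is just an example):

```python
# Save the interactive plot as a standalone HTML file
figure.write_html('twitter-data/dfw-bro-tweets-per-day.html')
```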