# Twitter Data Sharing

In this lesson, we're going to learn how to share Twitter data and access data shared by others with the Python/command line tool [twarc](https://twarc-project.readthedocs.io/en/latest/). We're specifically going to work with [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/), which is designed for version 2 of the Twitter API (released in 2020) and the Academic Research track of the Twitter API (released in 2021).

Twarc was developed by a project called [Documenting the Now](https://www.docnow.io/). The DocNow team develops tools and ethical frameworks for social media research.

*This lesson presumes that you've already installed and configured twarc, which was covered in [a previous lesson](Twitter-API-Setup).*

## Tweet IDs

Twitter discourages developers and researchers from sharing full Twitter data openly on the web. They instead encourage developers and researchers to share *tweet IDs*:

> [If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute **Tweet IDs**, Direct Message IDs, and/or User IDs.](https://developer.twitter.com/en/developer-terms/policy#4-e)

Tweet IDs are unique identifiers assigned to every tweet. They look like a random string of numbers: 1189206626135355397. Each tweet ID can be used to download the full data associated with that tweet (if the tweet still exists). This is a process called "hydration."

<img src="https://cdn.pixabay.com/photo/2013/07/12/19/24/sapling-154734_960_720.png" width=100% >

**Hydration: a young tweet ID sprouts into a full tweet (to be read in David Attenborough's voice)**

There are actually two reasons that you might want to dehydrate tweets and/or hydrate tweet IDs: first, to responsibly share Twitter data with others and/or access Twitter data shared by others; second, to get more information about the Twitter data that you yourself collected.

If you collected tweets in real time, for example, you captured those tweets immediately after they were published, which means that their retweet and favorite counts will be zero or close to it. Nobody's had time to retweet them yet! So if you'd like up-to-date retweet and favorite counts for your tweets, you can dehydrate and then rehydrate them.

## Dehydrate Tweets

`twarc2 dehydrate tweets.jsonl > tweet_ids.txt`

To transform your Twitter data into a list of tweet IDs (so that you can share your data openly on the web), you can run the twarc command `twarc2 dehydrate` with the name of your JSONL file, followed by the output redirection operator `>` and the desired name of your tweet ID text file.

> tweet ID —> tweet = hydration <br>
> tweet ID <— tweet = dehydration
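To make the idea of dehydration concrete, here is a minimal Python sketch of what the step amounts to: reading JSONL and keeping only each tweet's `id`. It is an illustration, not twarc's own implementation. The function name `extract_tweet_ids` and the sample text are hypothetical, and the sketch assumes each line is either an API response with a `data` list or a single tweet object.

```python
import json

def extract_tweet_ids(jsonl_lines):
    """Yield the ID of every tweet found in twarc2-style JSONL lines."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: a line is usually one API response whose "data" key
        # holds a list of tweets; a flattened file holds one tweet per line.
        tweets = record.get("data", record)
        if isinstance(tweets, dict):
            tweets = [tweets]
        for tweet in tweets:
            yield tweet["id"]

# Tiny in-memory example; the tweet text is made up for illustration.
sample_lines = [
    '{"data": [{"id": "1412360883582418945", "text": "first"}, '
    '{"id": "1408833479304003589", "text": "second"}]}',
    '{"id": "1313590839977947136", "text": "third"}',
]
print(list(extract_tweet_ids(sample_lines)))
```

With a real file, you could pass an open file handle in place of `sample_lines`, since iterating over a file yields one line at a time.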

Let's dehydrate the Twitter data that we collected about "Infinite Jest" from only verified Twitter accounts.

`!twarc2 dehydrate twitter-data/dfw_bro.jsonl > twitter-data/dfw_bro.txt`

If we `open()` and `.read()` the tweet IDs file that we just created, it looks something like this:

`tweet_ids = open("twitter-data/dfw_bro.txt", encoding="utf-8").read()`

`print(tweet_ids)`

1412360883582418945
1408833479304003589
1313590839977947136
1313523086424186881
1298060678432010240
1297973554986639360
1296466668898717696
1259920233600479233
1075778924280471553
995487617549512704
898246334804787202
844552567522820096
760889024403873792
733665498148294657
694224894805032964
686753187727044609
644308060937265152
644307992901517312
644307901830561793
632152536636608512
631931489769390080
631828145105162240
631825643441901568
631825543202213889
631820349290774528
545020559248343040
270661455727165442
146330510422048768
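Since the file holds one ID per line, it is often handier to work with a list than with one long string. The sketch below shows one way to do that; `parse_tweet_ids` is a hypothetical helper, and the sample string stands in for the contents of the file.

```python
def parse_tweet_ids(text):
    """Split the contents of a tweet ID file into a list of ID strings."""
    # Keep IDs as strings so they match how IDs appear in the tweet JSON.
    return [line.strip() for line in text.splitlines() if line.strip()]

# Stand-in for the contents of a tweet ID file (two IDs and a blank line).
sample_text = "1412360883582418945\n1408833479304003589\n\n"

ids = parse_tweet_ids(sample_text)
print(len(ids), "tweet IDs")
print("all unique:", len(set(ids)) == len(ids))
```

In the notebook, calling `parse_tweet_ids(tweet_ids)` on the string read above should return one entry per line of the file.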



## Hydrate Tweets

`twarc2 hydrate tweet_ids.txt > tweets.jsonl`

To transform a list of tweet IDs into full Twitter data, you can run the twarc command `twarc2 hydrate` with the name of your tweet IDs text file, followed by the output redirection operator `>` and the desired name of your JSONL file.

> tweet ID —> tweet = hydration <br>
> tweet ID <— tweet = dehydration
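Behind the scenes, hydration means asking the Twitter API for tweets by ID. The version 2 tweet lookup endpoint accepts up to 100 IDs per request, so a long ID list has to be sent in batches. twarc2 takes care of this for you; the sketch below only illustrates the batching idea and is not twarc's actual code. The name `batch_ids` and the placeholder IDs are hypothetical.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield successive lists of at most batch_size tweet IDs."""
    batch = []
    for tweet_id in tweet_ids:
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch

# Placeholder IDs, used only to show how 250 IDs would be grouped.
example_ids = [str(n) for n in range(250)]
print([len(batch) for batch in batch_ids(example_ids)])  # → [100, 100, 50]
```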

Now let's rehydrate the Twitter data that we collected a few weeks ago, using the list of tweet IDs that we just created.

`!twarc2 hydrate twitter-data/dfw_bro.txt > twitter-data/dfw_bro_REHYDRATED.jsonl`

`tweet_json = open("twitter-data/dfw_bro_REHYDRATED.jsonl", encoding="utf-8").read()`

`print(tweet_json)`

{"data": [{"context_annotations": [{"domain": {"id": "46", "name": "Brand Category", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "781974596752842752", "name": "Services"}}, {"domain": {"id": "47", "name": "Brand", "description": "Brands and Companies"}, "entity": {"id": "10045225402", "name": "Twitter"}}], "id": "1412360883582418945", "reply_settings": "everyone", "entities": {"annotations": [{"start": 90, "end": 109, "probability": 0.89, "type": "Person", "normalized_text": "david foster wallace"}], "mentions": [{"start": 0, "end": 10, "username": "TomKealy2", "id": "82979289"}, {"start": 11, "end": 19, "username": "aveek18", "id": "525812389"}, {"start": 20, "end": 32, "username": "maybeavalon", "id": "108463534"}]}, "referenced_tweets": [{"type": "replied_to", "id": "1412360683237355524"}], "author_id": "342644076", "text": "@TomKealy2 @aveek18 @maybeavalon Imo the great twitter literature injustice is that it is david f
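Raw JSONL is hard to read, so it can help to parse each line and pull out just the fields you care about. The sketch below assumes that each line is an API response with a `data` list, as in the output above, and that tweets carry the Twitter API v2 `public_metrics` field, which holds `retweet_count` and `like_count`. The helper names and the sample values are hypothetical.

```python
import json

def iter_tweets(jsonl_lines):
    """Yield individual tweet dictionaries from twarc2-style JSONL lines."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        response = json.loads(line)
        data = response.get("data", [])
        if isinstance(data, dict):
            data = [data]  # a response may hold a single tweet
        yield from data

def summarize(tweet):
    """Return a small dictionary with the ID, text, and retweet count."""
    # Assumption: "public_metrics" is present; fall back to None if not.
    metrics = tweet.get("public_metrics", {})
    return {
        "id": tweet["id"],
        "text": tweet.get("text", ""),
        "retweets": metrics.get("retweet_count"),
    }

# Made-up sample response; the text and metric values are illustrative only.
sample_lines = [
    json.dumps({"data": [
        {"id": "1412360883582418945", "text": "example text",
         "public_metrics": {"retweet_count": 3, "like_count": 10}},
        {"id": "1408833479304003589", "text": "no metrics here"},
    ]}),
]

for tweet in iter_tweets(sample_lines):
    print(summarize(tweet))
```

With the rehydrated file, you could pass an open file handle to `iter_tweets` instead of `sample_lines`.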

## Deleted Tweets & The Right To Be Forgotten

What happens if someone decides to delete their tweet between the time when the tweet is first collected and the time when the tweet is "hydrated"? The deleted tweet will **not** be hydrated, because it is no longer accessible.
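One way to see how many tweets have disappeared is to compare the IDs you asked for with the IDs that came back. The sketch below does that comparison; `find_missing_ids` is a hypothetical helper and the hydrated list is made up. A missing ID does not necessarily mean a deleted tweet, since tweets from suspended or protected accounts are also unavailable.

```python
def find_missing_ids(requested_ids, hydrated_ids):
    """Return the requested IDs that are absent from the hydrated results."""
    hydrated = set(hydrated_ids)
    # Preserve the original order of the requested IDs.
    return [tweet_id for tweet_id in requested_ids if tweet_id not in hydrated]

# Illustrative data: three requested IDs, two of which came back.
requested = ["1412360883582418945", "1408833479304003589", "1313590839977947136"]
hydrated = ["1412360883582418945", "1313590839977947136"]

print(find_missing_ids(requested, hydrated))  # → ['1408833479304003589']
```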

## Where to Find Tweet IDs

You can find repositories of tweet IDs that have been shared by other researchers in the following places:

- DocNow Catalog: https://catalog.docnow.io/

- George Washington University Tweet IDs: https://dataverse.harvard.edu/dataverse/gwu-libraries