TF-IDF with HathiTrust Data#

In this lesson, we’re going to learn about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf.

While calculating the most frequent words in a text can be useful, the most frequent words usually aren’t the most interesting ones, even if we get rid of stop words (“the,” “and,” “to,” etc.). Tf-idf is a method that builds on word frequency, but it more specifically tries to identify the most distinctively frequent or significant words in a document.

In this lesson, we will cover how to:

  • Calculate and normalize tf-idf scores for each short story in Edward P. Jones’s Lost in the City

  • Download and process HathiTrust extracted features — that is, word frequencies for books in the HathiTrust Digital Library (including in-copyright books like Lost in the City)

  • Prepare HathiTrust extracted features for tf-idf analysis

Dataset#

Lost in the City by Edward P. Jones#

[T]he pigeon had taken a step and dropped from the ledge. He caught an upwind that took him nearly as high as the tops of the empty K Street houses. He flew farther into Northeast, into the color and sounds of the city’s morning. She did nothing, aside from following him, with her eyes, with her heart, as far as she could.

—Edward P. Jones, "The Girl Who Raised Pigeons," Lost in the City (1993)

Edward P. Jones’s Lost in the City (1993) is a collection of 14 short stories set in Washington D.C. The first short story, “The Girl Who Raised Pigeons,” begins with a young girl raising homing pigeons on her roof.

How distinctive is a “pigeon” in the world of Lost in the City? What does this uniqueness (or lack thereof) tell us about the meaning of pigeons in the first short story, “The Girl Who Raised Pigeons,” and in the collection as a whole? These are the kinds of questions that we’re going to try to answer with tf-idf.

If you already have a collection of plain text (.txt) files that you’d like to analyze, one of the easiest ways to calculate tf-idf scores is to use the Python library scikit-learn. It has a quick and nifty module called TfidfVectorizer, which does all the math for you behind the scenes. We will cover how to use the TfidfVectorizer in the next lesson.

In this lesson, however, we’re going to calculate tf-idf scores manually because Lost in the City is still in-copyright, which means that, for legal reasons, we can’t easily share or access plain text files of the book.

Luckily, the HathiTrust Digital Library—which contains digitized books from Google Books as well as many university libraries—has released word frequencies per page for all 17 million books in its catalog. These word frequencies (plus part of speech tags) are otherwise known as “extracted features.” There’s a lot of text analysis that we can do with extracted features alone, including tf-idf.

So to calculate tf-idf scores for Lost in the City, we’re going to use HathiTrust extracted features. That’s why we’re not using scikit-learn’s TfidfVectorizer: it works great with plain text files but not so great with extracted features.

Breaking Down the TF-IDF Formula#

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

tf-idf = term_frequency * inverse_document_frequency

term_frequency = number of times a given term appears in a document

inverse_document_frequency = log(total number of documents / number of documents with term) + 1*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories (14 / 13 -> a smaller inverse document frequency), while the term “pigeons” only occurs in 2 (document frequency) of the 14 stories (total documents) (14 / 2 -> a bigger inverse document frequency, a bigger tf-idf boost).
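To see that boost in raw numbers (before the smoothing and logarithm discussed below):

“said”: 14 / 13 ≈ 1.08 (appears almost everywhere, so only a small boost)

“pigeons”: 14 / 2 = 7 (appears in few stories, so a much bigger boost)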

*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing” aka it adds a “1” to the numerator and denominator:

inverse_document_frequency = log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

Let’s test it out#

We need the log() function (otherwise known as the logarithm) for our calculation, so we’re going to import the numpy package.

import numpy as np

“said”

total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 13 ##number of short stories that contain the word "said"
term_frequency = 47 ##number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
term_frequency * inverse_document_frequency
50.24266495988672

“pigeons”

total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 2 ##number of short stories that contain the word "pigeons"
term_frequency = 30 ##number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
term_frequency * inverse_document_frequency
78.28313737302301

tf-idf scores for “The Girl Who Raised Pigeons”

“said” = 50.24
“pigeons” = 78.28

Though the word “said” appears 47 times in “The Girl Who Raised Pigeons” and the word “pigeons” only appears 30 times, “pigeons” has a higher tf-idf score than “said” because it’s a rarer word. The word “pigeons” appears in 2 of 14 stories, while “said” appears in 13 of 14 stories, almost all of them.
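Since we’ll be computing scores like these again, it can be handy to wrap the calculation in a small helper function. Here’s a minimal sketch (not part of the original workflow) that uses the same smoothed formula:

def tfidf(term_frequency, document_frequency, total_number_of_documents):
    #Smoothed inverse document frequency (the scikit-learn default)
    inverse_document_frequency = np.log((1 + total_number_of_documents) / (1 + document_frequency)) + 1
    return term_frequency * inverse_document_frequency

tfidf(47, 13, 14) #"said" -> 50.24
tfidf(30, 2, 14) #"pigeons" -> 78.28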

Get HathiTrust Extracted Features#

Now let’s try to calculate tf-idf scores for all the words in all the short stories in Lost in the City. To do so, we need word counts, or HathiTrust extracted features, for each story in the collection.

To work with HathiTrust’s extracted features, we first need to install and import the HathiTrust Feature Reader.

Install HathiTrust Feature Reader

!pip install htrc-feature-reader

Import necessary libraries

from htrc_features import Volume
import pandas as pd

Pandas Review

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

Then we need to locate the HathiTrust volume ID for Lost in the City. If we search the HathiTrust catalog for this book and then click on “Limited (search only),” it will take us to the following web page: https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129.

The HathiTrust volume ID for Lost in the City is located after id= in this URL: mdp.39015029970129.
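If you’d rather extract the ID programmatically, Python’s built-in urllib can parse it out of the URL. A quick sketch:

from urllib.parse import urlparse, parse_qs

#Pull the volume ID out of a HathiTrust page URL
url = 'https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129'
parse_qs(urlparse(url).query)['id'][0]
'mdp.39015029970129'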

Make DataFrame of Word Frequencies From Volume(s)#

Single Volume#

To get HathiTrust extracted features for a single volume, we can create a Volume object and use the .tokenlist() method.

Volume('mdp.39015029970129').tokenlist()
count
page section token pos
1 body , , 1
.046 CD 1
1993 CD 1
3560 CD 1
AWARD NN 1
... ... ... ... ...
260 body world NN 2
would MD 1
writers NNS 1
written VBN 1
SYM 1

51297 rows × 1 columns

For each page in Lost in the City, this DataFrame displays the page number and section type as well as every word/token that appears on the page, its part-of-speech, and the number of times that word/token occurs on the page. As you can see, there are 51,297 rows in this DataFrame — one for each token that appears on each page.

Let’s look at a sample of just 20 words from page 11.

Volume('mdp.39015029970129').tokenlist()[500:520]
count
page section token pos
11 body out RP 1
over IN 1
part NN 1
past IN 1
pee VB 1
pigeon NN 1
pigeons NNS 1
reach VB 1
remained VBD 1
roof NN 1
room NN 2
say VB 1
seemed VBD 1
set VBN 1
share VB 1
she PRP 7
silenccBometimes NNS 1
silence NN 1
simple JJ 1
slats NNS 1

We can also get metadata for a HathiTrust volume by asking for certain attributes.

Volume('mdp.39015029970129').year
1993
Volume('mdp.39015029970129').page_count
260
Volume('mdp.39015029970129').publisher
'HarperPerennial'
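The volume’s title works the same way (we’ll rely on volume.title in the loop below):

Volume('mdp.39015029970129').title
'Lost in the city : stories /'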

Multiple Volumes#

We might want to get extracted features for multiple volumes at the same time, so we’re also going to practice a workflow that will allow us to read in multiple HathiTrust books, even though we’re only reading in one book at this moment.

Insert list of desired HathiTrust volume(s)

volume_ids = ['mdp.39015029970129']

Loop through this list of volume IDs and make a DataFrame that includes extracted features, book title, and publication year, then make a list of all DataFrames.

all_tokens = []

for hathi_id in volume_ids:
    
    #Read in HathiTrust volume
    volume = Volume(hathi_id)
    
    #Make dataframe from token list -- do not include part of speech, sections, or case sensitivity
    token_df = volume.tokenlist(case=False, pos=False, drop_section=True)
    
    #Add book column
    token_df['book'] = volume.title
    
    #Add publication year column
    token_df['year'] = volume.year
    
    all_tokens.append(token_df)

Concatenate the list of DataFrames

lost_df = pd.concat(all_tokens)

Preview the DataFrame

lost_df
count book year
page lowercase
1 , 1 Lost in the city : stories / 1993
.046 1 Lost in the city : stories / 1993
1993 1 Lost in the city : stories / 1993
3560 1 Lost in the city : stories / 1993
a 1 Lost in the city : stories / 1993
... ... ... ... ...
260 would 1 Lost in the city : stories / 1993
writers 1 Lost in the city : stories / 1993
written 1 Lost in the city : stories / 1993
york 1 Lost in the city : stories / 1993
1 Lost in the city : stories / 1993

47307 rows × 3 columns

Change from multi-level index to regular index with reset_index()

lost_df_flattened = lost_df.reset_index()
lost_df_flattened 
page lowercase count book year
0 1 , 1 Lost in the city : stories / 1993
1 1 .046 1 Lost in the city : stories / 1993
2 1 1993 1 Lost in the city : stories / 1993
3 1 3560 1 Lost in the city : stories / 1993
4 1 a 1 Lost in the city : stories / 1993
... ... ... ... ... ...
47302 260 would 1 Lost in the city : stories / 1993
47303 260 writers 1 Lost in the city : stories / 1993
47304 260 written 1 Lost in the city : stories / 1993
47305 260 york 1 Lost in the city : stories / 1993
47306 260 1 Lost in the city : stories / 1993

47307 rows × 5 columns

Nice! We now have a DataFrame of word counts per page for Lost in the City.

But what we need to move forward with tf-idf is a way of splitting this collection into its individual stories. Remember: to use tf-idf, we need a collection of texts because we need to compare word frequency for one document with all the other documents in the collection.

Add story titles#

How can we split up Lost in the City into individual stories?

Sometimes HathiTrust Extracted Features helpfully include “section” information for a book, such as chapter titles. Unfortunately, the extracted features for Lost in the City do not include chapter or story titles.

They do, however, include page numbers and, if you specify volume.tokenlist(case=True), words with case sensitivity. When I manually combed through the HTRC token list with case sensitivity turned on, I noticed that the title page for each short story seemed to format the title in all-caps. So I searched for all-caps words from each story title and noted down the corresponding page number. This should give us a marker of where every story begins and ends.
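If you’d like to reproduce that search yourself, here’s a minimal sketch. It assumes (as I found) that a title word like “PIGEONS” appears in all-caps on the story’s title page:

#Get a case-sensitive token list (no part-of-speech tags or sections)
case_df = Volume('mdp.39015029970129').tokenlist(case=True, pos=False, drop_section=True).reset_index()

#Find the pages where an all-caps title word appears
case_df[case_df['token'] == 'PIGEONS']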

The function below will add in Lost in the City’s story titles for the correct page numbers and corresponding words.

def add_story_titles(page):
    if page >= 0 and page < 11:
        return "Front Matter"
    elif page >= 11 and page < 35:
        return "01: The Girl Who Raised Pigeons"
    elif page >= 35 and page < 41:
        return "02: The First Day"
    elif page >= 41 and page < 63:
        return "03: The Night Rhonda Ferguson Was Killed"
    elif page >= 63 and page < 85:
        return "04: Young Lions"
    elif page >= 85 and page < 113:
        return "05: The Store"
    elif page >= 113 and page < 125:
        return "06: An Orange Line Train to Ballston"
    elif page >= 125 and page < 149:
        return "07: The Sunday Following Mother's Day"
    elif page >= 149 and page < 159:
        return "08: Lost in the City"
    elif page >= 159 and page < 184:
        return "09: His Mother's House"
    elif page >= 184 and page < 191:
        return "10: A Butterfly on F Street"
    elif page >= 191 and page < 209:
        return "11: Gospel"
    elif page >= 209 and page < 225:
        return "12: A New Man"
    elif page >= 225 and page < 237:
        return "13: A Dark Night"
    elif page >= 237 and page <= 252:
        return "14: Marie"
    elif page > 252:
        return "Back Matter"

Below we add a new column of story titles to the DataFrame by apply()ing our function to the “page” column and assigning the results to lost_df_flattened['story']. You can read more about applying functions in “Pandas Basics - Part 3”.

lost_df_flattened['story'] = lost_df_flattened['page'].apply(add_story_titles)

We’re also going to drop the “Front Matter” and “Back Matter” from the DataFrame.

lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Front Matter'].index)
lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Back Matter'].index)
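An equivalent, slightly more concise alternative is to filter with isin() and the ~ (“not”) operator:

#Keep only rows whose story is NOT front or back matter
lost_df_flattened = lost_df_flattened[~lost_df_flattened['story'].isin(['Front Matter', 'Back Matter'])]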

Sum Word Counts For Each Story#

Page-level information is great. But for tf-idf purposes, we really only care about the frequency of words for every story. Below we group by story and calculate the sum of word frequencies for all the pages in that story.

lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()
story lowercase count
0 01: The Girl Who Raised Pigeons ! 8
1 01: The Girl Who Raised Pigeons ' 4
2 01: The Girl Who Raised Pigeons '' 111
3 01: The Girl Who Raised Pigeons 'd 1
4 01: The Girl Who Raised Pigeons 'll 5
... ... ... ...
18082 14: Marie yet 1
18083 14: Marie you 39
18084 14: Marie you-know-who 1
18085 14: Marie young 8
18086 14: Marie your 8

18087 rows × 3 columns

Notice how the “page” column no longer exists in the DataFrame, and our rows have slimmed down from more than 47,000 to around 18,000.

word_frequency_df = lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()

Remove Infrequent Words, Stopwords, & Punctuation#

We will conclude with some final pre-processing steps. First, we will remove the list of stopwords defined below.

Make list of stopwords

STOPS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
         'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
         'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
         'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
         'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
         'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
         'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp', "!"]

Remove stopwords

word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].isin(STOPS)].index)

We will also remove punctuation by using the regular expression [^A-Za-z\s], which matches anything that’s not a letter or a whitespace character, and drop those rows from the DataFrame.

word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].str.contains(r'[^A-Za-z\s]', regex=True)].index)
#Optionally, remove words that appear 5 or fewer times in a book
#word_frequency_df = word_frequency_df[word_frequency_df['count'] > 5]
word_frequency_df
story lowercase count
36 01: The Girl Who Raised Pigeons abandoned 2
37 01: The Girl Who Raised Pigeons able 2
40 01: The Girl Who Raised Pigeons absently 1
41 01: The Girl Who Raised Pigeons absolute 1
42 01: The Girl Who Raised Pigeons accepted 1
... ... ... ...
18079 14: Marie years 10
18080 14: Marie yes 2
18081 14: Marie yesterday 2
18082 14: Marie yet 1
18085 14: Marie young 8

15726 rows × 3 columns

TF-IDF#

Term Frequency#

We already have term frequencies for each document. Let’s rename the columns so that they’re consistent with the tf-idf vocabulary that we’ve been using.

word_frequency_df = word_frequency_df.rename(columns={'lowercase': 'term','count': 'term_frequency'})
word_frequency_df
story term term_frequency
36 01: The Girl Who Raised Pigeons abandoned 2
37 01: The Girl Who Raised Pigeons able 2
40 01: The Girl Who Raised Pigeons absently 1
41 01: The Girl Who Raised Pigeons absolute 1
42 01: The Girl Who Raised Pigeons accepted 1
... ... ... ...
18079 14: Marie years 10
18080 14: Marie yes 2
18081 14: Marie yesterday 2
18082 14: Marie yet 1
18085 14: Marie young 8

15726 rows × 3 columns

Document Frequency#

To calculate the number of documents or stories in which each term appears, we’re going to create a separate DataFrame and do some Pandas manipulation and calculation.

document_frequency_df = (word_frequency_df.groupby(['story','term']).size().unstack()).sum().reset_index()

If you inspect parts of the complex chain of Pandas methods above (which is always a great way to learn!), you will see that we’re momentarily reshaping the DataFrame to see if each term appears in each story…

word_frequency_df.groupby(['story','term']).size().unstack()
term abandoned abhored abide ability able abomination aboum aboutfcfteen abqu absently ... ypu yr ysirs ythe yuddini zigzagging zion zipped zippers zoo
story
01: The Girl Who Raised Pigeons 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 ... NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN
02: The First Day NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
03: The Night Rhonda Ferguson Was Killed NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
04: Young Lions NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN ... 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
05: The Store NaN NaN NaN 1.0 1.0 1.0 NaN 1.0 NaN NaN ... NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
06: An Orange Line Train to Ballston NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
07: The Sunday Following Mother's Day 1.0 1.0 1.0 NaN 1.0 NaN NaN NaN 1.0 NaN ... NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
08: Lost in the City NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
09: His Mother's House 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
10: A Butterfly on F Street NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
11: Gospel NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
12: A New Man NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
13: A Dark Night NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14: Marie NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

14 rows × 6207 columns

Then we’re adding up how many stories each term appears in (.sum()) and resetting the index (.reset_index()) to make a DataFrame.
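As an aside, Pandas can get the same document frequencies in a single step with .nunique(), which counts distinct values. An equivalent alternative, if you prefer it:

#For each term, count the number of distinct stories it appears in
word_frequency_df.groupby('term')['story'].nunique().reset_index(name='document_frequency')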

Finally, we will rename the column in this DataFrame and merge it into our word frequency DataFrame.

document_frequency_df = document_frequency_df.rename(columns={0:'document_frequency'})
word_frequency_df = word_frequency_df.merge(document_frequency_df)

Now we have term frequency and document frequency.

word_frequency_df
story term term_frequency document_frequency
0 01: The Girl Who Raised Pigeons abandoned 2 3.0
1 07: The Sunday Following Mother's Day abandoned 1 3.0
2 09: His Mother's House abandoned 1 3.0
3 01: The Girl Who Raised Pigeons able 2 12.0
4 03: The Night Rhonda Ferguson Was Killed able 3 12.0
... ... ... ... ...
15721 14: Marie whim 1 1.0
15722 14: Marie wilamena 20 1.0
15723 14: Marie wise 8 1.0
15724 14: Marie womanish 1 1.0
15725 14: Marie worships 1 1.0

15726 rows × 4 columns

As you can see in the DataFrame above, the term “abandoned” appears 2 times in the story “The Girl Who Raised Pigeons” (term frequency), and it appears in 3 different stories in the collection overall (document frequency).

Total Number of Documents#

To calculate the total number of documents in the collection, we count how many unique values are in the “story” column (we know the answer should be 14 short stories).

total_number_of_documents = lost_df_flattened['story'].nunique()
total_number_of_documents
14

Inverse Document Frequency#

As we previously established, there are a lot of slightly different versions of the tf-idf formula, but we’re going to use the default version from the scikit-learn library that adds “smoothing” to inverse document frequency.

inverse_document_frequency = log((1 + total number of docs) / (1 + document frequency)) + 1

import numpy as np
word_frequency_df['idf'] = np.log((1 + total_number_of_documents) / (1 + word_frequency_df['document_frequency'])) + 1

TF-IDF#

Finally, we will calculate tf-idf by multiplying term frequency and inverse document frequency together.

word_frequency_df['tfidf'] = word_frequency_df['term_frequency'] * word_frequency_df['idf']

Then we will normalize these values with the scikit-learn library.

from sklearn import preprocessing
word_frequency_df['tfidf_normalized'] = preprocessing.normalize(word_frequency_df[['tfidf']], axis=0, norm='l2')
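Because we normalized with the L2 norm along the column, the squares of the normalized scores should sum to 1. A quick sanity check, if you’re curious:

#The squared normalized tf-idf scores should sum to (approximately) 1.0
(word_frequency_df['tfidf_normalized'] ** 2).sum()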

We did it! Now let’s inspect the top 15 words with the highest tf-idf scores for each story in the collection.

word_frequency_df.sort_values(by=['story','tfidf_normalized'], ascending=[True,False]).groupby(['story']).head(15)
story term term_frequency document_frequency idf tfidf tfidf_normalized
655 01: The Girl Who Raised Pigeons betsy 44 1.0 3.014903 132.655733 0.106417
3317 01: The Girl Who Raised Pigeons jenny 42 1.0 3.014903 126.625927 0.101580
212 01: The Girl Who Raised Pigeons ann 45 2.0 2.609438 117.424706 0.094199
5566 01: The Girl Who Raised Pigeons robert 36 1.0 3.014903 108.536509 0.087069
1384 01: The Girl Who Raised Pigeons coop 28 1.0 3.014903 84.417285 0.067720
7887 01: The Girl Who Raised Pigeons would 84 14.0 1.000000 84.000000 0.067385
5053 01: The Girl Who Raised Pigeons pigeons 30 2.0 2.609438 78.283137 0.062799
4238 01: The Girl Who Raised Pigeons miss 46 10.0 1.310155 60.267127 0.048347
688 01: The Girl Who Raised Pigeons birds 29 5.0 1.916291 55.572431 0.044581
1191 01: The Girl Who Raised Pigeons clara 17 1.0 3.014903 51.253351 0.041116
5622 01: The Girl Who Raised Pigeons said 47 13.0 1.068993 50.242665 0.040305
5350 01: The Girl Who Raised Pigeons ralph 17 2.0 2.609438 44.360445 0.035586
4189 01: The Girl Who Raised Pigeons miles 21 4.0 2.098612 44.070858 0.035354
5052 01: The Girl Who Raised Pigeons pigeon 14 1.0 3.014903 42.208642 0.033860
6837 01: The Girl Who Raised Pigeons thelma 14 1.0 3.014903 42.208642 0.033860
4340 02: The First Day mother 42 13.0 1.068993 44.897701 0.036017
7772 02: The First Day woman 23 14.0 1.000000 23.000000 0.018451
8659 02: The First Day takes 7 2.0 2.609438 18.266065 0.014653
3895 02: The First Day looks 6 2.0 2.609438 15.656627 0.012560
8568 02: The First Day says 7 5.0 1.916291 13.414035 0.010761
8210 02: The First Day form 6 4.0 2.098612 12.591674 0.010101
5701 02: The First Day school 10 11.0 1.223144 12.231436 0.009812
8575 02: The First Day seaton 4 1.0 3.014903 12.059612 0.009674
4757 02: The First Day one 12 14.0 1.000000 12.000000 0.009626
8268 02: The First Day jersey 5 3.0 2.321756 11.608779 0.009313
8667 02: The First Day tells 4 3.0 2.321756 9.287023 0.007450
8036 02: The First Day appears 3 1.0 3.014903 9.044709 0.007256
8046 02: The First Day asks 3 1.0 3.014903 9.044709 0.007256
8089 02: The First Day blondelle 3 1.0 3.014903 9.044709 0.007256
8358 02: The First Day mary 3 1.0 3.014903 9.044709 0.007256
8974 03: The Night Rhonda Ferguson Was Killed cassandra 130 1.0 3.014903 391.937393 0.314415
9693 03: The Night Rhonda Ferguson Was Killed melanie 65 1.0 3.014903 195.968696 0.157207
8778 03: The Night Rhonda Ferguson Was Killed anita 68 2.0 2.609438 177.441778 0.142345
10002 03: The Night Rhonda Ferguson Was Killed rhonda 42 1.0 3.014903 126.625927 0.101580
5623 03: The Night Rhonda Ferguson Was Killed said 109 13.0 1.068993 116.520223 0.093473
9386 03: The Night Rhonda Ferguson Was Killed gladys 38 2.0 2.609438 99.158641 0.079546
8944 03: The Night Rhonda Ferguson Was Killed car 42 11.0 1.223144 51.372029 0.041211
6510 03: The Night Rhonda Ferguson Was Killed street 42 13.0 1.068993 44.897701 0.036017
453 03: The Night Rhonda Ferguson Was Killed back 39 14.0 1.000000 39.000000 0.031286
2602 03: The Night Rhonda Ferguson Was Killed get 36 13.0 1.068993 38.483743 0.030872
2646 03: The Night Rhonda Ferguson Was Killed girls 17 4.0 2.098612 35.676409 0.028620
2204 03: The Night Rhonda Ferguson Was Killed father 32 13.0 1.068993 34.207772 0.027442
10481 03: The Night Rhonda Ferguson Was Killed wesley 13 2.0 2.609438 33.922693 0.027213
9571 03: The Night Rhonda Ferguson Was Killed joyce 12 2.0 2.609438 31.313255 0.025120
9833 03: The Night Rhonda Ferguson Was Killed pearl 11 2.0 2.609438 28.703817 0.023026
10696 04: Young Lions caesar 75 1.0 3.014903 226.117727 0.181393
11527 04: Young Lions sherman 60 1.0 3.014903 180.894181 0.145114
11198 04: Young Lions manny 44 1.0 3.014903 132.655733 0.106417
10701 04: Young Lions carol 29 1.0 3.014903 87.432188 0.070139
5624 04: Young Lions said 71 13.0 1.068993 75.898494 0.060886
11452 04: Young Lions retarded 22 2.0 2.609438 57.407634 0.046053
7890 04: Young Lions would 57 14.0 1.000000 57.000000 0.045726
11016 04: Young Lions heh 17 1.0 3.014903 51.253351 0.041116
7774 04: Young Lions woman 44 14.0 1.000000 44.000000 0.035297
10568 04: Young Lions anna 13 1.0 3.014903 39.193739 0.031441
4006 04: Young Lions man 34 13.0 1.068993 36.345758 0.029157
454 04: Young Lions back 35 14.0 1.000000 35.000000 0.028077
1401 04: Young Lions could 29 13.0 1.068993 31.000793 0.024869
7199 04: Young Lions two 30 14.0 1.000000 30.000000 0.024066
2205 04: Young Lions father 28 13.0 1.068993 29.931800 0.024011
12577 05: The Store penny 57 2.0 2.609438 148.737961 0.119319
5625 05: The Store said 79 13.0 1.068993 84.450437 0.067747
7891 05: The Store would 79 14.0 1.000000 79.000000 0.063374
6479 05: The Store store 51 10.0 1.310155 66.817901 0.053602
12353 05: The Store jenkins 19 1.0 3.014903 57.283157 0.045953
12375 05: The Store kentucky 23 3.0 2.321756 53.400384 0.042838
12427 05: The Store lonney 17 1.0 3.014903 51.253351 0.041116
455 05: The Store back 50 14.0 1.000000 50.000000 0.040110
4760 05: The Store one 48 14.0 1.000000 48.000000 0.038506
4343 05: The Store mother 42 13.0 1.068993 44.897701 0.036017
6969 05: The Store time 42 14.0 1.000000 42.000000 0.033693
1552 05: The Store day 40 14.0 1.000000 40.000000 0.032088
1402 05: The Store could 34 13.0 1.068993 36.345758 0.029157
2206 05: The Store father 32 13.0 1.068993 34.207772 0.027442
11865 05: The Store baxter 11 1.0 3.014903 33.163933 0.026604
13144 06: An Orange Line Train to Ballston marcus 38 1.0 3.014903 114.566315 0.091906
13046 06: An Orange Line Train to Ballston avis 26 1.0 3.014903 78.387479 0.062883
13146 06: An Orange Line Train to Ballston marvin 24 1.0 3.014903 72.357672 0.058046
13145 06: An Orange Line Train to Ballston marvella 23 1.0 3.014903 69.342769 0.055627
5626 06: An Orange Line Train to Ballston said 64 13.0 1.068993 68.415544 0.054883
4008 06: An Orange Line Train to Ballston man 63 13.0 1.068993 67.346551 0.054026
7084 06: An Orange Line Train to Ballston train 25 5.0 1.916291 47.907268 0.038432
13216 06: An Orange Line Train to Ballston subway 15 1.0 3.014903 45.223545 0.036279
13087 06: An Orange Line Train to Ballston dreadlocks 11 1.0 3.014903 33.163933 0.026604
13086 06: An Orange Line Train to Ballston dreadlock 8 1.0 3.014903 24.119224 0.019349
3765 06: An Orange Line Train to Ballston line 18 10.0 1.310155 23.582789 0.018918
4800 06: An Orange Line Train to Ballston orange 11 4.0 2.098612 23.084735 0.018519
4344 06: An Orange Line Train to Ballston mother 17 13.0 1.068993 18.172879 0.014578
13050 06: An Orange Line Train to Ballston ballston 6 1.0 3.014903 18.089418 0.014511
3741 06: An Orange Line Train to Ballston like 17 14.0 1.000000 17.000000 0.013638
13627 07: The Sunday Following Mother's Day madeleine 74 1.0 3.014903 223.102824 0.178974
13626 07: The Sunday Following Mother's Day maddie 62 1.0 3.014903 186.923987 0.149952
13770 07: The Sunday Following Mother's Day samuel 41 1.0 3.014903 123.611024 0.099162
13769 07: The Sunday Following Mother's Day sam 34 1.0 3.014903 102.506703 0.082232
5627 07: The Sunday Following Mother's Day said 81 13.0 1.068993 86.588423 0.069462
7893 07: The Sunday Following Mother's Day would 71 14.0 1.000000 71.000000 0.056957
13711 07: The Sunday Following Mother's Day pookie 21 1.0 3.014903 63.312963 0.050790
9093 07: The Sunday Following Mother's Day curtis 16 2.0 2.609438 41.751007 0.033493
13281 07: The Sunday Following Mother's Day arnisa 13 1.0 3.014903 39.193739 0.031441
13893 07: The Sunday Following Mother's Day williams 12 1.0 3.014903 36.178836 0.029023
1404 07: The Sunday Following Mother's Day could 33 13.0 1.068993 35.276765 0.028299
1554 07: The Sunday Following Mother's Day day 35 14.0 1.000000 35.000000 0.028077
4009 07: The Sunday Following Mother's Day man 32 13.0 1.068993 34.207772 0.027442
457 07: The Sunday Following Mother's Day back 33 14.0 1.000000 33.000000 0.026473
4762 07: The Sunday Following Mother's Day one 32 14.0 1.000000 32.000000 0.025671
14079 08: Lost in the City lydia 32 1.0 3.014903 96.476897 0.077394
5628 08: Lost in the City said 46 13.0 1.068993 49.173672 0.039447
4346 08: Lost in the City mother 44 13.0 1.068993 47.035686 0.037732
10969 08: Lost in the City georgia 19 3.0 2.321756 44.113361 0.035388
972 08: Lost in the City cab 13 5.0 1.916291 24.911780 0.019984
9157 08: Lost in the City driver 8 5.0 1.916291 15.330326 0.012298
13908 08: Lost in the City antibes 5 1.0 3.014903 15.074515 0.012093
13976 08: Lost in the City dreaming 5 1.0 3.014903 15.074515 0.012093
3455 08: Lost in the City know 15 14.0 1.000000 15.000000 0.012033
7894 08: Lost in the City would 15 14.0 1.000000 15.000000 0.012033
2607 08: Lost in the City get 14 13.0 1.068993 14.965900 0.012006
4010 08: Lost in the City man 14 13.0 1.068993 14.965900 0.012006
6925 08: Lost in the City thought 14 14.0 1.000000 14.000000 0.011231
10189 08: Lost in the City sorry 9 8.0 1.510826 13.597431 0.010908
4763 08: Lost in the City one 13 14.0 1.000000 13.000000 0.010429
9572 09: His Mother's House joyce 84 2.0 2.609438 219.192785 0.175838
14513 09: His Mother's House rickey 64 1.0 3.014903 192.953793 0.154789
14529 09: His Mother's House santiago 54 1.0 3.014903 162.804763 0.130603
5629 09: His Mother's House said 96 13.0 1.068993 102.623316 0.082325
14384 09: His Mother's House humphrey 33 1.0 3.014903 99.491800 0.079813
9834 09: His Mother's House pearl 22 2.0 2.609438 57.407634 0.046053
14527 09: His Mother's House sandy 18 1.0 3.014903 54.268254 0.043534
4764 09: His Mother's House one 50 14.0 1.000000 50.000000 0.040110
7895 09: His Mother's House would 49 14.0 1.000000 49.000000 0.039308
14577 09: His Mother's House smokey 16 1.0 3.014903 48.238448 0.038697
3172 09: His Mother's House house 35 12.0 1.143101 40.008530 0.032095
3744 09: His Mother's House like 38 14.0 1.000000 38.000000 0.030484
4347 09: His Mother's House mother 35 13.0 1.068993 37.414751 0.030014
6671 09: His Mother's House table 29 11.0 1.223144 35.471163 0.028455
7779 09: His Mother's House woman 35 14.0 1.000000 35.000000 0.028077
9700 10: A Butterfly on F Street mildred 27 2.0 2.609438 70.454824 0.056519
7780 10: A Butterfly on F Street woman 29 14.0 1.000000 29.000000 0.023264
14675 10: A Butterfly on F Street butterfly 7 1.0 3.014903 21.104321 0.016930
14710 10: A Butterfly on F Street mansfield 6 1.0 3.014903 18.089418 0.014511
5630 10: A Butterfly on F Street said 16 13.0 1.068993 17.103886 0.013721
6517 10: A Butterfly on F Street street 14 13.0 1.068993 14.965900 0.012006
14713 10: A Butterfly on F Street median 4 1.0 3.014903 12.059612 0.009674
14750 10: A Butterfly on F Street woolworth 4 1.0 3.014903 12.059612 0.009674
460 10: A Butterfly on F Street back 10 14.0 1.000000 10.000000 0.008022
5672 10: A Butterfly on F Street say 9 13.0 1.068993 9.620936 0.007718
14714 10: A Butterfly on F Street morton 3 1.0 3.014903 9.044709 0.007256
9258 10: A Butterfly on F Street f 5 6.0 1.762140 8.810700 0.007068
1557 10: A Butterfly on F Street day 8 14.0 1.000000 8.000000 0.006418
7896 10: A Butterfly on F Street would 8 14.0 1.000000 8.000000 0.006418
9387 10: A Butterfly on F Street gladys 3 2.0 2.609438 7.828314 0.006280
15094 11: Gospel vivian 68 1.0 3.014903 205.013405 0.164463
14845 11: Gospel diane 42 1.0 3.014903 126.625927 0.101580
14943 11: Gospel maude 32 1.0 3.014903 96.476897 0.077394
5631 11: Gospel said 77 13.0 1.068993 82.312451 0.066032
8779 11: Gospel anita 25 2.0 2.609438 65.235948 0.052333
15005 11: Gospel reverend 18 2.0 2.609438 46.969882 0.037680
14895 11: Gospel gospelteers 15 1.0 3.014903 45.223545 0.036279
14938 11: Gospel mae 15 1.0 3.014903 45.223545 0.036279
2845 11: Gospel group 26 7.0 1.628609 42.343825 0.033968
7897 11: Gospel would 42 14.0 1.000000 42.000000 0.033693
1408 11: Gospel could 39 13.0 1.068993 41.690722 0.033445
1176 11: Gospel church 26 9.0 1.405465 36.542093 0.029314
14835 11: Gospel counsel 12 1.0 3.014903 36.178836 0.029023
7795 11: Gospel women 31 12.0 1.143101 35.436126 0.028427
4013 11: Gospel man 31 13.0 1.068993 33.138779 0.026584
15347 12: A New Man woodrow 57 1.0 3.014903 171.849472 0.137859
15280 12: A New Man rita 22 1.0 3.014903 66.327866 0.053209
7898 12: A New Man would 37 14.0 1.000000 37.000000 0.029682
5632 12: A New Man said 33 13.0 1.068993 35.276765 0.028299
4014 12: A New Man man 27 13.0 1.068993 28.862808 0.023154
15185 12: A New Man elaine 9 1.0 3.014903 27.134127 0.021767
1409 12: A New Man could 23 13.0 1.068993 24.586836 0.019724
2213 12: A New Man father 22 13.0 1.068993 23.517843 0.018866
4767 12: A New Man one 23 14.0 1.000000 23.000000 0.018451
15161 12: A New Man cunningham 7 1.0 3.014903 21.104321 0.016930
1547 12: A New Man daughter 16 10.0 1.310155 20.962479 0.016816
8531 12: A New Man read 12 7.0 1.628609 19.543304 0.015678
2029 12: A New Man even 16 13.0 1.068993 17.103886 0.013721
4732 12: A New Man old 16 13.0 1.068993 17.103886 0.013721
5300 12: A New Man put 16 13.0 1.068993 17.103886 0.013721
15424 13: A Dark Night garrett 38 1.0 3.014903 114.566315 0.091906
15361 13: A Dark Night beatrice 31 1.0 3.014903 93.461994 0.074976
15377 13: A Dark Night carmena 20 1.0 3.014903 60.298060 0.048371
5633 13: A Dark Night said 49 13.0 1.068993 52.380651 0.042020
10438 13: A Dark Night uncle 14 2.0 2.609438 36.532131 0.029306
15517 13: A Dark Night thunder 12 1.0 3.014903 36.178836 0.029023
15433 13: A Dark Night henry 10 1.0 3.014903 30.149030 0.024186
1737 13: A Dark Night door 27 14.0 1.000000 27.000000 0.021660
1516 13: A Dark Night daddy 17 8.0 1.510826 25.684036 0.020604
15442 13: A Dark Night joe 8 1.0 3.014903 24.119224 0.019349
4768 13: A Dark Night one 23 14.0 1.000000 23.000000 0.018451
15364 13: A Dark Night boone 7 1.0 3.014903 21.104321 0.016930
15451 13: A Dark Night lightning 7 1.0 3.014903 21.104321 0.016930
7899 13: A Dark Night would 21 14.0 1.000000 21.000000 0.016846
14055 13: A Dark Night john 10 4.0 2.098612 20.986123 0.016835
13633 14: Marie marie 49 2.0 2.609438 127.862458 0.102572
15719 14: Marie vernelle 30 1.0 3.014903 90.447091 0.072557
15722 14: Marie wilamena 20 1.0 3.014903 60.298060 0.048371
5634 14: Marie said 51 13.0 1.068993 54.518636 0.043735
7900 14: Marie would 36 14.0 1.000000 36.000000 0.028879
4016 14: Marie man 32 13.0 1.068993 34.207772 0.027442
11492 14: Marie security 13 2.0 2.609438 33.922693 0.027213
7784 14: Marie woman 31 14.0 1.000000 31.000000 0.024868
15662 14: Marie receptionist 10 1.0 3.014903 30.149030 0.024186
7027 14: Marie told 25 13.0 1.068993 26.724822 0.021439
11578 14: Marie social 10 2.0 2.609438 26.094379 0.020933
4769 14: Marie one 25 14.0 1.000000 25.000000 0.020055
15552 14: Marie calhoun 8 1.0 3.014903 24.119224 0.019349
15723 14: Marie wise 8 1.0 3.014903 24.119224 0.019349
15037 14: Marie smith 9 2.0 2.609438 23.484941 0.018840

It turns out that the word “pigeons” is pretty unique to the first short story in Lost in the City and has a normalized tf-idf score of .062, making it one of the most distinctive words in that story along with “coop” and “birds.”

What are some other distinctive words in Lost in the City?
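One way to explore is to filter the DataFrame for a single story. A minimal sketch (swap in any story title from the table above):

#Top 15 most distinctive words in "08: Lost in the City"
lost_story = word_frequency_df[word_frequency_df['story'] == '08: Lost in the City']
lost_story.sort_values(by='tfidf_normalized', ascending=False).head(15)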

Further Resources#