TF-IDF with HathiTrust Data

In this lesson, we’re going to learn about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf.

While calculating the most frequent words in a text can be useful, the most frequent words usually aren’t the most interesting ones, even if we get rid of stop words (“the,” “and,” “to,” etc.). Tf-idf is a method that builds on word frequency, but it more specifically tries to identify the most distinctively frequent or significant words in a document.

In this lesson, we will cover how to:

  • Calculate and normalize tf-idf scores for each short story in Edward P. Jones’s Lost in the City

  • Download and process HathiTrust extracted features — that is, word frequencies for books in the HathiTrust Digital Library (including in-copyright books like Lost in the City)

  • Prepare HathiTrust extracted features for tf-idf analysis

Dataset

Lost in the City by Edward P. Jones

[T]he pigeon had taken a step and dropped from the ledge. He caught an upwind that took him nearly as high as the tops of the empty K Street houses. He flew farther into Northeast, into the color and sounds of the city’s morning. She did nothing, aside from following him, with her eyes, with her heart, as far as she could.

—Edward P. Jones, “The Girl Who Raised Pigeons,” Lost in the City (1993)

Edward P. Jones’s Lost in the City (1993) is a collection of 14 short stories set in Washington D.C. The first short story, “The Girl Who Raised Pigeons,” begins with a young girl raising homing pigeons on her roof.

How distinctive is a “pigeon” in the world of Lost in the City? What does this uniqueness (or lack thereof) tell us about the meaning of pigeons in the first short story, “The Girl Who Raised Pigeons,” and in the collection as a whole? These are just a few of the questions that we’re going to try to answer with tf-idf.

If you already have a collection of plain text (.txt) files that you’d like to analyze, one of the easiest ways to calculate tf-idf scores is to use the Python library scikit-learn. It has a quick and nifty module called TfidfVectorizer, which does all the math for you behind the scenes. We will cover how to use the TfidfVectorizer in the next lesson.

In this lesson, however, we’re going to calculate tf-idf scores manually because Lost in the City is still in-copyright, which means that, for legal reasons, we can’t easily share or access plain text files of the book.

Luckily, the HathiTrust Digital Library—which contains digitized books from Google Books as well as many university libraries—has released word frequencies per page for all 17 million books in its catalog. These word frequencies (plus part of speech tags) are otherwise known as “extracted features.” There’s a lot of text analysis that we can do with extracted features alone, including tf-idf.

So to calculate tf-idf scores for Lost in the City, we’re going to use HathiTrust extracted features. That’s why we’re not using scikit-learn’s TfidfVectorizer: it works great with plain text files but not so great with extracted features.

Breaking Down the TF-IDF Formula

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

tf-idf = term_frequency * inverse_document_frequency

term_frequency = number of times a given term appears in a document

inverse_document_frequency = log(total number of documents / number of documents with term) + 1*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories (14 / 13 –> a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 14 stories (total documents) (14 / 2 –> a bigger inverse document frequency, a bigger tf-idf boost).

*There are a bunch of slightly different ways to calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing,” meaning it adds a 1 to both the numerator and the denominator:

inverse_document_frequency = log((1 + total_number_of_documents) / (number_of_documents_with_term + 1)) + 1

Let’s test it out

We need the log() function, otherwise known as the logarithm, for our calculation, so we’re going to import the numpy package.

import numpy as np

“said”

total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 13 ##number of short stories that contain the word "said"
term_frequency = 47 ##number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
term_frequency * inverse_document_frequency
50.24266495988672

“pigeons”

total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 2 ##number of short stories that contain the word "pigeons"
term_frequency = 30 ##number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
term_frequency * inverse_document_frequency
78.28313737302301

tf–idf scores for “The Girl Who Raised Pigeons”

“said” = 50.24
“pigeons” = 78.28

Though the word “said” appears 47 times in “The Girl Who Raised Pigeons” and the word “pigeons” only appears 30 times, “pigeons” has a higher tf–idf score than “said” because it’s a rarer word. The word “pigeons” appears in 2 of 14 stories, while “said” appears in 13 of 14 stories, almost all of them.
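The two manual calculations above can be wrapped in a small helper function. This is just a convenience sketch for this lesson; `tfidf_score` is our own name, not part of any library:

```python
import numpy as np

def tfidf_score(term_frequency, document_frequency, total_documents):
    """Smoothed tf-idf, using the scikit-learn default idf formula."""
    idf = np.log((1 + total_documents) / (1 + document_frequency)) + 1
    return term_frequency * idf

tfidf_score(47, 13, 14)  # "said" -> ~50.24
tfidf_score(30, 2, 14)   # "pigeons" -> ~78.28
```

Any term in any document can now be scored with one call, as long as we know its term frequency, its document frequency, and the total number of documents.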

Get HathiTrust Extracted Features

Now let’s try to calculate tf-idf scores for all the words in all the short stories in Lost in the City. To do so, we need word counts, or HathiTrust extracted features, for each story in the collection.

To work with HathiTrust’s extracted features, we first need to install and import the HathiTrust Feature Reader.

Install HathiTrust Feature Reader

!pip install htrc-feature-reader

Import necessary libraries

from htrc_features import Volume
import pandas as pd

Then we need to locate the HathiTrust volume ID for Lost in the City. If we search the HathiTrust catalog for this book and then click on “Limited (search only),” it will take us to the following web page: https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129.

The HathiTrust volume ID for Lost in the City is located after id= in this URL: mdp.39015029970129.
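If you’d rather pull the ID out of the URL programmatically, Python’s standard library can parse the query string for you (a small sketch; the url variable below is just the page we landed on):

```python
from urllib.parse import urlparse, parse_qs

url = "https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129"
# parse_qs turns the query string into a dict of lists, e.g. {"id": ["mdp.39015029970129"]}
volume_id = parse_qs(urlparse(url).query)["id"][0]
volume_id  # 'mdp.39015029970129'
```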

Make DataFrame of Word Frequencies From Volume(s)

Single Volume

To get HathiTrust extracted features for a single volume, we can create a [Volume object](https://github.com/htrc/htrc-feature-reader#volume) and use the .tokenlist() method.

Volume('mdp.39015029970129').tokenlist()
count
page section token pos
1 body , , 1
.046 CD 1
1993 CD 1
3560 CD 1
AWARD NN 1
... ... ... ... ...
260 body world NN 2
would MD 1
writers NNS 1
written VBN 1
SYM 1

51297 rows × 1 columns

For each page in Lost in the City, this DataFrame displays the page number and section type as well as every word/token that appears on the page, its part-of-speech, and the number of times that word/token occurs on the page. As you can see, there are 51,297 rows in this DataFrame — one for each token that appears on each page.

Let’s look at a sample of just 20 words from page 11.

Volume('mdp.39015029970129').tokenlist()[500:520]
count
page section token pos
11 body out RP 1
over IN 1
part NN 1
past IN 1
pee VB 1
pigeon NN 1
pigeons NNS 1
reach VB 1
remained VBD 1
roof NN 1
room NN 2
say VB 1
seemed VBD 1
set VBN 1
share VB 1
she PRP 7
silenccBometimes NNS 1
silence NN 1
simple JJ 1
slats NNS 1

We can also get metadata for a HathiTrust volume by asking for certain attributes.

Volume('mdp.39015029970129').year
1993
Volume('mdp.39015029970129').page_count
260
Volume('mdp.39015029970129').publisher
'HarperPerennial'

Multiple Volumes

We might want to get extracted features for multiple volumes at the same time, so we’re also going to practice a workflow that will allow us to read in multiple HathiTrust books, even though we’re only reading in one book at this moment.

Insert list of desired HathiTrust volume(s)

volume_ids = ['mdp.39015029970129']

Loop through this list of volume IDs and make a DataFrame that includes extracted features, book title, and publication year, then make a list of all DataFrames.

all_tokens = []

for hathi_id in volume_ids:
    
    #Read in HathiTrust volume
    volume = Volume(hathi_id)
    
    #Make dataframe from token list -- do not include part of speech, sections, or case sensitivity
    token_df = volume.tokenlist(case=False, pos=False, drop_section=True)
    
    #Add book column
    token_df['book'] = volume.title
    
    #Add publication year column
    token_df['year'] = volume.year
    
    all_tokens.append(token_df)

Concatenate the list of DataFrames

lost_df = pd.concat(all_tokens)

Preview the DataFrame

lost_df
count book year
page lowercase
1 , 1 Lost in the city : stories / 1993
.046 1 Lost in the city : stories / 1993
1993 1 Lost in the city : stories / 1993
3560 1 Lost in the city : stories / 1993
a 1 Lost in the city : stories / 1993
... ... ... ... ...
260 would 1 Lost in the city : stories / 1993
writers 1 Lost in the city : stories / 1993
written 1 Lost in the city : stories / 1993
york 1 Lost in the city : stories / 1993
1 Lost in the city : stories / 1993

47307 rows × 3 columns

Change from multi-level index to regular index with reset_index()

lost_df_flattened = lost_df.reset_index()
lost_df_flattened 
page lowercase count book year
0 1 , 1 Lost in the city : stories / 1993
1 1 .046 1 Lost in the city : stories / 1993
2 1 1993 1 Lost in the city : stories / 1993
3 1 3560 1 Lost in the city : stories / 1993
4 1 a 1 Lost in the city : stories / 1993
... ... ... ... ... ...
47302 260 would 1 Lost in the city : stories / 1993
47303 260 writers 1 Lost in the city : stories / 1993
47304 260 written 1 Lost in the city : stories / 1993
47305 260 york 1 Lost in the city : stories / 1993
47306 260 1 Lost in the city : stories / 1993

47307 rows × 5 columns

Nice! We now have a DataFrame of word counts per page for Lost in the City.

But what we need to move forward with tf-idf is a way of splitting this collection into its individual stories. Remember: to use tf-idf, we need a collection of texts because we need to compare word frequency for one document with all the other documents in the collection.

Add story titles

How can we split up Lost in the City into individual stories?

Sometimes HathiTrust Extracted Features helpfully include “section” information for a book, such as chapter titles. Unfortunately, the extracted features for Lost in the City do not include chapter or story titles.

They do, however, include page numbers and, if you specify volume.tokenlist(case=True), words with case sensitivity. When I manually combed through the HTRC token list with case sensitivity turned on, I noticed that the title page for each short story seemed to format the title in all-caps. So I searched for all-caps words from each story title and noted down the corresponding page number. This should give us a marker of where every story begins and ends.
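Here’s a sketch of what that hunt for all-caps title words might look like in code. We use a toy stand-in DataFrame rather than the real case-sensitive token list, but the column names mirror the flattened token lists we work with below:

```python
import pandas as pd

# Toy stand-in for a flattened, case-sensitive token list
tokens = pd.DataFrame({
    "page": [11, 11, 35, 35],
    "token": ["PIGEONS", "she", "FIRST", "the"],
})

# Alphabetic tokens that are entirely upper-case are likely title-page words;
# the first page where each one appears marks a candidate story boundary
all_caps = tokens[tokens["token"].str.isupper() & tokens["token"].str.isalpha()]
title_pages = all_caps.groupby("token")["page"].min()
```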

The function below assigns each page of Lost in the City to its story title based on these page ranges.

def add_story_titles(page):
    if page >= 0 and page < 11:
        return "Front Matter"
    elif page >= 11 and page < 35:
        return "01: The Girl Who Raised Pigeons"
    elif page >= 35 and page < 41:
        return "02: The First Day"
    elif page >= 41 and page < 63:
        return "03: The Night Rhonda Ferguson Was Killed"
    elif page >= 63 and page < 85:
        return "04: Young Lions"
    elif page >= 85 and page < 113:
        return "05: The Store"
    elif page >= 113 and page < 125:
        return "06: An Orange Line Train to Ballston"
    elif page >= 125 and page < 149:
        return "07: The Sunday Following Mother's Day"
    elif page >= 149 and page < 159:
        return "08: Lost in the City"
    elif page >= 159 and page < 184:
        return "09: His Mother's House"
    elif page >= 184 and page < 191:
        return "10: A Butterfly on F Street"
    elif page >= 191 and page < 209:
        return "11: Gospel"
    elif page >= 209 and page < 225:
        return "12: A New Man"
    elif page >= 225 and page < 237:
        return "13: A Dark Night"
    elif page >= 237 and page <= 252:
        return "14: Marie"
    elif page > 252:
        return "Back Matter"

Below we add a new column of story titles to the DataFrame by apply()ing our function to the “page” column and dumping the results to lost_df_flattened['story']. You can read more about applying functions in “Pandas Basics - Part 3”.

lost_df_flattened['story'] = lost_df_flattened['page'].apply(add_story_titles)

We’re also going to drop the “Front Matter” and “Back Matter” from the DataFrame.

lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Front Matter'].index)
lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Back Matter'].index)

Sum Word Counts For Each Story

Page-level information is great. But for tf-idf purposes, we really only care about the frequency of words for every story. Below we group by story and calculate the sum of word frequencies for all the pages in that story.

lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()
story lowercase count
0 01: The Girl Who Raised Pigeons ! 8
1 01: The Girl Who Raised Pigeons ' 4
2 01: The Girl Who Raised Pigeons '' 111
3 01: The Girl Who Raised Pigeons 'd 1
4 01: The Girl Who Raised Pigeons 'll 5
... ... ... ...
18082 14: Marie yet 1
18083 14: Marie you 39
18084 14: Marie you-know-who 1
18085 14: Marie young 8
18086 14: Marie your 8

18087 rows × 3 columns

Notice how the “page” column no longer exists in the DataFrame and our rows have slimmed down from more than 47,000 to about 18,000.

word_frequency_df = lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()

Remove Infrequent Words, Stopwords, & Punctuation

Before moving on to tf-idf, we need a few final pre-processing steps. First, we will remove the list of stopwords defined below.

Make list of stopwords

STOPS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
         'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
         'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
         'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
         'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
         'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
         'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp', "!"]

Remove stopwords

word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].isin(STOPS)].index)

We will also remove punctuation and numbers by using the regular expression [^A-Za-z\s], which matches any character that is not a letter or whitespace; every row whose token contains such a character gets dropped from the DataFrame.

word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].str.contains('[^A-Za-z\s]', regex=True)].index)
#Remove words that appear 5 or fewer times in a book (optional -- uncomment to test)
#word_frequency_df_test = word_frequency_df[word_frequency_df['count'] > 5]
word_frequency_df
story lowercase count
36 01: The Girl Who Raised Pigeons abandoned 2
37 01: The Girl Who Raised Pigeons able 2
40 01: The Girl Who Raised Pigeons absently 1
41 01: The Girl Who Raised Pigeons absolute 1
42 01: The Girl Who Raised Pigeons accepted 1
... ... ... ...
18079 14: Marie years 10
18080 14: Marie yes 2
18081 14: Marie yesterday 2
18082 14: Marie yet 1
18085 14: Marie young 8

15726 rows × 3 columns
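To see exactly what the regular expression [^A-Za-z\s] matches, here is a quick standalone check on a few made-up tokens:

```python
import re

tokens = ["pigeons", "don't", "1993", "k-street", "said"]
# [^A-Za-z\s] matches any character that is not a letter or whitespace,
# so tokens containing digits, apostrophes, or hyphens are filtered out
kept = [t for t in tokens if not re.search(r"[^A-Za-z\s]", t)]
kept  # ['pigeons', 'said']
```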

TF-IDF

Term Frequency

We already have term frequencies for each document. Let’s rename the columns so that they’re consistent with the tf-idf vocabulary that we’ve been using.

word_frequency_df = word_frequency_df.rename(columns={'lowercase': 'term','count': 'term_frequency'})
word_frequency_df
story term term_frequency
36 01: The Girl Who Raised Pigeons abandoned 2
37 01: The Girl Who Raised Pigeons able 2
40 01: The Girl Who Raised Pigeons absently 1
41 01: The Girl Who Raised Pigeons absolute 1
42 01: The Girl Who Raised Pigeons accepted 1
... ... ... ...
18079 14: Marie years 10
18080 14: Marie yes 2
18081 14: Marie yesterday 2
18082 14: Marie yet 1
18085 14: Marie young 8

15726 rows × 3 columns

Document Frequency

To calculate the number of documents or stories in which each term appears, we’re going to create a separate DataFrame and do some Pandas manipulation and calculation.

document_frequency_df = (word_frequency_df.groupby(['story','term']).size().unstack()).sum().reset_index()

If you inspect parts of the complex chain of Pandas methods above (which is always a great way to learn!), you will see that we’re momentarily reshaping the DataFrame to see if each term appears in each story…

word_frequency_df.groupby(['story','term']).size().unstack()
term abandoned abhored abide ability able abomination aboum aboutfcfteen abqu absently ... ypu yr ysirs ythe yuddini zigzagging zion zipped zippers zoo
story
01: The Girl Who Raised Pigeons 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 ... NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN
02: The First Day NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
03: The Night Rhonda Ferguson Was Killed NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
04: Young Lions NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN ... 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
05: The Store NaN NaN NaN 1.0 1.0 1.0 NaN 1.0 NaN NaN ... NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
06: An Orange Line Train to Ballston NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
07: The Sunday Following Mother's Day 1.0 1.0 1.0 NaN 1.0 NaN NaN NaN 1.0 NaN ... NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
08: Lost in the City NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
09: His Mother's House 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
10: A Butterfly on F Street NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
11: Gospel NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
12: A New Man NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
13: A Dark Night NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14: Marie NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

14 rows × 6207 columns

Then we’re adding up how many stories each term appears in (.sum()) and resetting the index (.reset_index()) to make a DataFrame.
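An equivalent and somewhat more direct route to the same counts is groupby plus nunique, sketched here on a toy DataFrame so you can verify the logic by eye:

```python
import pandas as pd

toy = pd.DataFrame({
    "story": ["01", "01", "02", "03"],
    "term": ["pigeon", "said", "said", "said"],
})
# Count the number of distinct stories each term appears in
document_frequency = toy.groupby("term")["story"].nunique()
```

We stick with the unstack-and-sum chain in the lesson because the reshaped intermediate DataFrame is a useful thing to inspect, but both approaches produce the same document frequencies.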

Finally, we will rename the column in this DataFrame and merge it into our word frequency DataFrame.

document_frequency_df = document_frequency_df.rename(columns={0:'document_frequency'})
word_frequency_df = word_frequency_df.merge(document_frequency_df)

Now we have term frequency and document frequency.

word_frequency_df
story term term_frequency document_frequency
0 01: The Girl Who Raised Pigeons abandoned 2 3.0
1 07: The Sunday Following Mother's Day abandoned 1 3.0
2 09: His Mother's House abandoned 1 3.0
3 01: The Girl Who Raised Pigeons able 2 12.0
4 03: The Night Rhonda Ferguson Was Killed able 3 12.0
... ... ... ... ...
15721 14: Marie whim 1 1.0
15722 14: Marie wilamena 20 1.0
15723 14: Marie wise 8 1.0
15724 14: Marie womanish 1 1.0
15725 14: Marie worships 1 1.0

15726 rows × 4 columns

As you can see in the DataFrame above, the term “abandoned” appears 2 times in the story “The Girl Who Raised Pigeons” (term frequency), and it appears in 3 different stories in the collection overall (document frequency).

Total Number of Documents

To calculate the total number of documents in the collection, we count how many unique values are in the “story” column (we know the answer should be 14 short stories).

total_number_of_documents = lost_df_flattened['story'].nunique()
total_number_of_documents
14

Inverse Document Frequency

As we previously established, there are a lot of slightly different versions of the tf-idf formula, but we’re going to use the default version from the scikit-learn library that adds “smoothing” to inverse document frequency.

inverse_document_frequency = log [ (1 + total number of docs) / (1 + document frequency) ] + 1
import numpy as np
word_frequency_df['idf'] = np.log((1 + total_number_of_documents) / (1 + word_frequency_df['document_frequency'])) + 1

TF-IDF

Finally, we will calculate tf-idf by multiplying term frequency and inverse document frequency together.

word_frequency_df['tfidf'] = word_frequency_df['term_frequency'] * word_frequency_df['idf']

Then we will normalize these values with the scikit-learn library.

from sklearn import preprocessing
word_frequency_df['tfidf_normalized'] = preprocessing.normalize(word_frequency_df[['tfidf']], axis=0, norm='l2')
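L2 normalization divides every tf-idf score by the same quantity, the square root of the sum of the squared scores, so the normalized column has unit length while the relative rankings within it stay the same. A quick check with numpy on a few hypothetical tf-idf values:

```python
import numpy as np

scores = np.array([132.66, 126.63, 117.42])  # hypothetical tf-idf values
l2_norm = np.sqrt((scores ** 2).sum())
normalized = scores / l2_norm

# The normalized vector has length 1, and the ordering of values is unchanged
np.sqrt((normalized ** 2).sum())  # 1.0
```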

We did it! Now let’s inspect the top 15 words with the highest tf-idf scores for each story in the collection.

word_frequency_df.sort_values(by=['story','tfidf_normalized'], ascending=[True,False]).groupby(['story']).head(15)
story term term_frequency document_frequency idf tfidf tfidf_normalized
655 01: The Girl Who Raised Pigeons betsy 44 1.0 3.014903 132.655733 0.106417
3317 01: The Girl Who Raised Pigeons jenny 42 1.0 3.014903 126.625927 0.101580
212 01: The Girl Who Raised Pigeons ann 45 2.0 2.609438 117.424706 0.094199
5566 01: The Girl Who Raised Pigeons robert 36 1.0 3.014903 108.536509 0.087069
1384 01: The Girl Who Raised Pigeons coop 28 1.0 3.014903 84.417285 0.067720
7887 01: The Girl Who Raised Pigeons would 84 14.0 1.000000 84.000000 0.067385
5053 01: The Girl Who Raised Pigeons pigeons 30 2.0 2.609438 78.283137 0.062799
4238 01: The Girl Who Raised Pigeons miss 46 10.0 1.310155 60.267127 0.048347
688 01: The Girl Who Raised Pigeons birds 29 5.0 1.916291 55.572431 0.044581
1191 01: The Girl Who Raised Pigeons clara 17 1.0 3.014903 51.253351 0.041116
5622 01: The Girl Who Raised Pigeons said 47 13.0 1.068993 50.242665 0.040305
5350 01: The Girl Who Raised Pigeons ralph 17 2.0 2.609438 44.360445 0.035586
4189 01: The Girl Who Raised Pigeons miles 21 4.0 2.098612 44.070858 0.035354
5052 01: The Girl Who Raised Pigeons pigeon 14 1.0 3.014903 42.208642 0.033860
6837 01: The Girl Who Raised Pigeons thelma 14 1.0 3.014903 42.208642 0.033860
4340 02: The First Day mother 42 13.0 1.068993 44.897701 0.036017
7772 02: The First Day woman 23 14.0 1.000000 23.000000 0.018451
8659 02: The First Day takes 7 2.0 2.609438 18.266065 0.014653
3895 02: The First Day looks 6 2.0 2.609438 15.656627 0.012560
8568 02: The First Day says 7 5.0 1.916291 13.414035 0.010761
8210 02: The First Day form 6 4.0 2.098612 12.591674 0.010101
5701 02: The First Day school 10 11.0 1.223144 12.231436 0.009812
8575 02: The First Day seaton 4 1.0 3.014903 12.059612 0.009674
4757 02: The First Day one 12 14.0 1.000000 12.000000 0.009626
8268 02: The First Day jersey 5 3.0 2.321756 11.608779 0.009313
8667 02: The First Day tells 4 3.0 2.321756 9.287023 0.007450
8036 02: The First Day appears 3 1.0 3.014903 9.044709 0.007256
8046 02: The First Day asks 3 1.0 3.014903 9.044709 0.007256
8089 02: The First Day blondelle 3 1.0 3.014903 9.044709 0.007256
8358 02: The First Day mary 3 1.0 3.014903 9.044709 0.007256
8974 03: The Night Rhonda Ferguson Was Killed cassandra 130 1.0 3.014903 391.937393 0.314415
9693 03: The Night Rhonda Ferguson Was Killed melanie 65 1.0 3.014903 195.968696 0.157207
8778 03: The Night Rhonda Ferguson Was Killed anita 68 2.0 2.609438 177.441778 0.142345
10002 03: The Night Rhonda Ferguson Was Killed rhonda 42 1.0 3.014903 126.625927 0.101580
5623 03: The Night Rhonda Ferguson Was Killed said 109 13.0 1.068993 116.520223 0.093473
9386 03: The Night Rhonda Ferguson Was Killed gladys 38 2.0 2.609438 99.158641 0.079546
8944 03: The Night Rhonda Ferguson Was Killed car 42 11.0 1.223144 51.372029 0.041211
6510 03: The Night Rhonda Ferguson Was Killed street 42 13.0 1.068993 44.897701 0.036017
453 03: The Night Rhonda Ferguson Was Killed back 39 14.0 1.000000 39.000000 0.031286
2602 03: The Night Rhonda Ferguson Was Killed get 36 13.0 1.068993 38.483743 0.030872
2646 03: The Night Rhonda Ferguson Was Killed girls 17 4.0 2.098612 35.676409 0.028620
2204 03: The Night Rhonda Ferguson Was Killed father 32 13.0 1.068993 34.207772 0.027442
10481 03: The Night Rhonda Ferguson Was Killed wesley 13 2.0 2.609438 33.922693 0.027213
9571 03: The Night Rhonda Ferguson Was Killed joyce 12 2.0 2.609438 31.313255 0.025120
9833 03: The Night Rhonda Ferguson Was Killed pearl 11 2.0 2.609438 28.703817 0.023026
10696 04: Young Lions caesar 75 1.0 3.014903 226.117727 0.181393
11527 04: Young Lions sherman 60 1.0 3.014903 180.894181 0.145114
11198 04: Young Lions manny 44 1.0 3.014903 132.655733 0.106417
10701 04: Young Lions carol 29 1.0 3.014903 87.432188 0.070139
5624 04: Young Lions said 71 13.0 1.068993 75.898494 0.060886
11452 04: Young Lions retarded 22 2.0 2.609438 57.407634 0.046053
7890 04: Young Lions would 57 14.0 1.000000 57.000000 0.045726
11016 04: Young Lions heh 17 1.0 3.014903 51.253351 0.041116
7774 04: Young Lions woman 44 14.0 1.000000 44.000000 0.035297
10568 04: Young Lions anna 13 1.0 3.014903 39.193739 0.031441
4006 04: Young Lions man 34 13.0 1.068993 36.345758 0.029157
454 04: Young Lions back 35 14.0 1.000000 35.000000 0.028077
1401 04: Young Lions could 29 13.0 1.068993 31.000793 0.024869
7199 04: Young Lions two 30 14.0 1.000000 30.000000 0.024066
2205 04: Young Lions father 28 13.0 1.068993 29.931800 0.024011
12577 05: The Store penny 57 2.0 2.609438 148.737961 0.119319
5625 05: The Store said 79 13.0 1.068993 84.450437 0.067747
7891 05: The Store would 79 14.0 1.000000 79.000000 0.063374
6479 05: The Store store 51 10.0 1.310155 66.817901 0.053602
12353 05: The Store jenkins 19 1.0 3.014903 57.283157 0.045953
12375 05: The Store kentucky 23 3.0 2.321756 53.400384 0.042838
12427 05: The Store lonney 17 1.0 3.014903 51.253351 0.041116
455 05: The Store back 50 14.0 1.000000 50.000000 0.040110
4760 05: The Store one 48 14.0 1.000000 48.000000 0.038506
4343 05: The Store mother 42 13.0 1.068993 44.897701 0.036017
6969 05: The Store time 42 14.0 1.000000 42.000000 0.033693
1552 05: The Store day 40 14.0 1.000000 40.000000 0.032088
1402 05: The Store could 34 13.0 1.068993 36.345758 0.029157
2206 05: The Store father 32 13.0 1.068993 34.207772 0.027442
11865 05: The Store baxter 11 1.0 3.014903 33.163933 0.026604
13144 06: An Orange Line Train to Ballston marcus 38 1.0 3.014903 114.566315 0.091906
13046 06: An Orange Line Train to Ballston avis 26 1.0 3.014903 78.387479 0.062883
13146 06: An Orange Line Train to Ballston marvin 24 1.0 3.014903 72.357672 0.058046
13145 06: An Orange Line Train to Ballston marvella 23 1.0 3.014903 69.342769 0.055627
5626 06: An Orange Line Train to Ballston said 64 13.0 1.068993 68.415544 0.054883
4008 06: An Orange Line Train to Ballston man 63 13.0 1.068993 67.346551 0.054026
7084 06: An Orange Line Train to Ballston train 25 5.0 1.916291 47.907268 0.038432
13216 06: An Orange Line Train to Ballston subway 15 1.0 3.014903 45.223545 0.036279
13087 06: An Orange Line Train to Ballston dreadlocks 11 1.0 3.014903 33.163933 0.026604
13086 06: An Orange Line Train to Ballston dreadlock 8 1.0 3.014903 24.119224 0.019349
3765 06: An Orange Line Train to Ballston line 18 10.0 1.310155 23.582789 0.018918
4800 06: An Orange Line Train to Ballston orange 11 4.0 2.098612 23.084735 0.018519
4344 06: An Orange Line Train to Ballston mother 17 13.0 1.068993 18.172879 0.014578
13050 06: An Orange Line Train to Ballston ballston 6 1.0 3.014903 18.089418 0.014511
3741 06: An Orange Line Train to Ballston like 17 14.0 1.000000 17.000000 0.013638
13627 07: The Sunday Following Mother's Day madeleine 74 1.0 3.014903 223.102824 0.178974
13626 07: The Sunday Following Mother's Day maddie 62 1.0 3.014903 186.923987 0.149952
13770 07: The Sunday Following Mother's Day samuel 41 1.0 3.014903 123.611024 0.099162
13769 07: The Sunday Following Mother's Day sam 34 1.0 3.014903 102.506703 0.082232
5627 07: The Sunday Following Mother's Day said 81 13.0 1.068993 86.588423 0.069462
7893 07: The Sunday Following Mother's Day would 71 14.0 1.000000 71.000000 0.056957
13711 07: The Sunday Following Mother's Day pookie 21 1.0 3.014903 63.312963 0.050790
9093 07: The Sunday Following Mother's Day curtis 16 2.0 2.609438 41.751007 0.033493
13281 07: The Sunday Following Mother's Day arnisa 13 1.0 3.014903 39.193739 0.031441
13893 07: The Sunday Following Mother's Day williams 12 1.0 3.014903 36.178836 0.029023
1404 07: The Sunday Following Mother's Day could 33 13.0 1.068993 35.276765 0.028299
1554 07: The Sunday Following Mother's Day day 35 14.0 1.000000 35.000000 0.028077
4009 07: The Sunday Following Mother's Day man 32 13.0 1.068993 34.207772 0.027442
457 07: The Sunday Following Mother's Day back 33 14.0 1.000000 33.000000 0.026473
4762 07: The Sunday Following Mother's Day one 32 14.0 1.000000 32.000000 0.025671
14079 08: Lost in the City lydia 32 1.0 3.014903 96.476897 0.077394
5628 08: Lost in the City said 46 13.0 1.068993 49.173672 0.039447
4346 08: Lost in the City mother 44 13.0 1.068993 47.035686 0.037732
10969 08: Lost in the City georgia 19 3.0 2.321756 44.113361 0.035388
972 08: Lost in the City cab 13 5.0 1.916291 24.911780 0.019984
9157 08: Lost in the City driver 8 5.0 1.916291 15.330326 0.012298
13908 08: Lost in the City antibes 5 1.0 3.014903 15.074515 0.012093
13976 08: Lost in the City dreaming 5 1.0 3.014903 15.074515 0.012093
3455 08: Lost in the City know 15 14.0 1.000000 15.000000 0.012033
7894 08: Lost in the City would 15 14.0 1.000000 15.000000 0.012033
2607 08: Lost in the City get 14 13.0 1.068993 14.965900 0.012006
4010 08: Lost in the City man 14 13.0 1.068993 14.965900 0.012006
6925 08: Lost in the City thought 14 14.0 1.000000 14.000000 0.011231
10189 08: Lost in the City sorry 9 8.0 1.510826 13.597431 0.010908
4763 08: Lost in the City one 13 14.0 1.000000 13.000000 0.010429
9572 09: His Mother's House joyce 84 2.0 2.609438 219.192785 0.175838
14513 09: His Mother's House rickey 64 1.0 3.014903 192.953793 0.154789
14529 09: His Mother's House santiago 54 1.0 3.014903 162.804763 0.130603
5629 09: His Mother's House said 96 13.0 1.068993 102.623316 0.082325
14384 09: His Mother's House humphrey 33 1.0 3.014903 99.491800 0.079813
9834 09: His Mother's House pearl 22 2.0 2.609438 57.407634 0.046053
14527 09: His Mother's House sandy 18 1.0 3.014903 54.268254 0.043534
4764 09: His Mother's House one 50 14.0 1.000000 50.000000 0.040110
7895 09: His Mother's House would 49 14.0 1.000000 49.000000 0.039308
14577 09: His Mother's House smokey 16 1.0 3.014903 48.238448 0.038697
3172 09: His Mother's House house 35 12.0 1.143101 40.008530 0.032095
3744 09: His Mother's House like 38 14.0 1.000000 38.000000 0.030484
4347 09: His Mother's House mother 35 13.0 1.068993 37.414751 0.030014
6671 09: His Mother's House table 29 11.0 1.223144 35.471163 0.028455
7779 09: His Mother's House woman 35 14.0 1.000000 35.000000 0.028077
9700 10: A Butterfly on F Street mildred 27 2.0 2.609438 70.454824 0.056519
7780 10: A Butterfly on F Street woman 29 14.0 1.000000 29.000000 0.023264
14675 10: A Butterfly on F Street butterfly 7 1.0 3.014903 21.104321 0.016930
14710 10: A Butterfly on F Street mansfield 6 1.0 3.014903 18.089418 0.014511
5630 10: A Butterfly on F Street said 16 13.0 1.068993 17.103886 0.013721
6517 10: A Butterfly on F Street street 14 13.0 1.068993 14.965900 0.012006
14713 10: A Butterfly on F Street median 4 1.0 3.014903 12.059612 0.009674
14750 10: A Butterfly on F Street woolworth 4 1.0 3.014903 12.059612 0.009674
460 10: A Butterfly on F Street back 10 14.0 1.000000 10.000000 0.008022
5672 10: A Butterfly on F Street say 9 13.0 1.068993 9.620936 0.007718
14714 10: A Butterfly on F Street morton 3 1.0 3.014903 9.044709 0.007256
9258 10: A Butterfly on F Street f 5 6.0 1.762140 8.810700 0.007068
1557 10: A Butterfly on F Street day 8 14.0 1.000000 8.000000 0.006418
7896 10: A Butterfly on F Street would 8 14.0 1.000000 8.000000 0.006418
9387 10: A Butterfly on F Street gladys 3 2.0 2.609438 7.828314 0.006280
15094 11: Gospel vivian 68 1.0 3.014903 205.013405 0.164463
14845 11: Gospel diane 42 1.0 3.014903 126.625927 0.101580
14943 11: Gospel maude 32 1.0 3.014903 96.476897 0.077394
5631 11: Gospel said 77 13.0 1.068993 82.312451 0.066032
8779 11: Gospel anita 25 2.0 2.609438 65.235948 0.052333
15005 11: Gospel reverend 18 2.0 2.609438 46.969882 0.037680
14895 11: Gospel gospelteers 15 1.0 3.014903 45.223545 0.036279
14938 11: Gospel mae 15 1.0 3.014903 45.223545 0.036279
2845 11: Gospel group 26 7.0 1.628609 42.343825 0.033968
7897 11: Gospel would 42 14.0 1.000000 42.000000 0.033693
1408 11: Gospel could 39 13.0 1.068993 41.690722 0.033445
1176 11: Gospel church 26 9.0 1.405465 36.542093 0.029314
14835 11: Gospel counsel 12 1.0 3.014903 36.178836 0.029023
7795 11: Gospel women 31 12.0 1.143101 35.436126 0.028427
4013 11: Gospel man 31 13.0 1.068993 33.138779 0.026584
15347 12: A New Man woodrow 57 1.0 3.014903 171.849472 0.137859
15280 12: A New Man rita 22 1.0 3.014903 66.327866 0.053209
7898 12: A New Man would 37 14.0 1.000000 37.000000 0.029682
5632 12: A New Man said 33 13.0 1.068993 35.276765 0.028299
4014 12: A New Man man 27 13.0 1.068993 28.862808 0.023154
15185 12: A New Man elaine 9 1.0 3.014903 27.134127 0.021767
1409 12: A New Man could 23 13.0 1.068993 24.586836 0.019724
2213 12: A New Man father 22 13.0 1.068993 23.517843 0.018866
4767 12: A New Man one 23 14.0 1.000000 23.000000 0.018451
15161 12: A New Man cunningham 7 1.0 3.014903 21.104321 0.016930
1547 12: A New Man daughter 16 10.0 1.310155 20.962479 0.016816
8531 12: A New Man read 12 7.0 1.628609 19.543304 0.015678
2029 12: A New Man even 16 13.0 1.068993 17.103886 0.013721
4732 12: A New Man old 16 13.0 1.068993 17.103886 0.013721
5300 12: A New Man put 16 13.0 1.068993 17.103886 0.013721
15424 13: A Dark Night garrett 38 1.0 3.014903 114.566315 0.091906
15361 13: A Dark Night beatrice 31 1.0 3.014903 93.461994 0.074976
15377 13: A Dark Night carmena 20 1.0 3.014903 60.298060 0.048371
5633 13: A Dark Night said 49 13.0 1.068993 52.380651 0.042020
10438 13: A Dark Night uncle 14 2.0 2.609438 36.532131 0.029306
15517 13: A Dark Night thunder 12 1.0 3.014903 36.178836 0.029023
15433 13: A Dark Night henry 10 1.0 3.014903 30.149030 0.024186
1737 13: A Dark Night door 27 14.0 1.000000 27.000000 0.021660
1516 13: A Dark Night daddy 17 8.0 1.510826 25.684036 0.020604
15442 13: A Dark Night joe 8 1.0 3.014903 24.119224 0.019349
4768 13: A Dark Night one 23 14.0 1.000000 23.000000 0.018451
15364 13: A Dark Night boone 7 1.0 3.014903 21.104321 0.016930
15451 13: A Dark Night lightning 7 1.0 3.014903 21.104321 0.016930
7899 13: A Dark Night would 21 14.0 1.000000 21.000000 0.016846
14055 13: A Dark Night john 10 4.0 2.098612 20.986123 0.016835
13633 14: Marie marie 49 2.0 2.609438 127.862458 0.102572
15719 14: Marie vernelle 30 1.0 3.014903 90.447091 0.072557
15722 14: Marie wilamena 20 1.0 3.014903 60.298060 0.048371
5634 14: Marie said 51 13.0 1.068993 54.518636 0.043735
7900 14: Marie would 36 14.0 1.000000 36.000000 0.028879
4016 14: Marie man 32 13.0 1.068993 34.207772 0.027442
11492 14: Marie security 13 2.0 2.609438 33.922693 0.027213
7784 14: Marie woman 31 14.0 1.000000 31.000000 0.024868
15662 14: Marie receptionist 10 1.0 3.014903 30.149030 0.024186
7027 14: Marie told 25 13.0 1.068993 26.724822 0.021439
11578 14: Marie social 10 2.0 2.609438 26.094379 0.020933
4769 14: Marie one 25 14.0 1.000000 25.000000 0.020055
15552 14: Marie calhoun 8 1.0 3.014903 24.119224 0.019349
15723 14: Marie wise 8 1.0 3.014903 24.119224 0.019349
15037 14: Marie smith 9 2.0 2.609438 23.484941 0.018840

It turns out that “pigeons” is fairly distinctive to the first short story in *Lost in the City*: with a normalized tf-idf score of .062, it is one of the most distinctive words in that story, along with “coop” and “birds.”
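The idf values in the table above are consistent with scikit-learn's smoothed formula, idf = ln((1 + N) / (1 + df)) + 1, where N is the number of documents (here, 14 stories) and df is the number of stories a word appears in. A minimal sketch of that calculation (the function names here are just for illustration):

```python
import math

def smooth_idf(df, n_docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(count, df, n_docs=14):
    """Raw tf-idf score: term count multiplied by the smoothed idf."""
    return count * smooth_idf(df, n_docs)

# “lydia” appears 32 times in “Lost in the City” and in only 1 of the 14 stories
print(round(tfidf(32, 1), 6))  # ≈ 96.476897, matching the table
```

Note how the idf term rewards rarity: a word that appears in only one story (like “lydia”) gets an idf of about 3.01, while a word that appears in all fourteen stories (like “one” or “would”) gets an idf of exactly 1, so its tf-idf score is simply its raw count.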

What are some other distinctive words in Lost in the City?

Further Resources