TF-IDF with HathiTrust Data#
In this lesson, we’re going to learn about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf.
While calculating the most frequent words in a text can be useful, the most frequent words usually aren’t the most interesting ones, even if we get rid of stop words (“the,” “and,” “to,” etc.). Tf-idf builds on word frequency, but it more specifically tries to identify the most distinctively frequent or significant words in a document.
In this lesson, we will cover how to:
Calculate and normalize tf-idf scores for each short story in Edward P. Jones’s Lost in the City
Download and process HathiTrust extracted features — that is, word frequencies for books in the HathiTrust Digital Library (including in-copyright books like Lost in the City)
Prepare HathiTrust extracted features for tf-idf analysis
Dataset#
Lost in the City by Edward P. Jones#
[T]he pigeon had taken a step and dropped from the ledge. He caught an upwind that took him nearly as high as the tops of the empty K Street houses. He flew farther into Northeast, into the color and sounds of the city’s morning. She did nothing, aside from following him, with her eyes, with her heart, as far as she could.
—Edward P. Jones, "The Girl Who Raised Pigeons," Lost in the City (1993)
Edward P. Jones’s Lost in the City (1993) is a collection of 14 short stories set in Washington, D.C. The first short story, “The Girl Who Raised Pigeons,” begins with a young girl raising homing pigeons on her roof.
How distinctive is a “pigeon” in the world of Lost in the City? What does this uniqueness (or lack thereof) tell us about the meaning of pigeons in the first short story, “The Girl Who Raised Pigeons,” and in the collection as a whole? These are just a few of the questions that we’re going to try to answer with tf-idf.
If you already have a collection of plain text (.txt) files that you’d like to analyze, one of the easiest ways to calculate tf-idf scores is to use the Python library scikit-learn. It has a quick and nifty class called TfidfVectorizer, which does all the math for you behind the scenes. We will cover how to use the TfidfVectorizer in the next lesson.
In this lesson, however, we’re going to calculate tf-idf scores manually because Lost in the City is still in-copyright, which means that, for legal reasons, we can’t easily share or access plain text files of the book.
Luckily, the HathiTrust Digital Library—which contains digitized books from Google Books as well as many university libraries—has released word frequencies per page for all 17 million books in its catalog. These word frequencies (plus part of speech tags) are otherwise known as “extracted features.” There’s a lot of text analysis that we can do with extracted features alone, including tf-idf.
So to calculate tf-idf scores for Lost in the City, we’re going to use HathiTrust extracted features. That’s why we’re not using scikit-learn’s TfidfVectorizer: it works great with plain text files but not so great with extracted features.
Breaking Down the TF-IDF Formula#
But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.
tf-idf = term_frequency * inverse_document_frequency
term_frequency = number of times a given term appears in a document
inverse_document_frequency = log(total number of documents / number of documents with term) + 1*
You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).
The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories (14 / 13 -> a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 14 stories (total documents) (14 / 2 -> a bigger inverse document frequency, a bigger tf-idf boost).
*There are a bunch of slightly different ways to calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing,” meaning that it adds a 1 to both the numerator and the denominator:
inverse_document_frequency = log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1
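To make the formula concrete before we plug in real numbers, here’s a minimal Python sketch of the smoothed version (the tfidf() helper function is our own convenience, not part of any library):
import numpy as np

def tfidf(term_frequency, document_frequency, total_number_of_documents):
    #Smoothed inverse document frequency (the scikit-learn default formula)
    inverse_document_frequency = np.log((1 + total_number_of_documents) / (1 + document_frequency)) + 1
    return term_frequency * inverse_document_frequency
Calling tfidf(47, 13, 14), for example, reproduces the “said” score that we calculate step by step below.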
Let’s test it out#
We need the log() function, otherwise known as the logarithm, for our calculation, so we’re going to import the numpy package.
import numpy as np
“said”
total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 13 ##number of short stories that contain the word "said"
term_frequency = 47 ##number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
term_frequency * inverse_document_frequency
50.24266495988672
“pigeons”
total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 2 ##number of short stories that contain the word "pigeons"
term_frequency = 30 ##number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
term_frequency * inverse_document_frequency
78.28313737302301
tf-idf scores for “The Girl Who Raised Pigeons”
“said” = 50.24
“pigeons” = 78.28
Though the word “said” appears 47 times in “The Girl Who Raised Pigeons” and the word “pigeons” appears only 30 times, “pigeons” has a higher tf-idf score than “said” because it’s a rarer word: “pigeons” appears in just 2 of the collection’s 14 stories, while “said” appears in 13 of 14, almost all of them.
Get HathiTrust Extracted Features#
Now let’s try to calculate tf-idf scores for all the words in all the short stories in Lost in the City. To do so, we need word counts, or HathiTrust extracted features, for each story in the collection.
To work with HathiTrust’s extracted features, we first need to install and import the HathiTrust Feature Reader.
Install HathiTrust Feature Reader
!pip install htrc-feature-reader
Import necessary libraries
from htrc_features import Volume
import pandas as pd
Pandas Review
Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!
Then we need to locate the HathiTrust volume ID for Lost in the City. If we search the HathiTrust catalog for this book and then click on “Limited (search only),” it will take us to the following web page: https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129.
The HathiTrust volume ID for Lost in the City is located after id= in this URL: mdp.39015029970129.
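If you’d rather not copy the ID out by hand, here’s a small sketch that pulls the id parameter out of a HathiTrust URL with Python’s standard library:
from urllib.parse import urlparse, parse_qs

url = 'https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129'
#parse_qs returns a dictionary of query parameters; 'id' holds the volume ID
parse_qs(urlparse(url).query)['id'][0]
'mdp.39015029970129'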
Make DataFrame of Word Frequencies From Volume(s)#
Single Volume#
To get HathiTrust extracted features for a single volume, we can create a [Volume object](https://github.com/htrc/htrc-feature-reader) and use the .tokenlist() method.
Volume('mdp.39015029970129').tokenlist()
page | section | token | pos | count
---|---|---|---|---
1 | body | , | , | 1
 | | .046 | CD | 1
 | | 1993 | CD | 1
 | | 3560 | CD | 1
 | | AWARD | NN | 1
... | ... | ... | ... | ...
260 | body | world | NN | 2
 | | would | MD | 1
 | | writers | NNS | 1
 | | written | VBN | 1
 | | • | SYM | 1
51297 rows × 1 columns
For each page in Lost in the City, this DataFrame displays the page number and section type, as well as every word/token that appears on the page, its part of speech, and the number of times that word/token occurs on the page. As you can see, there are 51,297 rows in this DataFrame: one for each unique page/token/part-of-speech combination.
Let’s look at a sample of just 20 words from page 11.
Volume('mdp.39015029970129').tokenlist()[500:520]
page | section | token | pos | count
---|---|---|---|---
11 | body | out | RP | 1
 | | over | IN | 1
 | | part | NN | 1
 | | past | IN | 1
 | | pee | VB | 1
 | | pigeon | NN | 1
 | | pigeons | NNS | 1
 | | reach | VB | 1
 | | remained | VBD | 1
 | | roof | NN | 1
 | | room | NN | 2
 | | say | VB | 1
 | | seemed | VBD | 1
 | | set | VBN | 1
 | | share | VB | 1
 | | she | PRP | 7
 | | silenccBometimes | NNS | 1
 | | silence | NN | 1
 | | simple | JJ | 1
 | | slats | NNS | 1
We can also get metadata for a HathiTrust volume by asking for certain attributes.
Volume('mdp.39015029970129').year
1993
Volume('mdp.39015029970129').page_count
260
Volume('mdp.39015029970129').publisher
'HarperPerennial'
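Note that each Volume(...) call above re-reads the volume’s extracted features, so if you’re inspecting several attributes, it’s handy to store the volume in a variable first:
volume = Volume('mdp.39015029970129')
volume.title, volume.year, volume.page_count, volume.publisher
('Lost in the city : stories /', 1993, 260, 'HarperPerennial')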
Multiple Volumes#
We might want to get extracted features for multiple volumes at the same time, so we’re also going to practice a workflow that will allow us to read in multiple HathiTrust books, even though we’re only reading in one book at this moment.
Insert list of desired HathiTrust volume(s)
volume_ids = ['mdp.39015029970129']
Loop through this list of volume IDs and make a DataFrame that includes extracted features, book title, and publication year, then make a list of all DataFrames.
all_tokens = []

for hathi_id in volume_ids:
    #Read in HathiTrust volume
    volume = Volume(hathi_id)
    #Make dataframe from token list -- do not include part of speech, sections, or case sensitivity
    token_df = volume.tokenlist(case=False, pos=False, drop_section=True)
    #Add book column
    token_df['book'] = volume.title
    #Add publication year column
    token_df['year'] = volume.year
    all_tokens.append(token_df)
Concatenate the list of DataFrames
lost_df = pd.concat(all_tokens)
Preview the DataFrame
lost_df
page | lowercase | count | book | year
---|---|---|---|---
1 | , | 1 | Lost in the city : stories / | 1993
 | .046 | 1 | Lost in the city : stories / | 1993
 | 1993 | 1 | Lost in the city : stories / | 1993
 | 3560 | 1 | Lost in the city : stories / | 1993
 | a | 1 | Lost in the city : stories / | 1993
... | ... | ... | ... | ...
260 | would | 1 | Lost in the city : stories / | 1993
 | writers | 1 | Lost in the city : stories / | 1993
 | written | 1 | Lost in the city : stories / | 1993
 | york | 1 | Lost in the city : stories / | 1993
 | • | 1 | Lost in the city : stories / | 1993
47307 rows × 3 columns
Change from multi-level index to regular index with reset_index()
lost_df_flattened = lost_df.reset_index()
lost_df_flattened
 | page | lowercase | count | book | year
---|---|---|---|---|---
0 | 1 | , | 1 | Lost in the city : stories / | 1993 |
1 | 1 | .046 | 1 | Lost in the city : stories / | 1993 |
2 | 1 | 1993 | 1 | Lost in the city : stories / | 1993 |
3 | 1 | 3560 | 1 | Lost in the city : stories / | 1993 |
4 | 1 | a | 1 | Lost in the city : stories / | 1993 |
... | ... | ... | ... | ... | ... |
47302 | 260 | would | 1 | Lost in the city : stories / | 1993 |
47303 | 260 | writers | 1 | Lost in the city : stories / | 1993 |
47304 | 260 | written | 1 | Lost in the city : stories / | 1993 |
47305 | 260 | york | 1 | Lost in the city : stories / | 1993 |
47306 | 260 | • | 1 | Lost in the city : stories / | 1993 |
47307 rows × 5 columns
Nice! We now have a DataFrame of word counts per page for Lost in the City.
But what we need to move forward with tf-idf is a way of splitting this collection into its individual stories. Remember: to use tf-idf, we need a collection of texts because we need to compare word frequency for one document with all the other documents in the collection.
Add story titles#
How can we split up Lost in the City into individual stories?
Sometimes HathiTrust Extracted Features helpfully include “section” information for a book, such as chapter titles. Unfortunately, the extracted features for Lost in the City do not include chapter or story titles.
They do, however, include page numbers and, if you specify volume.tokenlist(case=True), words with case sensitivity. When I manually combed through the HTRC token list with case sensitivity turned on, I noticed that the title page for each short story seemed to format the title in all caps. So I searched for all-caps words from each story title and noted down the corresponding page numbers. This should give us a marker of where every story begins and ends.
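Here’s a rough sketch of how you might hunt for those title pages yourself; the all-caps search term "PIGEONS" is just one illustrative choice:
#Get a case-sensitive token list (one row per page/token pair)
cased_tokens = Volume('mdp.39015029970129').tokenlist(case=True, pos=False, drop_section=True).reset_index()
#Pages where the all-caps token "PIGEONS" appears -- the story's title page should be among them
cased_tokens[cased_tokens['token'] == 'PIGEONS']['page'].unique()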
The function below will match each page number to the correct Lost in the City story title.
def add_story_titles(page):
    if page >= 0 and page < 11:
        return "Front Matter"
    elif page >= 11 and page < 35:
        return "01: The Girl Who Raised Pigeons"
    elif page >= 35 and page < 41:
        return "02: The First Day"
    elif page >= 41 and page < 63:
        return "03: The Night Rhonda Ferguson Was Killed"
    elif page >= 63 and page < 85:
        return "04: Young Lions"
    elif page >= 85 and page < 113:
        return "05: The Store"
    elif page >= 113 and page < 125:
        return "06: An Orange Line Train to Ballston"
    elif page >= 125 and page < 149:
        return "07: The Sunday Following Mother's Day"
    elif page >= 149 and page < 159:
        return "08: Lost in the City"
    elif page >= 159 and page < 184:
        return "09: His Mother's House"
    elif page >= 184 and page < 191:
        return "10: A Butterfly on F Street"
    elif page >= 191 and page < 209:
        return "11: Gospel"
    elif page >= 209 and page < 225:
        return "12: A New Man"
    elif page >= 225 and page < 237:
        return "13: A Dark Night"
    elif page >= 237 and page <= 252:
        return "14: Marie"
    elif page > 252:
        return "Back Matter"
Below we add a new column of story titles to the DataFrame by apply()ing our function to the “page” column and dumping the results into lost_df_flattened['story']. You can read more about applying functions in “Pandas Basics - Part 3”.
lost_df_flattened['story'] = lost_df_flattened['page'].apply(add_story_titles)
We’re also going to drop the “Front Matter” and “Back Matter” from the DataFrame.
lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Front Matter'].index)
lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Back Matter'].index)
Sum Word Counts For Each Story#
Page-level information is great. But for tf-idf purposes, we really only care about the frequency of words for every story. Below we group by story and calculate the sum of word frequencies for all the pages in that story.
lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()
 | story | lowercase | count
---|---|---|---
0 | 01: The Girl Who Raised Pigeons | ! | 8 |
1 | 01: The Girl Who Raised Pigeons | ' | 4 |
2 | 01: The Girl Who Raised Pigeons | '' | 111 |
3 | 01: The Girl Who Raised Pigeons | 'd | 1 |
4 | 01: The Girl Who Raised Pigeons | 'll | 5 |
... | ... | ... | ... |
18082 | 14: Marie | yet | 1 |
18083 | 14: Marie | you | 39 |
18084 | 14: Marie | you-know-who | 1 |
18085 | 14: Marie | young | 8 |
18086 | 14: Marie | your | 8 |
18087 rows × 3 columns
Notice how the “page” column no longer exists in the DataFrame, and our rows have slimmed down from more than 47,000 to roughly 18,000.
word_frequency_df = lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()
Remove Infrequent Words, Stopwords, & Punctuation#
Before calculating tf-idf scores, we need some final pre-processing steps. First, we will remove the stopwords in the list defined below.
Make list of stopwords
STOPS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp', "!"]
Remove stopwords
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].isin(STOPS)].index)
We will also remove punctuation and numbers by using the regular expression [^A-Za-z\s], which matches any character that is not a letter or whitespace; we drop every row whose term contains such a character.
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].str.contains(r'[^A-Za-z\s]', regex=True)].index)
#Optionally, remove words that appear 5 or fewer times in a book
#word_frequency_df_test = word_frequency_df[word_frequency_df['count'] > 5]
word_frequency_df
 | story | lowercase | count
---|---|---|---
36 | 01: The Girl Who Raised Pigeons | abandoned | 2 |
37 | 01: The Girl Who Raised Pigeons | able | 2 |
40 | 01: The Girl Who Raised Pigeons | absently | 1 |
41 | 01: The Girl Who Raised Pigeons | absolute | 1 |
42 | 01: The Girl Who Raised Pigeons | accepted | 1 |
... | ... | ... | ... |
18079 | 14: Marie | years | 10 |
18080 | 14: Marie | yes | 2 |
18081 | 14: Marie | yesterday | 2 |
18082 | 14: Marie | yet | 1 |
18085 | 14: Marie | young | 8 |
15726 rows × 3 columns
TF-IDF#
Term Frequency#
We already have term frequencies for each document. Let’s rename the columns so that they’re consistent with the tf-idf vocabulary that we’ve been using.
word_frequency_df = word_frequency_df.rename(columns={'lowercase': 'term','count': 'term_frequency'})
word_frequency_df
 | story | term | term_frequency
---|---|---|---
36 | 01: The Girl Who Raised Pigeons | abandoned | 2 |
37 | 01: The Girl Who Raised Pigeons | able | 2 |
40 | 01: The Girl Who Raised Pigeons | absently | 1 |
41 | 01: The Girl Who Raised Pigeons | absolute | 1 |
42 | 01: The Girl Who Raised Pigeons | accepted | 1 |
... | ... | ... | ... |
18079 | 14: Marie | years | 10 |
18080 | 14: Marie | yes | 2 |
18081 | 14: Marie | yesterday | 2 |
18082 | 14: Marie | yet | 1 |
18085 | 14: Marie | young | 8 |
15726 rows × 3 columns
Document Frequency#
To calculate the number of documents or stories in which each term appears, we’re going to create a separate DataFrame and do some Pandas manipulation and calculation.
document_frequency_df = (word_frequency_df.groupby(['story','term']).size().unstack()).sum().reset_index()
If you inspect parts of the complex chain of Pandas methods above (which is always a great way to learn!), you will see that we’re momentarily reshaping the DataFrame to see if each term appears in each story…
word_frequency_df.groupby(['story','term']).size().unstack()
story | abandoned | abhored | abide | ability | able | abomination | aboum | aboutfcfteen | abqu | absently | ... | ypu | yr | ysirs | ythe | yuddini | zigzagging | zion | zipped | zippers | zoo
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
01: The Girl Who Raised Pigeons | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | ... | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
02: The First Day | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
03: The Night Rhonda Ferguson Was Killed | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
04: Young Lions | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
05: The Store | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | NaN | 1.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
06: An Orange Line Train to Ballston | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
07: The Sunday Following Mother's Day | 1.0 | 1.0 | 1.0 | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
08: Lost in the City | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
09: His Mother's House | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
10: A Butterfly on F Street | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
11: Gospel | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN |
12: A New Man | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN |
13: A Dark Night | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14: Marie | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 rows × 6207 columns
Then we’re adding up how many stories each term appears in (.sum()) and resetting the index (.reset_index()) to make a DataFrame.
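As a quick sanity check, we can also compute document frequency more directly by counting the number of unique stories in which each term appears:
#Should match document_frequency_df: the number of distinct stories per term
word_frequency_df.groupby('term')['story'].nunique()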
Finally, we will rename the column in this DataFrame and merge it into our word frequency DataFrame.
document_frequency_df = document_frequency_df.rename(columns={0:'document_frequency'})
word_frequency_df = word_frequency_df.merge(document_frequency_df)
Now we have term frequency and document frequency.
word_frequency_df
 | story | term | term_frequency | document_frequency
---|---|---|---|---
0 | 01: The Girl Who Raised Pigeons | abandoned | 2 | 3.0 |
1 | 07: The Sunday Following Mother's Day | abandoned | 1 | 3.0 |
2 | 09: His Mother's House | abandoned | 1 | 3.0 |
3 | 01: The Girl Who Raised Pigeons | able | 2 | 12.0 |
4 | 03: The Night Rhonda Ferguson Was Killed | able | 3 | 12.0 |
... | ... | ... | ... | ... |
15721 | 14: Marie | whim | 1 | 1.0 |
15722 | 14: Marie | wilamena | 20 | 1.0 |
15723 | 14: Marie | wise | 8 | 1.0 |
15724 | 14: Marie | womanish | 1 | 1.0 |
15725 | 14: Marie | worships | 1 | 1.0 |
15726 rows × 4 columns
As you can see in the DataFrame above, the term “abandoned” appears 2 times in the story “The Girl Who Raised Pigeons” (term frequency), and it appears in 3 different stories in the collection overall (document frequency).
Total Number of Documents#
To calculate the total number of documents in the collection, we count how many unique values are in the “story” column (we know the answer should be 14 short stories).
total_number_of_documents = lost_df_flattened['story'].nunique()
total_number_of_documents
14
Inverse Document Frequency#
As we previously established, there are a lot of slightly different versions of the tf-idf formula, but we’re going to use the default version from the scikit-learn library that adds “smoothing” to inverse document frequency.
inverse_document_frequency = log [ (1 + total number of docs) / (1 + document frequency) ] + 1
import numpy as np
word_frequency_df['idf'] = np.log((1 + total_number_of_documents) / (1 + word_frequency_df['document_frequency'])) + 1
TF-IDF#
Finally, we will calculate tf-idf by multiplying term frequency and inverse document frequency together.
word_frequency_df['tfidf'] = word_frequency_df['term_frequency'] * word_frequency_df['idf']
Then we will normalize these values with the scikit-learn library.
from sklearn import preprocessing
word_frequency_df['tfidf_normalized'] = preprocessing.normalize(word_frequency_df[['tfidf']], axis=0, norm='l2')
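L2 normalization divides every tf-idf score in the column by the square root of the sum of all the squared scores, rescaling the column to unit length without changing the relative ranking of any word. One quick way to confirm:
#The L2 norm of the normalized column should be 1.0
np.sqrt((word_frequency_df['tfidf_normalized'] ** 2).sum())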
We did it! Now let’s inspect the top 15 words with the highest tf-idf scores for each story in the collection. Below we sort by story and then by descending normalized tf-idf score, group by story, and take the first 15 rows of each group.
word_frequency_df.sort_values(by=['story','tfidf_normalized'], ascending=[True,False]).groupby(['story']).head(15)
 | story | term | term_frequency | document_frequency | idf | tfidf | tfidf_normalized
---|---|---|---|---|---|---|---
655 | 01: The Girl Who Raised Pigeons | betsy | 44 | 1.0 | 3.014903 | 132.655733 | 0.106417 |
3317 | 01: The Girl Who Raised Pigeons | jenny | 42 | 1.0 | 3.014903 | 126.625927 | 0.101580 |
212 | 01: The Girl Who Raised Pigeons | ann | 45 | 2.0 | 2.609438 | 117.424706 | 0.094199 |
5566 | 01: The Girl Who Raised Pigeons | robert | 36 | 1.0 | 3.014903 | 108.536509 | 0.087069 |
1384 | 01: The Girl Who Raised Pigeons | coop | 28 | 1.0 | 3.014903 | 84.417285 | 0.067720 |
7887 | 01: The Girl Who Raised Pigeons | would | 84 | 14.0 | 1.000000 | 84.000000 | 0.067385 |
5053 | 01: The Girl Who Raised Pigeons | pigeons | 30 | 2.0 | 2.609438 | 78.283137 | 0.062799 |
4238 | 01: The Girl Who Raised Pigeons | miss | 46 | 10.0 | 1.310155 | 60.267127 | 0.048347 |
688 | 01: The Girl Who Raised Pigeons | birds | 29 | 5.0 | 1.916291 | 55.572431 | 0.044581 |
1191 | 01: The Girl Who Raised Pigeons | clara | 17 | 1.0 | 3.014903 | 51.253351 | 0.041116 |
5622 | 01: The Girl Who Raised Pigeons | said | 47 | 13.0 | 1.068993 | 50.242665 | 0.040305 |
5350 | 01: The Girl Who Raised Pigeons | ralph | 17 | 2.0 | 2.609438 | 44.360445 | 0.035586 |
4189 | 01: The Girl Who Raised Pigeons | miles | 21 | 4.0 | 2.098612 | 44.070858 | 0.035354 |
5052 | 01: The Girl Who Raised Pigeons | pigeon | 14 | 1.0 | 3.014903 | 42.208642 | 0.033860 |
6837 | 01: The Girl Who Raised Pigeons | thelma | 14 | 1.0 | 3.014903 | 42.208642 | 0.033860 |
4340 | 02: The First Day | mother | 42 | 13.0 | 1.068993 | 44.897701 | 0.036017 |
7772 | 02: The First Day | woman | 23 | 14.0 | 1.000000 | 23.000000 | 0.018451 |
8659 | 02: The First Day | takes | 7 | 2.0 | 2.609438 | 18.266065 | 0.014653 |
3895 | 02: The First Day | looks | 6 | 2.0 | 2.609438 | 15.656627 | 0.012560 |
8568 | 02: The First Day | says | 7 | 5.0 | 1.916291 | 13.414035 | 0.010761 |
8210 | 02: The First Day | form | 6 | 4.0 | 2.098612 | 12.591674 | 0.010101 |
5701 | 02: The First Day | school | 10 | 11.0 | 1.223144 | 12.231436 | 0.009812 |
8575 | 02: The First Day | seaton | 4 | 1.0 | 3.014903 | 12.059612 | 0.009674 |
4757 | 02: The First Day | one | 12 | 14.0 | 1.000000 | 12.000000 | 0.009626 |
8268 | 02: The First Day | jersey | 5 | 3.0 | 2.321756 | 11.608779 | 0.009313 |
8667 | 02: The First Day | tells | 4 | 3.0 | 2.321756 | 9.287023 | 0.007450 |
8036 | 02: The First Day | appears | 3 | 1.0 | 3.014903 | 9.044709 | 0.007256 |
8046 | 02: The First Day | asks | 3 | 1.0 | 3.014903 | 9.044709 | 0.007256 |
8089 | 02: The First Day | blondelle | 3 | 1.0 | 3.014903 | 9.044709 | 0.007256 |
8358 | 02: The First Day | mary | 3 | 1.0 | 3.014903 | 9.044709 | 0.007256 |
8974 | 03: The Night Rhonda Ferguson Was Killed | cassandra | 130 | 1.0 | 3.014903 | 391.937393 | 0.314415 |
9693 | 03: The Night Rhonda Ferguson Was Killed | melanie | 65 | 1.0 | 3.014903 | 195.968696 | 0.157207 |
8778 | 03: The Night Rhonda Ferguson Was Killed | anita | 68 | 2.0 | 2.609438 | 177.441778 | 0.142345 |
10002 | 03: The Night Rhonda Ferguson Was Killed | rhonda | 42 | 1.0 | 3.014903 | 126.625927 | 0.101580 |
5623 | 03: The Night Rhonda Ferguson Was Killed | said | 109 | 13.0 | 1.068993 | 116.520223 | 0.093473 |
9386 | 03: The Night Rhonda Ferguson Was Killed | gladys | 38 | 2.0 | 2.609438 | 99.158641 | 0.079546 |
8944 | 03: The Night Rhonda Ferguson Was Killed | car | 42 | 11.0 | 1.223144 | 51.372029 | 0.041211 |
6510 | 03: The Night Rhonda Ferguson Was Killed | street | 42 | 13.0 | 1.068993 | 44.897701 | 0.036017 |
453 | 03: The Night Rhonda Ferguson Was Killed | back | 39 | 14.0 | 1.000000 | 39.000000 | 0.031286 |
2602 | 03: The Night Rhonda Ferguson Was Killed | get | 36 | 13.0 | 1.068993 | 38.483743 | 0.030872 |
2646 | 03: The Night Rhonda Ferguson Was Killed | girls | 17 | 4.0 | 2.098612 | 35.676409 | 0.028620 |
2204 | 03: The Night Rhonda Ferguson Was Killed | father | 32 | 13.0 | 1.068993 | 34.207772 | 0.027442 |
10481 | 03: The Night Rhonda Ferguson Was Killed | wesley | 13 | 2.0 | 2.609438 | 33.922693 | 0.027213 |
9571 | 03: The Night Rhonda Ferguson Was Killed | joyce | 12 | 2.0 | 2.609438 | 31.313255 | 0.025120 |
9833 | 03: The Night Rhonda Ferguson Was Killed | pearl | 11 | 2.0 | 2.609438 | 28.703817 | 0.023026 |
10696 | 04: Young Lions | caesar | 75 | 1.0 | 3.014903 | 226.117727 | 0.181393 |
11527 | 04: Young Lions | sherman | 60 | 1.0 | 3.014903 | 180.894181 | 0.145114 |
11198 | 04: Young Lions | manny | 44 | 1.0 | 3.014903 | 132.655733 | 0.106417 |
10701 | 04: Young Lions | carol | 29 | 1.0 | 3.014903 | 87.432188 | 0.070139 |
5624 | 04: Young Lions | said | 71 | 13.0 | 1.068993 | 75.898494 | 0.060886 |
11452 | 04: Young Lions | retarded | 22 | 2.0 | 2.609438 | 57.407634 | 0.046053 |
7890 | 04: Young Lions | would | 57 | 14.0 | 1.000000 | 57.000000 | 0.045726 |
11016 | 04: Young Lions | heh | 17 | 1.0 | 3.014903 | 51.253351 | 0.041116 |
7774 | 04: Young Lions | woman | 44 | 14.0 | 1.000000 | 44.000000 | 0.035297 |
10568 | 04: Young Lions | anna | 13 | 1.0 | 3.014903 | 39.193739 | 0.031441 |
4006 | 04: Young Lions | man | 34 | 13.0 | 1.068993 | 36.345758 | 0.029157 |
454 | 04: Young Lions | back | 35 | 14.0 | 1.000000 | 35.000000 | 0.028077 |
1401 | 04: Young Lions | could | 29 | 13.0 | 1.068993 | 31.000793 | 0.024869 |
7199 | 04: Young Lions | two | 30 | 14.0 | 1.000000 | 30.000000 | 0.024066 |
2205 | 04: Young Lions | father | 28 | 13.0 | 1.068993 | 29.931800 | 0.024011 |
12577 | 05: The Store | penny | 57 | 2.0 | 2.609438 | 148.737961 | 0.119319 |
5625 | 05: The Store | said | 79 | 13.0 | 1.068993 | 84.450437 | 0.067747 |
7891 | 05: The Store | would | 79 | 14.0 | 1.000000 | 79.000000 | 0.063374 |
6479 | 05: The Store | store | 51 | 10.0 | 1.310155 | 66.817901 | 0.053602 |
12353 | 05: The Store | jenkins | 19 | 1.0 | 3.014903 | 57.283157 | 0.045953 |
12375 | 05: The Store | kentucky | 23 | 3.0 | 2.321756 | 53.400384 | 0.042838 |
12427 | 05: The Store | lonney | 17 | 1.0 | 3.014903 | 51.253351 | 0.041116 |
455 | 05: The Store | back | 50 | 14.0 | 1.000000 | 50.000000 | 0.040110 |
4760 | 05: The Store | one | 48 | 14.0 | 1.000000 | 48.000000 | 0.038506 |
4343 | 05: The Store | mother | 42 | 13.0 | 1.068993 | 44.897701 | 0.036017 |
6969 | 05: The Store | time | 42 | 14.0 | 1.000000 | 42.000000 | 0.033693 |
1552 | 05: The Store | day | 40 | 14.0 | 1.000000 | 40.000000 | 0.032088 |
1402 | 05: The Store | could | 34 | 13.0 | 1.068993 | 36.345758 | 0.029157 |
2206 | 05: The Store | father | 32 | 13.0 | 1.068993 | 34.207772 | 0.027442 |
11865 | 05: The Store | baxter | 11 | 1.0 | 3.014903 | 33.163933 | 0.026604 |
13144 | 06: An Orange Line Train to Ballston | marcus | 38 | 1.0 | 3.014903 | 114.566315 | 0.091906 |
13046 | 06: An Orange Line Train to Ballston | avis | 26 | 1.0 | 3.014903 | 78.387479 | 0.062883 |
13146 | 06: An Orange Line Train to Ballston | marvin | 24 | 1.0 | 3.014903 | 72.357672 | 0.058046 |
13145 | 06: An Orange Line Train to Ballston | marvella | 23 | 1.0 | 3.014903 | 69.342769 | 0.055627 |
5626 | 06: An Orange Line Train to Ballston | said | 64 | 13.0 | 1.068993 | 68.415544 | 0.054883 |
4008 | 06: An Orange Line Train to Ballston | man | 63 | 13.0 | 1.068993 | 67.346551 | 0.054026 |
7084 | 06: An Orange Line Train to Ballston | train | 25 | 5.0 | 1.916291 | 47.907268 | 0.038432 |
13216 | 06: An Orange Line Train to Ballston | subway | 15 | 1.0 | 3.014903 | 45.223545 | 0.036279 |
13087 | 06: An Orange Line Train to Ballston | dreadlocks | 11 | 1.0 | 3.014903 | 33.163933 | 0.026604 |
13086 | 06: An Orange Line Train to Ballston | dreadlock | 8 | 1.0 | 3.014903 | 24.119224 | 0.019349 |
3765 | 06: An Orange Line Train to Ballston | line | 18 | 10.0 | 1.310155 | 23.582789 | 0.018918 |
4800 | 06: An Orange Line Train to Ballston | orange | 11 | 4.0 | 2.098612 | 23.084735 | 0.018519 |
4344 | 06: An Orange Line Train to Ballston | mother | 17 | 13.0 | 1.068993 | 18.172879 | 0.014578 |
13050 | 06: An Orange Line Train to Ballston | ballston | 6 | 1.0 | 3.014903 | 18.089418 | 0.014511 |
3741 | 06: An Orange Line Train to Ballston | like | 17 | 14.0 | 1.000000 | 17.000000 | 0.013638 |
13627 | 07: The Sunday Following Mother's Day | madeleine | 74 | 1.0 | 3.014903 | 223.102824 | 0.178974 |
13626 | 07: The Sunday Following Mother's Day | maddie | 62 | 1.0 | 3.014903 | 186.923987 | 0.149952 |
13770 | 07: The Sunday Following Mother's Day | samuel | 41 | 1.0 | 3.014903 | 123.611024 | 0.099162 |
13769 | 07: The Sunday Following Mother's Day | sam | 34 | 1.0 | 3.014903 | 102.506703 | 0.082232 |
5627 | 07: The Sunday Following Mother's Day | said | 81 | 13.0 | 1.068993 | 86.588423 | 0.069462 |
7893 | 07: The Sunday Following Mother's Day | would | 71 | 14.0 | 1.000000 | 71.000000 | 0.056957 |
13711 | 07: The Sunday Following Mother's Day | pookie | 21 | 1.0 | 3.014903 | 63.312963 | 0.050790 |
9093 | 07: The Sunday Following Mother's Day | curtis | 16 | 2.0 | 2.609438 | 41.751007 | 0.033493 |
13281 | 07: The Sunday Following Mother's Day | arnisa | 13 | 1.0 | 3.014903 | 39.193739 | 0.031441 |
13893 | 07: The Sunday Following Mother's Day | williams | 12 | 1.0 | 3.014903 | 36.178836 | 0.029023 |
1404 | 07: The Sunday Following Mother's Day | could | 33 | 13.0 | 1.068993 | 35.276765 | 0.028299 |
1554 | 07: The Sunday Following Mother's Day | day | 35 | 14.0 | 1.000000 | 35.000000 | 0.028077 |
4009 | 07: The Sunday Following Mother's Day | man | 32 | 13.0 | 1.068993 | 34.207772 | 0.027442 |
457 | 07: The Sunday Following Mother's Day | back | 33 | 14.0 | 1.000000 | 33.000000 | 0.026473 |
4762 | 07: The Sunday Following Mother's Day | one | 32 | 14.0 | 1.000000 | 32.000000 | 0.025671 |
14079 | 08: Lost in the City | lydia | 32 | 1.0 | 3.014903 | 96.476897 | 0.077394 |
5628 | 08: Lost in the City | said | 46 | 13.0 | 1.068993 | 49.173672 | 0.039447 |
4346 | 08: Lost in the City | mother | 44 | 13.0 | 1.068993 | 47.035686 | 0.037732 |
10969 | 08: Lost in the City | georgia | 19 | 3.0 | 2.321756 | 44.113361 | 0.035388 |
972 | 08: Lost in the City | cab | 13 | 5.0 | 1.916291 | 24.911780 | 0.019984 |
9157 | 08: Lost in the City | driver | 8 | 5.0 | 1.916291 | 15.330326 | 0.012298 |
13908 | 08: Lost in the City | antibes | 5 | 1.0 | 3.014903 | 15.074515 | 0.012093 |
13976 | 08: Lost in the City | dreaming | 5 | 1.0 | 3.014903 | 15.074515 | 0.012093 |
3455 | 08: Lost in the City | know | 15 | 14.0 | 1.000000 | 15.000000 | 0.012033 |
7894 | 08: Lost in the City | would | 15 | 14.0 | 1.000000 | 15.000000 | 0.012033 |
2607 | 08: Lost in the City | get | 14 | 13.0 | 1.068993 | 14.965900 | 0.012006 |
4010 | 08: Lost in the City | man | 14 | 13.0 | 1.068993 | 14.965900 | 0.012006 |
6925 | 08: Lost in the City | thought | 14 | 14.0 | 1.000000 | 14.000000 | 0.011231 |
10189 | 08: Lost in the City | sorry | 9 | 8.0 | 1.510826 | 13.597431 | 0.010908 |
4763 | 08: Lost in the City | one | 13 | 14.0 | 1.000000 | 13.000000 | 0.010429 |
9572 | 09: His Mother's House | joyce | 84 | 2.0 | 2.609438 | 219.192785 | 0.175838 |
14513 | 09: His Mother's House | rickey | 64 | 1.0 | 3.014903 | 192.953793 | 0.154789 |
14529 | 09: His Mother's House | santiago | 54 | 1.0 | 3.014903 | 162.804763 | 0.130603 |
5629 | 09: His Mother's House | said | 96 | 13.0 | 1.068993 | 102.623316 | 0.082325 |
14384 | 09: His Mother's House | humphrey | 33 | 1.0 | 3.014903 | 99.491800 | 0.079813 |
9834 | 09: His Mother's House | pearl | 22 | 2.0 | 2.609438 | 57.407634 | 0.046053 |
14527 | 09: His Mother's House | sandy | 18 | 1.0 | 3.014903 | 54.268254 | 0.043534 |
4764 | 09: His Mother's House | one | 50 | 14.0 | 1.000000 | 50.000000 | 0.040110 |
7895 | 09: His Mother's House | would | 49 | 14.0 | 1.000000 | 49.000000 | 0.039308 |
14577 | 09: His Mother's House | smokey | 16 | 1.0 | 3.014903 | 48.238448 | 0.038697 |
3172 | 09: His Mother's House | house | 35 | 12.0 | 1.143101 | 40.008530 | 0.032095 |
3744 | 09: His Mother's House | like | 38 | 14.0 | 1.000000 | 38.000000 | 0.030484 |
4347 | 09: His Mother's House | mother | 35 | 13.0 | 1.068993 | 37.414751 | 0.030014 |
6671 | 09: His Mother's House | table | 29 | 11.0 | 1.223144 | 35.471163 | 0.028455 |
7779 | 09: His Mother's House | woman | 35 | 14.0 | 1.000000 | 35.000000 | 0.028077 |
9700 | 10: A Butterfly on F Street | mildred | 27 | 2.0 | 2.609438 | 70.454824 | 0.056519 |
7780 | 10: A Butterfly on F Street | woman | 29 | 14.0 | 1.000000 | 29.000000 | 0.023264 |
14675 | 10: A Butterfly on F Street | butterfly | 7 | 1.0 | 3.014903 | 21.104321 | 0.016930 |
14710 | 10: A Butterfly on F Street | mansfield | 6 | 1.0 | 3.014903 | 18.089418 | 0.014511 |
5630 | 10: A Butterfly on F Street | said | 16 | 13.0 | 1.068993 | 17.103886 | 0.013721 |
6517 | 10: A Butterfly on F Street | street | 14 | 13.0 | 1.068993 | 14.965900 | 0.012006 |
14713 | 10: A Butterfly on F Street | median | 4 | 1.0 | 3.014903 | 12.059612 | 0.009674 |
14750 | 10: A Butterfly on F Street | woolworth | 4 | 1.0 | 3.014903 | 12.059612 | 0.009674 |
460 | 10: A Butterfly on F Street | back | 10 | 14.0 | 1.000000 | 10.000000 | 0.008022 |
5672 | 10: A Butterfly on F Street | say | 9 | 13.0 | 1.068993 | 9.620936 | 0.007718 |
14714 | 10: A Butterfly on F Street | morton | 3 | 1.0 | 3.014903 | 9.044709 | 0.007256 |
9258 | 10: A Butterfly on F Street | f | 5 | 6.0 | 1.762140 | 8.810700 | 0.007068 |
1557 | 10: A Butterfly on F Street | day | 8 | 14.0 | 1.000000 | 8.000000 | 0.006418 |
7896 | 10: A Butterfly on F Street | would | 8 | 14.0 | 1.000000 | 8.000000 | 0.006418 |
9387 | 10: A Butterfly on F Street | gladys | 3 | 2.0 | 2.609438 | 7.828314 | 0.006280 |
15094 | 11: Gospel | vivian | 68 | 1.0 | 3.014903 | 205.013405 | 0.164463 |
14845 | 11: Gospel | diane | 42 | 1.0 | 3.014903 | 126.625927 | 0.101580 |
14943 | 11: Gospel | maude | 32 | 1.0 | 3.014903 | 96.476897 | 0.077394 |
5631 | 11: Gospel | said | 77 | 13.0 | 1.068993 | 82.312451 | 0.066032 |
8779 | 11: Gospel | anita | 25 | 2.0 | 2.609438 | 65.235948 | 0.052333 |
15005 | 11: Gospel | reverend | 18 | 2.0 | 2.609438 | 46.969882 | 0.037680 |
14895 | 11: Gospel | gospelteers | 15 | 1.0 | 3.014903 | 45.223545 | 0.036279 |
14938 | 11: Gospel | mae | 15 | 1.0 | 3.014903 | 45.223545 | 0.036279 |
2845 | 11: Gospel | group | 26 | 7.0 | 1.628609 | 42.343825 | 0.033968 |
7897 | 11: Gospel | would | 42 | 14.0 | 1.000000 | 42.000000 | 0.033693 |
1408 | 11: Gospel | could | 39 | 13.0 | 1.068993 | 41.690722 | 0.033445 |
1176 | 11: Gospel | church | 26 | 9.0 | 1.405465 | 36.542093 | 0.029314 |
14835 | 11: Gospel | counsel | 12 | 1.0 | 3.014903 | 36.178836 | 0.029023 |
7795 | 11: Gospel | women | 31 | 12.0 | 1.143101 | 35.436126 | 0.028427 |
4013 | 11: Gospel | man | 31 | 13.0 | 1.068993 | 33.138779 | 0.026584 |
15347 | 12: A New Man | woodrow | 57 | 1.0 | 3.014903 | 171.849472 | 0.137859 |
15280 | 12: A New Man | rita | 22 | 1.0 | 3.014903 | 66.327866 | 0.053209 |
7898 | 12: A New Man | would | 37 | 14.0 | 1.000000 | 37.000000 | 0.029682 |
5632 | 12: A New Man | said | 33 | 13.0 | 1.068993 | 35.276765 | 0.028299 |
4014 | 12: A New Man | man | 27 | 13.0 | 1.068993 | 28.862808 | 0.023154 |
15185 | 12: A New Man | elaine | 9 | 1.0 | 3.014903 | 27.134127 | 0.021767 |
1409 | 12: A New Man | could | 23 | 13.0 | 1.068993 | 24.586836 | 0.019724 |
2213 | 12: A New Man | father | 22 | 13.0 | 1.068993 | 23.517843 | 0.018866 |
4767 | 12: A New Man | one | 23 | 14.0 | 1.000000 | 23.000000 | 0.018451 |
15161 | 12: A New Man | cunningham | 7 | 1.0 | 3.014903 | 21.104321 | 0.016930 |
1547 | 12: A New Man | daughter | 16 | 10.0 | 1.310155 | 20.962479 | 0.016816 |
8531 | 12: A New Man | read | 12 | 7.0 | 1.628609 | 19.543304 | 0.015678 |
2029 | 12: A New Man | even | 16 | 13.0 | 1.068993 | 17.103886 | 0.013721 |
4732 | 12: A New Man | old | 16 | 13.0 | 1.068993 | 17.103886 | 0.013721 |
5300 | 12: A New Man | put | 16 | 13.0 | 1.068993 | 17.103886 | 0.013721 |
15424 | 13: A Dark Night | garrett | 38 | 1.0 | 3.014903 | 114.566315 | 0.091906 |
15361 | 13: A Dark Night | beatrice | 31 | 1.0 | 3.014903 | 93.461994 | 0.074976 |
15377 | 13: A Dark Night | carmena | 20 | 1.0 | 3.014903 | 60.298060 | 0.048371 |
5633 | 13: A Dark Night | said | 49 | 13.0 | 1.068993 | 52.380651 | 0.042020 |
10438 | 13: A Dark Night | uncle | 14 | 2.0 | 2.609438 | 36.532131 | 0.029306 |
15517 | 13: A Dark Night | thunder | 12 | 1.0 | 3.014903 | 36.178836 | 0.029023 |
15433 | 13: A Dark Night | henry | 10 | 1.0 | 3.014903 | 30.149030 | 0.024186 |
1737 | 13: A Dark Night | door | 27 | 14.0 | 1.000000 | 27.000000 | 0.021660 |
1516 | 13: A Dark Night | daddy | 17 | 8.0 | 1.510826 | 25.684036 | 0.020604 |
15442 | 13: A Dark Night | joe | 8 | 1.0 | 3.014903 | 24.119224 | 0.019349 |
4768 | 13: A Dark Night | one | 23 | 14.0 | 1.000000 | 23.000000 | 0.018451 |
15364 | 13: A Dark Night | boone | 7 | 1.0 | 3.014903 | 21.104321 | 0.016930 |
15451 | 13: A Dark Night | lightning | 7 | 1.0 | 3.014903 | 21.104321 | 0.016930 |
7899 | 13: A Dark Night | would | 21 | 14.0 | 1.000000 | 21.000000 | 0.016846 |
14055 | 13: A Dark Night | john | 10 | 4.0 | 2.098612 | 20.986123 | 0.016835 |
13633 | 14: Marie | marie | 49 | 2.0 | 2.609438 | 127.862458 | 0.102572 |
15719 | 14: Marie | vernelle | 30 | 1.0 | 3.014903 | 90.447091 | 0.072557 |
15722 | 14: Marie | wilamena | 20 | 1.0 | 3.014903 | 60.298060 | 0.048371 |
5634 | 14: Marie | said | 51 | 13.0 | 1.068993 | 54.518636 | 0.043735 |
7900 | 14: Marie | would | 36 | 14.0 | 1.000000 | 36.000000 | 0.028879 |
4016 | 14: Marie | man | 32 | 13.0 | 1.068993 | 34.207772 | 0.027442 |
11492 | 14: Marie | security | 13 | 2.0 | 2.609438 | 33.922693 | 0.027213 |
7784 | 14: Marie | woman | 31 | 14.0 | 1.000000 | 31.000000 | 0.024868 |
15662 | 14: Marie | receptionist | 10 | 1.0 | 3.014903 | 30.149030 | 0.024186 |
7027 | 14: Marie | told | 25 | 13.0 | 1.068993 | 26.724822 | 0.021439 |
11578 | 14: Marie | social | 10 | 2.0 | 2.609438 | 26.094379 | 0.020933 |
4769 | 14: Marie | one | 25 | 14.0 | 1.000000 | 25.000000 | 0.020055 |
15552 | 14: Marie | calhoun | 8 | 1.0 | 3.014903 | 24.119224 | 0.019349 |
15723 | 14: Marie | wise | 8 | 1.0 | 3.014903 | 24.119224 | 0.019349 |
15037 | 14: Marie | smith | 9 | 2.0 | 2.609438 | 23.484941 | 0.018840 |
It turns out that “pigeons” is pretty unique to the first short story in Lost in the City, with a normalized tf-idf score of .062, making it one of the most distinctive words in that story along with “coop” and “birds.”
What are some other distinctive words in Lost in the City?
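One way to keep exploring is to pull the top terms for a single story with pandas’ nlargest(). For example, a sketch for the title story:
#Top 10 most distinctive terms in the title story
lost_story = word_frequency_df[word_frequency_df['story'] == '08: Lost in the City']
lost_story.nlargest(10, 'tfidf_normalized')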
Further Resources#
Peter Organisciak and Boris Capitanu, “Text Mining in Python through the HTRC Feature Reader,” The Programming Historian