TF-IDF with Scikit-Learn

In the previous lesson, we learned about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf. Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. We specifically learned how to calculate tf-idf scores using word frequencies per page—or “extracted features”—made available by the HathiTrust Digital Library.

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called TfidfVectorizer.

We will cover how to:

  • Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn

Dataset

U.S. Inaugural Addresses

This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.

—Barack Obama, Inaugural Presidential Address, January 2009

In his Inaugural Address in January 2009, Barack Obama mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to the addresses of all other U.S. presidents? This is one of the questions that we’re going to try to answer with tf-idf.

Breaking Down the TF-IDF Formula

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

tf-idf = term_frequency * inverse_document_frequency

term_frequency = number of times a given term appears in document

inverse_document_frequency = log(total number of documents / number of documents with term) + 1*

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), flip that fraction on its head, take its logarithm, and add 1 (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories, so its flipped fraction is 14 / 13, a smaller inverse document frequency. The term “pigeons” occurs in only 2 (document frequency) of the 14 stories (total documents), so its flipped fraction is 14 / 2, a bigger inverse document frequency and a bigger tf-idf boost.

*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing,” meaning that it adds 1 to both the numerator and the denominator:

inverse_document_frequency = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
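To make the arithmetic concrete, here is a small sketch that plugs the “said” and “pigeons” counts from above into both versions of the formula. It assumes the natural logarithm, which is what scikit-learn uses; the function names are made up for illustration.

```python
import math

def idf_plain(total_docs, docs_with_term):
    """Unsmoothed idf: log(N / df) + 1."""
    return math.log(total_docs / docs_with_term) + 1

def idf_smooth(total_docs, docs_with_term):
    """Smoothed idf: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + total_docs) / (1 + docs_with_term)) + 1

total_docs = 14  # number of stories in the example collection

# (term, number of stories that contain it)
for term, doc_freq in [("said", 13), ("pigeons", 2)]:
    print(term, idf_plain(total_docs, doc_freq), idf_smooth(total_docs, doc_freq))
```

Under both versions “pigeons” ends up with the larger idf, so each occurrence of it counts for more than an occurrence of “said.”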

TF-IDF with scikit-learn

scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.

Install scikit-learn

!pip install scikit-learn

Import necessary modules and libraries

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
from pathlib import Path  
import glob

Pandas

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

We’re also going to import pandas and change its default display setting. And we’re going to import two libraries that will help us work with files and the file system: pathlib and glob.

Set Directory Path

Below we’re setting the directory filepath that contains all the text files that we want to analyze.

directory_path = "../texts/history/US_Inaugural_Addresses/"

Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the address titles.

text_files = glob.glob(f"{directory_path}/*.txt")
text_files
['../texts/history/US_Inaugural_Addresses/13_van_buren_1837.txt',
 '../texts/history/US_Inaugural_Addresses/47_nixon_1973.txt',
 '../texts/history/US_Inaugural_Addresses/50_reagan_1985.txt',
 '../texts/history/US_Inaugural_Addresses/53_clinton_1997.txt',
 '../texts/history/US_Inaugural_Addresses/17_pierce_1853.txt',
 '../texts/history/US_Inaugural_Addresses/14_harrison_1841.txt',
 '../texts/history/US_Inaugural_Addresses/56_obama_2009.txt',
 '../texts/history/US_Inaugural_Addresses/25_cleveland_1885.txt',
 '../texts/history/US_Inaugural_Addresses/03_adams_john_1797.txt',
 '../texts/history/US_Inaugural_Addresses/12_jackson_1833.txt',
 '../texts/history/US_Inaugural_Addresses/11_jackson_1829.txt',
 '../texts/history/US_Inaugural_Addresses/36_hoover_1929.txt',
 '../texts/history/US_Inaugural_Addresses/45_johnson_1965.txt',
 '../texts/history/US_Inaugural_Addresses/51_bush_george_h_w_1989.txt',
 '../texts/history/US_Inaugural_Addresses/21_grant_1869.txt',
 '../texts/history/US_Inaugural_Addresses/41_truman_1949.txt',
 '../texts/history/US_Inaugural_Addresses/33_wilson_1917.txt',
 '../texts/history/US_Inaugural_Addresses/49_reagan_1981.txt',
 '../texts/history/US_Inaugural_Addresses/30_roosevelt_theodore_1905.txt',
 '../texts/history/US_Inaugural_Addresses/07_madison_1813.txt',
 '../texts/history/US_Inaugural_Addresses/09_monroe_1821.txt',
 '../texts/history/US_Inaugural_Addresses/48_carter_1977.txt',
 '../texts/history/US_Inaugural_Addresses/32_wilson_1913.txt',
 '../texts/history/US_Inaugural_Addresses/19_lincoln_1861.txt',
 '../texts/history/US_Inaugural_Addresses/01_washington_1789.txt',
 '../texts/history/US_Inaugural_Addresses/29_mckinley_1901.txt',
 '../texts/history/US_Inaugural_Addresses/04_jefferson_1801.txt',
 '../texts/history/US_Inaugural_Addresses/34_harding_1921.txt',
 '../texts/history/US_Inaugural_Addresses/52_clinton_1993.txt',
 '../texts/history/US_Inaugural_Addresses/35_coolidge_1925.txt',
 '../texts/history/US_Inaugural_Addresses/39_roosevelt_franklin_1941.txt',
 '../texts/history/US_Inaugural_Addresses/28_mckinley_1897.txt',
 '../texts/history/US_Inaugural_Addresses/24_garfield_1881.txt',
 '../texts/history/US_Inaugural_Addresses/22_grant_1873.txt',
 '../texts/history/US_Inaugural_Addresses/15_polk_1845.txt',
 '../texts/history/US_Inaugural_Addresses/54_bush_george_w_2001.txt',
 '../texts/history/US_Inaugural_Addresses/02_washington_1793.txt',
 '../texts/history/US_Inaugural_Addresses/38_roosevelt_franklin_1937.txt',
 '../texts/history/US_Inaugural_Addresses/37_roosevelt_franklin_1933.txt',
 '../texts/history/US_Inaugural_Addresses/18_buchanan_1857.txt',
 '../texts/history/US_Inaugural_Addresses/16_taylor_1849.txt',
 '../texts/history/US_Inaugural_Addresses/05_jefferson_1805.txt',
 '../texts/history/US_Inaugural_Addresses/26_harrison_1889.txt',
 '../texts/history/US_Inaugural_Addresses/44_kennedy_1961.txt',
 '../texts/history/US_Inaugural_Addresses/23_hayes_1877.txt',
 '../texts/history/US_Inaugural_Addresses/20_lincoln_1865.txt',
 '../texts/history/US_Inaugural_Addresses/57_obama_2013.txt',
 '../texts/history/US_Inaugural_Addresses/10_adams_john_quincy_1825.txt',
 '../texts/history/US_Inaugural_Addresses/55_bush_george_w_2005.txt',
 '../texts/history/US_Inaugural_Addresses/27_cleveland_1893.txt',
 '../texts/history/US_Inaugural_Addresses/46_nixon_1969.txt',
 '../texts/history/US_Inaugural_Addresses/42_eisenhower_1953.txt',
 '../texts/history/US_Inaugural_Addresses/40_roosevelt_franklin_1945.txt',
 '../texts/history/US_Inaugural_Addresses/43_eisenhower_1957.txt',
 '../texts/history/US_Inaugural_Addresses/08_monroe_1817.txt',
 '../texts/history/US_Inaugural_Addresses/06_madison_1809.txt',
 '../texts/history/US_Inaugural_Addresses/58_trump_2017.txt',
 '../texts/history/US_Inaugural_Addresses/31_taft_1909.txt']
text_titles = [Path(text).stem for text in text_files]
text_titles
['13_van_buren_1837',
 '47_nixon_1973',
 '50_reagan_1985',
 '53_clinton_1997',
 '17_pierce_1853',
 '14_harrison_1841',
 '56_obama_2009',
 '25_cleveland_1885',
 '03_adams_john_1797',
 '12_jackson_1833',
 '11_jackson_1829',
 '36_hoover_1929',
 '45_johnson_1965',
 '51_bush_george_h_w_1989',
 '21_grant_1869',
 '41_truman_1949',
 '33_wilson_1917',
 '49_reagan_1981',
 '30_roosevelt_theodore_1905',
 '07_madison_1813',
 '09_monroe_1821',
 '48_carter_1977',
 '32_wilson_1913',
 '19_lincoln_1861',
 '01_washington_1789',
 '29_mckinley_1901',
 '04_jefferson_1801',
 '34_harding_1921',
 '52_clinton_1993',
 '35_coolidge_1925',
 '39_roosevelt_franklin_1941',
 '28_mckinley_1897',
 '24_garfield_1881',
 '22_grant_1873',
 '15_polk_1845',
 '54_bush_george_w_2001',
 '02_washington_1793',
 '38_roosevelt_franklin_1937',
 '37_roosevelt_franklin_1933',
 '18_buchanan_1857',
 '16_taylor_1849',
 '05_jefferson_1805',
 '26_harrison_1889',
 '44_kennedy_1961',
 '23_hayes_1877',
 '20_lincoln_1865',
 '57_obama_2013',
 '10_adams_john_quincy_1825',
 '55_bush_george_w_2005',
 '27_cleveland_1893',
 '46_nixon_1969',
 '42_eisenhower_1953',
 '40_roosevelt_franklin_1945',
 '43_eisenhower_1957',
 '08_monroe_1817',
 '06_madison_1809',
 '58_trump_2017',
 '31_taft_1909']
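The lists above come back in whatever order the file system reports, not in chronological order. As a minimal sketch, here is how Path(...).stem strips the directory and the extension, and how sorting the paths first would put the titles in order. The three paths are copied from the output above; the sorting step is an optional extra, not something this lesson depends on.

```python
from pathlib import Path

sample_files = [
    "../texts/history/US_Inaugural_Addresses/56_obama_2009.txt",
    "../texts/history/US_Inaugural_Addresses/01_washington_1789.txt",
    "../texts/history/US_Inaugural_Addresses/20_lincoln_1865.txt",
]

# .stem keeps the final path component and drops its file extension
sample_titles = [Path(path).stem for path in sorted(sample_files)]
print(sample_titles)
```

Because every filename starts with a zero-padded number, a plain alphabetical sort is also a chronological one.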

Calculate tf–idf

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf=True) and normalization (norm='l2') turned on. L2 normalization rescales each document’s scores so that long and short texts can be compared, and smoothing prevents a division by zero when computing idf; together they produce more meaningful tf–idf scores. Both are the default settings for TfidfVectorizer, so you don’t need to include any extra code to turn them on.
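To see what smoothing and L2 normalization do to the numbers, here is a hand-rolled sketch on three made-up sentences. It follows the formulas described in this lesson rather than calling scikit-learn, splits on whitespace instead of using TfidfVectorizer’s tokenizer, and skips stop-word removal, so treat it as an illustration and not a reimplementation.

```python
import math
from collections import Counter

# Three invented "documents" (not from the inaugural corpus)
docs = {
    "doc_a": "the people and the government",
    "doc_b": "the people and the war",
    "doc_c": "women and children",
}
tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))

# Smoothed idf: log((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_l2(tokens):
    """Raw tf * idf per term, then divide by the vector's Euclidean length."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    length = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / length for term, value in raw.items()}

scores = {name: tfidf_l2(tokens) for name, tokens in tokenized.items()}
for name, vector in scores.items():
    print(name, vector)
```

After normalization the squared scores in each document sum to 1, which is why a long address and a short one end up on the same scale.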

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our text_files

tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

Add a row for document frequency, that is, the number of documents in which each word appears

tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)
government borders people obama war honor foreign men women children
00_Document Frequency 53.00 5.00 56.00 3.00 45.00 32.00 32.00 47.00 15.00 22.00
01_washington_1789 0.11 0.00 0.05 0.00 0.00 0.00 0.00 0.02 0.00 0.00
02_washington_1793 0.06 0.00 0.05 0.00 0.00 0.08 0.00 0.00 0.00 0.00
03_adams_john_1797 0.16 0.00 0.19 0.00 0.01 0.10 0.12 0.04 0.00 0.00
04_jefferson_1801 0.16 0.00 0.01 0.00 0.01 0.04 0.00 0.04 0.00 0.00
05_jefferson_1805 0.03 0.00 0.00 0.00 0.04 0.00 0.06 0.01 0.00 0.02
06_madison_1809 0.00 0.00 0.02 0.00 0.02 0.05 0.05 0.00 0.00 0.00
07_madison_1813 0.04 0.00 0.04 0.00 0.25 0.02 0.02 0.00 0.00 0.00
08_monroe_1817 0.17 0.00 0.11 0.00 0.09 0.01 0.10 0.04 0.00 0.00
09_monroe_1821 0.08 0.00 0.06 0.00 0.11 0.02 0.04 0.01 0.00 0.01
10_adams_john_quincy_1825 0.15 0.00 0.06 0.00 0.05 0.01 0.08 0.03 0.00 0.00
11_jackson_1829 0.10 0.00 0.06 0.00 0.02 0.02 0.07 0.02 0.00 0.00
12_jackson_1833 0.21 0.00 0.14 0.00 0.00 0.00 0.02 0.00 0.00 0.00
13_van_buren_1837 0.12 0.00 0.14 0.00 0.02 0.02 0.06 0.02 0.00 0.01
14_harrison_1841 0.14 0.00 0.14 0.00 0.01 0.02 0.03 0.03 0.00 0.00
15_polk_1845 0.26 0.00 0.08 0.00 0.03 0.01 0.09 0.02 0.00 0.01
16_taylor_1849 0.12 0.00 0.05 0.00 0.00 0.02 0.05 0.00 0.00 0.00
17_pierce_1853 0.08 0.00 0.05 0.00 0.00 0.02 0.04 0.01 0.00 0.03
18_buchanan_1857 0.12 0.00 0.11 0.00 0.08 0.01 0.04 0.03 0.00 0.05
19_lincoln_1861 0.12 0.00 0.13 0.00 0.02 0.00 0.02 0.00 0.00 0.00
20_lincoln_1865 0.02 0.00 0.00 0.00 0.27 0.00 0.00 0.04 0.00 0.00
21_grant_1869 0.05 0.00 0.03 0.00 0.02 0.05 0.05 0.02 0.00 0.00
22_grant_1873 0.06 0.00 0.10 0.00 0.05 0.02 0.00 0.00 0.00 0.00
23_hayes_1877 0.17 0.00 0.08 0.00 0.00 0.00 0.04 0.02 0.00 0.00
24_garfield_1881 0.19 0.00 0.16 0.00 0.05 0.00 0.00 0.01 0.00 0.04
25_cleveland_1885 0.21 0.00 0.21 0.00 0.00 0.00 0.05 0.01 0.00 0.00
26_harrison_1889 0.06 0.00 0.17 0.00 0.02 0.03 0.01 0.04 0.00 0.00
27_cleveland_1893 0.15 0.00 0.22 0.00 0.00 0.00 0.00 0.04 0.00 0.00
28_mckinley_1897 0.16 0.00 0.16 0.00 0.05 0.03 0.06 0.02 0.00 0.00
29_mckinley_1901 0.15 0.00 0.12 0.00 0.08 0.04 0.01 0.01 0.00 0.00
30_roosevelt_theodore_1905 0.05 0.00 0.10 0.00 0.00 0.00 0.00 0.08 0.00 0.10
31_taft_1909 0.12 0.00 0.03 0.00 0.03 0.01 0.03 0.03 0.00 0.00
32_wilson_1913 0.11 0.00 0.02 0.00 0.00 0.00 0.00 0.14 0.10 0.06
33_wilson_1917 0.00 0.00 0.08 0.00 0.07 0.00 0.00 0.05 0.00 0.00
34_harding_1921 0.08 0.00 0.05 0.00 0.12 0.00 0.00 0.01 0.00 0.00
35_coolidge_1925 0.10 0.00 0.10 0.00 0.02 0.01 0.02 0.02 0.02 0.00
36_hoover_1929 0.20 0.04 0.10 0.00 0.01 0.00 0.01 0.03 0.03 0.01
37_roosevelt_franklin_1933 0.03 0.00 0.08 0.00 0.02 0.02 0.03 0.02 0.00 0.00
38_roosevelt_franklin_1937 0.18 0.03 0.12 0.00 0.01 0.00 0.00 0.10 0.07 0.02
39_roosevelt_franklin_1941 0.05 0.00 0.08 0.00 0.00 0.00 0.00 0.07 0.03 0.00
40_roosevelt_franklin_1945 0.00 0.00 0.02 0.00 0.05 0.03 0.00 0.10 0.05 0.04
41_truman_1949 0.03 0.00 0.10 0.00 0.02 0.01 0.01 0.06 0.00 0.00
42_eisenhower_1953 0.01 0.00 0.10 0.00 0.04 0.03 0.00 0.07 0.00 0.00
43_eisenhower_1957 0.00 0.00 0.10 0.00 0.01 0.05 0.00 0.07 0.00 0.00
44_kennedy_1961 0.00 0.00 0.01 0.00 0.06 0.00 0.00 0.01 0.00 0.00
45_johnson_1965 0.01 0.00 0.11 0.00 0.01 0.00 0.02 0.03 0.00 0.05
46_nixon_1969 0.05 0.00 0.13 0.00 0.03 0.03 0.00 0.01 0.00 0.00
47_nixon_1973 0.10 0.00 0.06 0.00 0.03 0.01 0.00 0.00 0.00 0.02
48_carter_1977 0.06 0.00 0.08 0.00 0.02 0.00 0.02 0.00 0.00 0.00
49_reagan_1981 0.16 0.00 0.08 0.00 0.01 0.00 0.00 0.02 0.04 0.09
50_reagan_1985 0.16 0.00 0.14 0.00 0.01 0.01 0.00 0.03 0.04 0.00
51_bush_george_h_w_1989 0.05 0.00 0.06 0.00 0.03 0.00 0.01 0.04 0.06 0.07
52_clinton_1993 0.05 0.00 0.13 0.00 0.03 0.00 0.02 0.01 0.02 0.06
53_clinton_1997 0.09 0.00 0.09 0.00 0.01 0.00 0.00 0.00 0.02 0.10
54_bush_george_w_2001 0.05 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.08
55_bush_george_w_2005 0.03 0.06 0.05 0.00 0.00 0.04 0.00 0.02 0.04 0.00
56_obama_2009 0.03 0.03 0.07 0.03 0.02 0.01 0.00 0.04 0.08 0.05
57_obama_2013 0.04 0.00 0.11 0.04 0.04 0.00 0.00 0.04 0.04 0.06
58_trump_2017 0.04 0.11 0.11 0.12 0.00 0.00 0.05 0.03 0.05 0.04
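The (tfidf_df > 0).sum() step works because a word only gets a non-zero tf-idf score in documents where it actually occurs. A toy version, with invented scores for three hypothetical documents, shows the mechanics:

```python
import pandas as pd

# Invented tf-idf scores; rows are documents, columns are terms
toy_df = pd.DataFrame(
    {
        "government": [0.11, 0.06, 0.03],
        "war": [0.00, 0.25, 0.02],
        "women": [0.00, 0.00, 0.08],
    },
    index=["doc_1", "doc_2", "doc_3"],
)

# (toy_df > 0) is a table of True/False values;
# summing each column counts its True cells
document_frequency = (toy_df > 0).sum()
print(document_frequency)
```

Here “government” has a score above zero in all three documents, “war” in two, and “women” in one.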

Let’s drop “00_Document Frequency” since we were just using it for illustration purposes.

tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let’s reorganize the DataFrame so that the words are in rows rather than columns.

tfidf_df.stack().reset_index()
level_0 level_1 0
0 13_van_buren_1837 000 0.000000
1 13_van_buren_1837 03 0.011681
2 13_van_buren_1837 04 0.011924
3 13_van_buren_1837 05 0.000000
4 13_van_buren_1837 100 0.000000
... ... ... ...
521937 31_taft_1909 zachary 0.000000
521938 31_taft_1909 zeal 0.000000
521939 31_taft_1909 zealous 0.000000
521940 31_taft_1909 zealously 0.000000
521941 31_taft_1909 zone 0.000000

521942 rows × 3 columns

tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
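The reshaping above is easier to follow on a table small enough to read at a glance. This sketch uses two documents and two terms, with rounded scores taken from the table earlier in this lesson; the variable names wide_df and long_df are only for illustration.

```python
import pandas as pd

# Wide layout: one row per document, one column per term
wide_df = pd.DataFrame(
    {"people": [0.05, 0.07], "women": [0.00, 0.08]},
    index=["01_washington_1789", "56_obama_2009"],
)

# stack() moves the column labels into a second index level;
# reset_index() then turns both index levels into ordinary columns
long_df = wide_df.stack().reset_index()
long_df = long_df.rename(columns={"level_0": "document", "level_1": "term", 0: "tfidf"})
print(long_df)
```

Each row of the long table is now a single (document, term, score) triple, which is the shape that sorting and grouping work best with.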

To find out the top 10 words with the highest tf–idf for every address, we’re going to sort by document and tfidf score, then group by document and take the first 10 rows of each group.
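The sort-then-group pattern can be tried on a toy table first. This sketch uses invented scores for two hypothetical documents and keeps the top 2 terms per document instead of the top 10.

```python
import pandas as pd

# Invented long-format scores: one row per (document, term) pair
toy_long = pd.DataFrame(
    {
        "document": ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b", "doc_b"],
        "term": ["war", "people", "union", "women", "children", "peace"],
        "tfidf": [0.10, 0.30, 0.20, 0.05, 0.25, 0.15],
    }
)

# Sort documents A-Z and, within each document, scores from high to low;
# head(2) then keeps the first two rows of every document group
top_terms = (
    toy_long.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)
)
print(top_terms)
```

Sorting before grouping matters here: head() simply takes the first rows of each group, so the rows have to be in score order already.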

tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
document term tfidf
219683 01_washington_1789 government 0.113681
220084 01_washington_1789 immutable 0.103883
220151 01_washington_1789 impressions 0.103883
222313 01_washington_1789 providential 0.103883
221607 01_washington_1789 ought 0.103728
222327 01_washington_1789 public 0.103102
222093 01_washington_1789 present 0.097516
222365 01_washington_1789 qualifications 0.096372
221787 01_washington_1789 peculiarly 0.090546
216629 01_washington_1789 article 0.085786
323983 02_washington_1793 1793 0.229350
324608 02_washington_1793 arrive 0.229350
332541 02_washington_1793 upbraidings 0.229350
328215 02_washington_1793 incurring 0.208140
332665 02_washington_1793 violated 0.208140
332837 02_washington_1793 willingly 0.208140
328333 02_washington_1793 injunctions 0.193091
328670 02_washington_1793 knowingly 0.193091
330122 02_washington_1793 previous 0.193091
332875 02_washington_1793 witnesses 0.193091
77815 03_adams_john_1797 people 0.191180
75699 03_adams_john_1797 government 0.160937
77953 03_adams_john_1797 pleasing 0.147066
75456 03_adams_john_1797 foreign 0.116874
77350 03_adams_john_1797 nations 0.114480
80705 03_adams_john_1797 virtuous 0.110813
76010 03_adams_john_1797 houses 0.110300
76793 03_adams_john_1797 legislatures 0.110300
73753 03_adams_john_1797 constitution 0.104525
75977 03_adams_john_1797 honor 0.102265
237681 04_jefferson_1801 government 0.155691
240148 04_jefferson_1801 principle 0.130113
238791 04_jefferson_1801 let 0.117970
241040 04_jefferson_1801 safety 0.108427
238987 04_jefferson_1801 man 0.106841
242072 04_jefferson_1801 thousandth 0.104513
237956 04_jefferson_1801 honest 0.101696
237304 04_jefferson_1801 fellow 0.097240
240870 04_jefferson_1801 retire 0.094848
239551 04_jefferson_1801 opinion 0.092587
375310 05_jefferson_1805 public 0.180456
372220 05_jefferson_1805 false 0.135863
376581 05_jefferson_1805 state 0.121514
377799 05_jefferson_1805 whatsoever 0.116886
373835 05_jefferson_1805 limits 0.107085
370331 05_jefferson_1805 citizens 0.106592
375449 05_jefferson_1805 reason 0.104438
370444 05_jefferson_1805 comforts 0.101880
375094 05_jefferson_1805 press 0.101549
372101 05_jefferson_1805 expenses 0.096524
499129 06_madison_1809 improvements 0.152559
495873 06_madison_1809 belligerent 0.123161
501296 06_madison_1809 public 0.122235
500303 06_madison_1809 nations 0.104588
501678 06_madison_1809 rendered 0.101706
495732 06_madison_1809 authorities 0.089155
495743 06_madison_1809 avail 0.089155
497991 06_madison_1809 examples 0.089155
496848 06_madison_1809 councils 0.085894
500508 06_madison_1809 ones 0.085894
179746 07_madison_1813 war 0.254249
172088 07_madison_1813 british 0.222972
176049 07_madison_1813 massacre 0.119009
172191 07_madison_1813 captives 0.108003
172963 07_madison_1813 cruel 0.108003
177162 07_madison_1813 prisoners 0.108003
178083 07_madison_1813 savage 0.108003
173727 07_madison_1813 element 0.085005
173828 07_madison_1813 enemy 0.085005
174967 07_madison_1813 honorable 0.084762
493573 08_monroe_1817 states 0.184195
489653 08_monroe_1817 government 0.174125
489687 08_monroe_1817 great 0.160658
494415 08_monroe_1817 union 0.117193
491769 08_monroe_1817 people 0.112825
494418 08_monroe_1817 united 0.112076
487979 08_monroe_1817 dangers 0.108567
491313 08_monroe_1817 naval 0.104713
489410 08_monroe_1817 foreign 0.103460
492121 08_monroe_1817 principles 0.097766
183721 09_monroe_1821 great 0.173751
187607 09_monroe_1821 states 0.137384
186897 09_monroe_1821 revenue 0.115018
188745 09_monroe_1821 war 0.113785
185730 09_monroe_1821 parties 0.109318
188452 09_monroe_1821 united 0.108029
181484 09_monroe_1821 commerce 0.105001
183432 09_monroe_1821 force 0.102947
183478 09_monroe_1821 fortifications 0.098741
188014 09_monroe_1821 term 0.094808
431422 10_adams_john_quincy_1825 union 0.257335
426660 10_adams_john_quincy_1825 government 0.147726
426595 10_adams_john_quincy_1825 general 0.109221
429922 10_adams_john_quincy_1825 rights 0.096300
425471 10_adams_john_quincy_1825 dissensions 0.095289
429304 10_adams_john_quincy_1825 public 0.094573
424714 10_adams_john_quincy_1825 constitution 0.090300
428754 10_adams_john_quincy_1825 peace 0.088183
424871 10_adams_john_quincy_1825 country 0.086898
428791 10_adams_john_quincy_1825 performance 0.085565
96341 11_jackson_1829 public 0.160747
93633 11_jackson_1829 generally 0.122711
92371 11_jackson_1829 diffidence 0.112691
92130 11_jackson_1829 defending 0.105878
97272 11_jackson_1829 shall 0.104933
96907 11_jackson_1829 revenue 0.102776
98938 11_jackson_1829 worth 0.100312
93697 11_jackson_1829 government 0.099698
93306 11_jackson_1829 federal 0.093100
96034 11_jackson_1829 power 0.092071
89460 12_jackson_1833 union 0.212766
84698 12_jackson_1833 government 0.207559
88618 12_jackson_1833 states 0.141549
86814 12_jackson_1833 people 0.136557
87114 12_jackson_1833 preservation 0.128319
84633 12_jackson_1833 general 0.125422
84083 12_jackson_1833 exercise 0.119275
85236 12_jackson_1833 inculcate 0.116720
87285 12_jackson_1833 proportion 0.116720
87038 12_jackson_1833 powers 0.113757
4427 13_van_buren_1837 institutions 0.186889
5823 13_van_buren_1837 people 0.138465
3707 13_van_buren_1837 government 0.116561
7872 13_van_buren_1837 supposed 0.109949
1918 13_van_buren_1837 country 0.109276
243 13_van_buren_1837 actual 0.096382
3144 13_van_buren_1837 experience 0.093444
267 13_van_buren_1837 adherence 0.083833
1639 13_van_buren_1837 conduct 0.081635
5578 13_van_buren_1837 opinions 0.081597
51039 14_harrison_1841 power 0.204207
46756 14_harrison_1841 constitution 0.183336
48081 14_harrison_1841 executive 0.157153
50818 14_harrison_1841 people 0.141584
48702 14_harrison_1841 government 0.141142
52008 14_harrison_1841 roman 0.110538
52622 14_harrison_1841 states 0.108621
46367 14_harrison_1841 citizens 0.105857
46300 14_harrison_1841 character 0.102640
52617 14_harrison_1841 state 0.094976
314435 15_polk_1845 union 0.259054
309673 15_polk_1845 government 0.256967
313593 15_polk_1845 states 0.218122
314021 15_polk_1845 texas 0.199846
312883 15_polk_1845 revenue 0.146541
312013 15_polk_1845 powers 0.124655
312287 15_polk_1845 protection 0.107385
307727 15_polk_1845 constitution 0.106528
310443 15_polk_1845 interests 0.105054
309149 15_polk_1845 extended 0.090179
367242 16_taylor_1849 shall 0.266204
363667 16_taylor_1849 government 0.118031
362624 16_taylor_1849 duties 0.117893
365432 16_taylor_1849 object 0.104293
361650 16_taylor_1849 congress 0.103865
366328 16_taylor_1849 purity 0.101793
368626 16_taylor_1849 vested 0.101793
365066 16_taylor_1849 measures 0.101637
361878 16_taylor_1849 country 0.101169
360297 16_taylor_1849 affections 0.097017
39837 17_pierce_1853 hardly 0.114001
42040 17_pierce_1853 power 0.102456
42011 17_pierce_1853 position 0.086643
37758 17_pierce_1853 constitutional 0.086105
39123 17_pierce_1853 expect 0.084436
39703 17_pierce_1853 government 0.084048
36538 17_pierce_1853 apparent 0.080332
42621 17_pierce_1853 regarded 0.080332
43278 17_pierce_1853 shall 0.079615
40859 17_pierce_1853 like 0.079229
358588 18_buchanan_1857 states 0.208199
352722 18_buchanan_1857 constitution 0.188573
358243 18_buchanan_1857 shall 0.161784
357359 18_buchanan_1857 question 0.157007
359805 18_buchanan_1857 whilst 0.141119
359006 18_buchanan_1857 territory 0.140852
359430 18_buchanan_1857 union 0.126444
354668 18_buchanan_1857 government 0.119554
352651 18_buchanan_1857 congress 0.118357
356784 18_buchanan_1857 people 0.105501
208738 19_lincoln_1861 constitution 0.214478
215446 19_lincoln_1861 union 0.203738
208210 19_lincoln_1861 case 0.152422
214604 19_lincoln_1861 states 0.144861
212181 19_lincoln_1861 minority 0.131514
212800 19_lincoln_1861 people 0.130763
208372 19_lincoln_1861 clause 0.125738
210684 19_lincoln_1861 government 0.123837
214259 19_lincoln_1861 shall 0.123099
211735 19_lincoln_1861 law 0.122872
413720 20_lincoln_1865 war 0.267217
410490 20_lincoln_1865 offenses 0.234524
413868 20_lincoln_1865 woe 0.234524
408646 20_lincoln_1865 god 0.151269
410489 20_lincoln_1865 offense 0.141890
413830 20_lincoln_1865 wills 0.141890
405466 20_lincoln_1865 answered 0.131631
412370 20_lincoln_1865 slaves 0.123674
413424 20_lincoln_1865 union 0.114955
405400 20_lincoln_1865 altogether 0.111675
128578 21_grant_1869 dollar 0.270439
131782 21_grant_1869 paying 0.162263
128041 21_grant_1869 deal 0.152454
133513 21_grant_1869 specie 0.152454
128056 21_grant_1869 debt 0.135097
127904 21_grant_1869 country 0.127604
126308 21_grant_1869 advisable 0.116606
130751 21_grant_1869 laws 0.115834
131784 21_grant_1869 payments 0.108175
131780 21_grant_1869 pay 0.098658
303269 22_grant_1873 proposition 0.187222
299570 22_grant_1873 domingo 0.177516
304058 22_grant_1873 santo 0.177516
305193 22_grant_1873 transit 0.177516
305012 22_grant_1873 territory 0.121158
300160 22_grant_1873 extermination 0.118344
304617 22_grant_1873 steam 0.118344
304969 22_grant_1873 telegraph 0.118344
298885 22_grant_1873 country 0.117529
300153 22_grant_1873 extension 0.116618
397874 23_hayes_1877 country 0.186357
399663 23_hayes_1877 government 0.167722
396868 23_hayes_1877 behalf 0.128316
402307 23_hayes_1877 public 0.123944
401944 23_hayes_1877 political 0.121034
403583 23_hayes_1877 states 0.113587
401713 23_hayes_1877 party 0.112549
398461 23_hayes_1877 dispute 0.112503
401706 23_hayes_1877 parties 0.109554
402561 23_hayes_1877 reform 0.104365
291675 24_garfield_1881 government 0.186855
293791 24_garfield_1881 people 0.162132
289729 24_garfield_1881 constitution 0.158292
295595 24_garfield_1881 states 0.135047
296437 24_garfield_1881 union 0.132321
295778 24_garfield_1881 suffrage 0.119992
293368 24_garfield_1881 negro 0.118782
288756 24_garfield_1881 authority 0.117232
289658 24_garfield_1881 congress 0.112598
292726 24_garfield_1881 law 0.103639
68816 25_cleveland_1885 people 0.210468
66700 25_cleveland_1885 government 0.209164
68744 25_cleveland_1885 partisan 0.169436
69344 25_cleveland_1885 public 0.163662
70275 25_cleveland_1885 shall 0.129498
64754 25_cleveland_1885 constitution 0.127856
67470 25_cleveland_1885 interests 0.118207
66197 25_cleveland_1885 extravagance 0.111416
64363 25_cleveland_1885 citizen 0.102825
70708 25_cleveland_1885 strife 0.101661
383781 26_harrison_1889 people 0.172358
382723 26_harrison_1889 laws 0.154418
385585 26_harrison_1889 states 0.138614
378804 26_harrison_1889 ballot 0.137159
384309 26_harrison_1889 public 0.128566
383119 26_harrison_1889 methods 0.119162
385240 26_harrison_1889 shall 0.118483
381527 26_harrison_1889 friendly 0.104267
380968 26_harrison_1889 european 0.103349
379719 26_harrison_1889 constitution 0.089360
446774 27_cleveland_1893 people 0.221563
444658 27_cleveland_1893 government 0.148364
444533 27_cleveland_1893 frugality 0.128050
447302 27_cleveland_1893 public 0.102520
448203 27_cleveland_1893 service 0.101813
448817 27_cleveland_1893 support 0.099946
441413 27_cleveland_1893 american 0.097267
441192 27_cleveland_1893 activity 0.095964
444659 27_cleveland_1893 governmental 0.095964
442870 27_cleveland_1893 countrymen 0.088564
280659 28_mckinley_1897 congress 0.188773
285886 28_mckinley_1897 revenue 0.168489
284792 28_mckinley_1897 people 0.161797
282676 28_mckinley_1897 government 0.156633
283868 28_mckinley_1897 loans 0.149356
283766 28_mckinley_1897 legislation 0.126367
285320 28_mckinley_1897 public 0.107057
280123 28_mckinley_1897 business 0.106759
282710 28_mckinley_1897 great 0.105322
285900 28_mckinley_1897 revision 0.099571
229571 29_mckinley_1901 islands 0.216480
226964 29_mckinley_1901 cuba 0.206329
228682 29_mckinley_1901 government 0.153681
228061 29_mckinley_1901 executive 0.147843
229329 29_mckinley_1901 inhabitants 0.147374
226665 29_mckinley_1901 congress 0.141999
230798 29_mckinley_1901 people 0.116839
232602 29_mckinley_1901 states 0.102186
233447 29_mckinley_1901 united 0.100439
231074 29_mckinley_1901 preparation 0.097925
168610 30_roosevelt_theodore_1905 regards 0.199163
168177 30_roosevelt_theodore_1905 problems 0.182463
169958 30_roosevelt_theodore_1905 tasks 0.150068
162600 30_roosevelt_theodore_1905 aright 0.146306
168765 30_roosevelt_theodore_1905 republic 0.121428
166828 30_roosevelt_theodore_1905 life 0.118701
163230 30_roosevelt_theodore_1905 cause 0.116483
165199 30_roosevelt_theodore_1905 faced 0.115730
163619 30_roosevelt_theodore_1905 conditions 0.115373
170877 30_roosevelt_theodore_1905 wish 0.106771
517444 31_taft_1909 interstate 0.206957
514097 31_taft_1909 business 0.201378
520913 31_taft_1909 tariff 0.154802
518343 31_taft_1909 negro 0.153669
520442 31_taft_1909 south 0.129384
516650 31_taft_1909 government 0.121451
519229 31_taft_1909 proper 0.114684
519354 31_taft_1909 race 0.113413
516265 31_taft_1909 feeling 0.111255
514129 31_taft_1909 canal 0.110879
201719 32_wilson_1913 great 0.158659
203109 32_wilson_1913 men 0.142924
201245 32_wilson_1913 familiar 0.141669
205651 32_wilson_1913 stirred 0.141669
205718 32_wilson_1913 studied 0.141669
206055 32_wilson_1913 things 0.123915
202644 32_wilson_1913 justice 0.105759
201685 32_wilson_1913 government 0.105520
202824 32_wilson_1913 life 0.102168
202901 32_wilson_1913 look 0.100095
152880 33_wilson_1917 wished 0.228593
145888 33_wilson_1917 counsel 0.174639
150355 33_wilson_1917 purpose 0.152933
144219 33_wilson_1917 action 0.149960
151266 33_wilson_1917 shall 0.134404
152075 33_wilson_1917 thought 0.126568
151593 33_wilson_1917 stand 0.121313
151243 33_wilson_1917 set 0.111215
149976 33_wilson_1917 politics 0.108510
146621 33_wilson_1917 drawn 0.104783
251911 34_harding_1921 world 0.196268
244352 34_harding_1921 civilization 0.157095
243434 34_harding_1921 america 0.155684
251738 34_harding_1921 war 0.120631
249645 34_harding_1921 relationship 0.118846
249756 34_harding_1921 republic 0.117160
248578 34_harding_1921 order 0.110128
251362 34_harding_1921 understanding 0.109915
248386 34_harding_1921 new 0.097567
243442 34_harding_1921 amid 0.095077
262889 35_coolidge_1925 country 0.120814
266602 35_coolidge_1925 ought 0.116721
267750 35_coolidge_1925 represents 0.114495
268950 35_coolidge_1925 tax 0.112826
264712 35_coolidge_1925 great 0.109908
267259 35_coolidge_1925 property 0.108285
266728 35_coolidge_1925 party 0.107300
268585 35_coolidge_1925 stands 0.107257
266772 35_coolidge_1925 peace 0.104170
266794 35_coolidge_1925 people 0.101306
106831 36_hoover_1929 sup 0.296865
102696 36_hoover_1929 government 0.202690
101845 36_hoover_1929 enforcement 0.194371
99049 36_hoover_1929 18th 0.134706
105231 36_hoover_1929 progress 0.132406
102305 36_hoover_1929 federal 0.126183
103050 36_hoover_1929 ideals 0.113418
100143 36_hoover_1929 business 0.108323
103754 36_hoover_1929 laws 0.107051
104790 36_hoover_1929 peace 0.103883
345889 37_roosevelt_franklin_1933 helped 0.215644
346734 37_roosevelt_franklin_1933 leadership 0.191084
349671 37_roosevelt_franklin_1933 stricken 0.129390
344742 37_roosevelt_franklin_1933 emergency 0.123225
344399 37_roosevelt_franklin_1933 discipline 0.117971
348804 37_roosevelt_franklin_1933 respects 0.117971
347233 37_roosevelt_franklin_1933 money 0.113371
347316 37_roosevelt_franklin_1933 national 0.110570
348520 37_roosevelt_franklin_1933 recovery 0.102349
342197 37_roosevelt_franklin_1933 action 0.097007
335172 38_roosevelt_franklin_1937 democracy 0.178041
336670 38_roosevelt_franklin_1937 government 0.177222
338146 38_roosevelt_franklin_1937 millions 0.140722
338671 38_roosevelt_franklin_1937 paint 0.121461
338786 38_roosevelt_franklin_1937 people 0.115789
335664 38_roosevelt_franklin_1937 economic 0.114184
339954 38_roosevelt_franklin_1937 road 0.112944
339205 38_roosevelt_franklin_1937 progress 0.104463
335255 38_roosevelt_franklin_1937 despair 0.100302
338316 38_roosevelt_franklin_1937 nation 0.099688
272179 39_roosevelt_franklin_1941 democracy 0.244486
274674 39_roosevelt_franklin_1941 know 0.189060
277494 39_roosevelt_franklin_1941 speaks 0.183385
271040 39_roosevelt_franklin_1941 br 0.163241
275323 39_roosevelt_franklin_1941 nation 0.162241
270431 39_roosevelt_franklin_1941 america 0.140133
274816 39_roosevelt_franklin_1941 life 0.117815
277525 39_roosevelt_franklin_1941 spirit 0.114445
273521 39_roosevelt_franklin_1941 freedom 0.109295
270043 39_roosevelt_franklin_1941 1941 0.108911
472725 40_roosevelt_franklin_1945 learned 0.300396
475997 40_roosevelt_franklin_1945 test 0.194731
468022 40_roosevelt_franklin_1945 1945 0.189849
475230 40_roosevelt_franklin_1945 shall 0.173637
476211 40_roosevelt_franklin_1945 trend 0.172292
473184 40_roosevelt_franklin_1945 mistakes 0.159835
473749 40_roosevelt_franklin_1945 peace 0.159442
476096 40_roosevelt_franklin_1945 today 0.154299
476538 40_roosevelt_franklin_1945 upward 0.150172
471569 40_roosevelt_franklin_1945 gain 0.142278
143923 41_truman_1949 world 0.196051
140343 41_truman_1949 nations 0.194029
141225 41_truman_1949 program 0.171656
140809 41_truman_1949 peoples 0.166989
137194 41_truman_1949 democracy 0.154140
138536 41_truman_1949 freedom 0.149297
136513 41_truman_1949 communism 0.147134
140786 41_truman_1949 peace 0.144167
136902 41_truman_1949 countries 0.137013
141543 41_truman_1949 recovery 0.135785
462497 42_eisenhower_1953 free 0.205803
462199 42_eisenhower_1953 faith 0.154561
467887 42_eisenhower_1953 world 0.146449
464773 42_eisenhower_1953 peoples 0.139466
465172 42_eisenhower_1953 productivity 0.133826
466648 42_eisenhower_1953 strength 0.130430
464750 42_eisenhower_1953 peace 0.123845
462500 42_eisenhower_1953 freedom 0.123320
466231 42_eisenhower_1953 shall 0.105970
462919 42_eisenhower_1953 hold 0.105028
485885 43_eisenhower_1957 world 0.193893
480498 43_eisenhower_1957 freedom 0.179599
484143 43_eisenhower_1957 seek 0.176008
482305 43_eisenhower_1957 nations 0.175538
482771 43_eisenhower_1957 peoples 0.158270
484670 43_eisenhower_1957 strives 0.146428
482748 43_eisenhower_1957 peace 0.136639
480873 43_eisenhower_1957 help 0.132688
479515 43_eisenhower_1957 divided 0.115450
482262 43_eisenhower_1957 mr 0.111886
391774 44_kennedy_1961 let 0.267869
394306 44_kennedy_1961 sides 0.262849
392921 44_kennedy_1961 pledge 0.160960
387632 44_kennedy_1961 ask 0.107713
387864 44_kennedy_1961 begin 0.106495
388991 44_kennedy_1961 dare 0.106495
395895 44_kennedy_1961 world 0.103110
390313 44_kennedy_1961 final 0.102311
392370 44_kennedy_1961 new 0.096600
390120 44_kennedy_1961 explore 0.094223
109283 45_johnson_1965 change 0.276090
109919 45_johnson_1965 covenant 0.242891
113001 45_johnson_1965 man 0.174391
113063 45_johnson_1965 mastery 0.153532
113341 45_johnson_1965 nation 0.152475
116457 45_johnson_1965 union 0.150512
113540 45_johnson_1965 old 0.129184
116299 45_johnson_1965 trying 0.109663
113811 45_johnson_1965 people 0.108677
111846 45_johnson_1965 harvest 0.102355
458677 46_nixon_1969 voices 0.208854
455751 46_nixon_1969 peace 0.144624
454767 46_nixon_1969 let 0.140977
452636 46_nixon_1969 earth 0.139513
454654 46_nixon_1969 know 0.137969
454963 46_nixon_1969 man 0.135416
455773 46_nixon_1969 people 0.131270
458888 46_nixon_1969 world 0.128264
456896 46_nixon_1969 rhetoric 0.119219
453462 46_nixon_1969 forward 0.113215
9460 47_nixon_1973 america 0.307074
13816 47_nixon_1973 let 0.282212
14800 47_nixon_1973 peace 0.211567
16008 47_nixon_1973 role 0.190395
17937 47_nixon_1973 world 0.177760
14983 47_nixon_1973 policies 0.176224
15848 47_nixon_1973 responsibility 0.164016
14412 47_nixon_1973 new 0.158606
9147 47_nixon_1973 abroad 0.154815
12977 47_nixon_1973 home 0.126653
190049 48_carter_1977 br 0.222574
194332 48_carter_1977 nation 0.191717
191619 48_carter_1977 dream 0.181515
196678 48_carter_1977 strength 0.147104
194392 48_carter_1977 new 0.142111
194143 48_carter_1977 micah 0.118797
197040 48_carter_1977 thee 0.107811
196534 48_carter_1977 spirit 0.107000
193001 48_carter_1977 human 0.101203
191853 48_carter_1977 enhance 0.100016
156690 49_reagan_1981 government 0.162397
153447 49_reagan_1981 americans 0.156895
156925 49_reagan_1981 heroes 0.137410
153904 49_reagan_1981 believe 0.136126
161636 49_reagan_1981 ve 0.115339
159206 49_reagan_1981 productivity 0.104753
161801 49_reagan_1981 weapon 0.104753
156534 49_reagan_1981 freedom 0.102964
155625 49_reagan_1981 dreams 0.101106
161131 49_reagan_1981 today 0.093813
21705 50_reagan_1985 government 0.161165
21549 50_reagan_1985 freedom 0.159998
23452 50_reagan_1985 nuclear 0.153623
26651 50_reagan_1985 ve 0.153623
26817 50_reagan_1985 weapons 0.140173
23821 50_reagan_1985 people 0.137038
26936 50_reagan_1985 world 0.127236
21963 50_reagan_1985 history 0.104777
22020 50_reagan_1985 human 0.104777
25219 50_reagan_1985 senator 0.102416
119593 51_bush_george_h_w_1989 don 0.186313
118075 51_bush_george_h_w_1989 breeze 0.184416
122400 51_bush_george_h_w_1989 new 0.137266
120557 51_bush_george_h_w_1989 friends 0.136820
119597 51_bush_george_h_w_1989 door 0.133889
125912 51_bush_george_h_w_1989 word 0.131722
122302 51_bush_george_h_w_1989 mr 0.126821
120800 51_bush_george_h_w_1989 hand 0.125086
118005 51_bush_george_h_w_1989 blowing 0.110649
125064 51_bush_george_h_w_1989 things 0.110609
252433 52_clinton_1993 america 0.318908
260910 52_clinton_1993 world 0.226715
252436 52_clinton_1993 americans 0.206865
260120 52_clinton_1993 today 0.185539
253267 52_clinton_1993 change 0.170522
258709 52_clinton_1993 renewal 0.136867
259135 52_clinton_1993 season 0.136867
256028 52_clinton_1993 idea 0.134993
256789 52_clinton_1993 let 0.132521
257795 52_clinton_1993 people 0.129272
28273 53_clinton_1997 century 0.321300
32410 53_clinton_1997 new 0.279600
27458 53_clinton_1997 america 0.199997
33257 53_clinton_1997 promise 0.164327
35935 53_clinton_1997 world 0.135071
31724 53_clinton_1997 land 0.131027
32350 53_clinton_1997 nation 0.117062
27461 53_clinton_1997 americans 0.115029
35131 53_clinton_1997 time 0.108057
31814 53_clinton_1997 let 0.105270
322651 54_bush_george_w_2001 story 0.341166
315426 54_bush_george_w_2001 america 0.193152
316343 54_bush_george_w_2001 civility 0.160853
320318 54_bush_george_w_2001 nation 0.130448
315304 54_bush_george_w_2001 affirm 0.120640
319026 54_bush_george_w_2001 ideals 0.109491
315429 54_bush_george_w_2001 americans 0.108207
321225 54_bush_george_w_2001 promise 0.108207
316509 54_bush_george_w_2001 compassion 0.107388
316337 54_bush_george_w_2001 citizens 0.106730
435503 55_bush_george_w_2005 freedom 0.349948
432413 55_bush_george_w_2005 america 0.284882
436792 55_bush_george_w_2005 liberty 0.174494
432416 55_bush_george_w_2005 americans 0.140443
440278 55_bush_george_w_2005 tyranny 0.127272
439154 55_bush_george_w_2005 seen 0.110386
437305 55_bush_george_w_2005 nation 0.096199
433200 55_bush_george_w_2005 cause 0.092545
435917 55_bush_george_w_2005 history 0.092422
433132 55_bush_george_w_2005 came 0.091988
54455 56_obama_2009 america 0.148351
59347 56_obama_2009 nation 0.120229
59407 56_obama_2009 new 0.118002
62142 56_obama_2009 today 0.114792
57639 56_obama_2009 generation 0.100654
58811 56_obama_2009 let 0.091100
58627 56_obama_2009 jobs 0.090727
55960 56_obama_2009 crisis 0.087235
57828 56_obama_2009 hard 0.084859
62910 56_obama_2009 women 0.084859
418595 57_obama_2013 journey 0.167591
415909 57_obama_2013 creed 0.139659
417599 57_obama_2013 generation 0.127260
414415 57_obama_2013 america 0.125044
415519 57_obama_2013 complete 0.114891
420751 57_obama_2013 requires 0.114891
419777 57_obama_2013 people 0.110351
422088 57_obama_2013 time 0.105563
422102 57_obama_2013 today 0.103668
416980 57_obama_2013 evident 0.100896
504405 58_trump_2017 america 0.350162
506586 58_trump_2017 dreams 0.156436
504406 58_trump_2017 american 0.149226
508577 58_trump_2017 jobs 0.142766
510263 58_trump_2017 protected 0.132439
509410 58_trump_2017 obama 0.120288
509767 58_trump_2017 people 0.112370
512002 58_trump_2017 thank 0.109171
504990 58_trump_2017 borders 0.107075
512597 58_trump_2017 ve 0.107075
# Save the ten highest-scoring terms per document (the table above) as top_tfidf
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
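
If the chain of sort_values, groupby, and head feels opaque, here is a minimal sketch of the same pattern on a tiny made-up dataframe. The names toy_df, toy_sorted, and toy_top and all of the scores are hypothetical and only stand in for tfidf_df; to keep the example small, we keep the top 2 rows per document instead of the top 10.

```python
import pandas as pd

# Hypothetical toy scores, standing in for the lesson's tfidf_df
toy_df = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b", "b"],
    "term": ["peace", "war", "tax", "peace", "union", "law"],
    "tfidf": [0.10, 0.30, 0.20, 0.50, 0.05, 0.40],
})

# Sort by document name (ascending), then by score (descending) within each document
toy_sorted = toy_df.sort_values(by=["document", "tfidf"], ascending=[True, False])

# head(2) on the grouped frame keeps the first 2 rows of every document group,
# which after sorting are that document's 2 highest-scoring terms
toy_top = toy_sorted.groupby("document").head(2)

print(toy_top)
```

Because the rows are sorted before grouping, "the first rows of each group" and "the highest-scoring rows of each group" are the same thing.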

We can zoom in on particular words and particular documents.

top_tfidf[top_tfidf['term'].str.contains('women')]
document term tfidf
62910 56_obama_2009 women 0.084859

It turns out that “women” is one of the ten most distinctive terms in Obama’s 2009 Inaugural Address, and it does not appear in the top ten for any other president.
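
One caveat about filters like this: str.contains matches substrings (and, by default, treats the pattern as a regular expression), so a search for 'women' would also catch any longer term that contains it. The sketch below uses a small made-up dataframe, toy_terms, with hypothetical terms and scores, to contrast a substring match with an exact match.

```python
import pandas as pd

# Hypothetical rows, shaped like top_tfidf (document, term, tfidf)
toy_terms = pd.DataFrame({
    "document": ["doc_1", "doc_2", "doc_3"],
    "term": ["women", "womens", "men"],
    "tfidf": [0.08, 0.05, 0.07],
})

# Substring match: keeps every term that contains "women" anywhere inside it
substring_matches = toy_terms[toy_terms["term"].str.contains("women")]

# Exact match: keeps only rows whose term is exactly "women"
exact_matches = toy_terms[toy_terms["term"] == "women"]

print(substring_matches)
print(exact_matches)
```

Also keep in mind that top_tfidf holds only each document's ten highest-scoring terms. A term that is missing from a top_tfidf filter may still have a nonzero score in the full tfidf_df, so that is the dataframe to query when comparing a word across every address.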

top_tfidf[top_tfidf['document'].str.contains('obama')]
document term tfidf
54455 56_obama_2009 america 0.148351
59347 56_obama_2009 nation 0.120229
59407 56_obama_2009 new 0.118002
62142 56_obama_2009 today 0.114792
57639 56_obama_2009 generation 0.100654
58811 56_obama_2009 let 0.091100
58627 56_obama_2009 jobs 0.090727
55960 56_obama_2009 crisis 0.087235
57828 56_obama_2009 hard 0.084859
62910 56_obama_2009 women 0.084859
418595 57_obama_2013 journey 0.167591
415909 57_obama_2013 creed 0.139659
417599 57_obama_2013 generation 0.127260
414415 57_obama_2013 america 0.125044
415519 57_obama_2013 complete 0.114891
420751 57_obama_2013 requires 0.114891
419777 57_obama_2013 people 0.110351
422088 57_obama_2013 time 0.105563
422102 57_obama_2013 today 0.103668
416980 57_obama_2013 evident 0.100896
top_tfidf[top_tfidf['document'].str.contains('trump')]
document term tfidf
504405 58_trump_2017 america 0.350162
506586 58_trump_2017 dreams 0.156436
504406 58_trump_2017 american 0.149226
508577 58_trump_2017 jobs 0.142766
510263 58_trump_2017 protected 0.132439
509410 58_trump_2017 obama 0.120288
509767 58_trump_2017 people 0.112370
512002 58_trump_2017 thank 0.109171
504990 58_trump_2017 borders 0.107075
512597 58_trump_2017 ve 0.107075
top_tfidf[top_tfidf['document'].str.contains('kennedy')]
document term tfidf
391774 44_kennedy_1961 let 0.267869
394306 44_kennedy_1961 sides 0.262849
392921 44_kennedy_1961 pledge 0.160960
387632 44_kennedy_1961 ask 0.107713
387864 44_kennedy_1961 begin 0.106495
388991 44_kennedy_1961 dare 0.106495
395895 44_kennedy_1961 world 0.103110
390313 44_kennedy_1961 final 0.102311
392370 44_kennedy_1961 new 0.096600
390120 44_kennedy_1961 explore 0.094223

Visualize TF-IDF

We can also visualize our TF-IDF results with the data visualization library Altair.

!pip install altair

Let’s make a heatmap that shows the highest TF-IDF scoring words for each president, and let’s put a red dot next to two terms of interest: “war” and “peace”:

The code below was contributed by Eric Monson. Thanks, Eric!

import altair as alt
import numpy as np

# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')  # fully transparent, so other terms get no visible dot
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
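
The small random offset in the code above is there because, as far as we understand the rank() window operation, tied scores would otherwise share a rank and land in the same heatmap cell. If you would rather have reproducible ranks, one alternative is to compute them in pandas before charting. The sketch below is an assumption-laden illustration, not part of Eric's code: toy_ranked is a hypothetical name, and the three rows are copied from the Obama 2009 output above, where “hard” and “women” have identical scores.

```python
import pandas as pd

# Three rows from the 56_obama_2009 output above; "hard" and "women" are tied
toy_ranked = pd.DataFrame({
    "document": ["56_obama_2009"] * 3,
    "term": ["crisis", "hard", "women"],
    "tfidf": [0.087235, 0.084859, 0.084859],
})

# method="first" gives tied scores distinct ranks, assigned in the order
# the rows appear, so no randomness is needed to separate them
toy_ranked["rank"] = (
    toy_ranked.groupby("document")["tfidf"]
    .rank(method="first", ascending=False)
    .astype(int)
)

print(toy_ranked)
```

A precomputed rank column like this could then be encoded on the x-axis directly, in place of the window transform. We have not swapped it into the chart above, so treat that last step as something to try rather than a tested recipe.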

Your Turn!

Take a few minutes to explore the dataframe above and then answer the following questions.

1. What is the difference between a tf-idf score and raw word frequency?

2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?

3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?
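
If you want something concrete to experiment with for question 1, here is a minimal sketch that computes raw counts and tf-idf scores by hand, using the unsmoothed formula from the start of this lesson. The three-document corpus and the helper names doc_freq and tf_idf are made up for illustration, and the scores are not normalized, so they will not match what scikit-learn's TfidfVectorizer returns.

```python
import math
from collections import Counter

# Hypothetical three-document corpus
docs = {
    "doc_a": "the people want peace and the people want jobs",
    "doc_b": "the nation and the world",
    "doc_c": "the women of the nation",
}
tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

def doc_freq(term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized.values())

def tf_idf(term, name):
    """term_frequency * (log(total documents / documents with term) + 1)."""
    tf = Counter(tokenized[name])[term]
    idf = math.log(n_docs / doc_freq(term)) + 1
    return tf * idf

# "the" and "people" both occur twice in doc_a, so their raw counts are equal
for term in ["the", "people"]:
    raw_count = Counter(tokenized["doc_a"])[term]
    print(term, raw_count, round(tf_idf(term, "doc_a"), 3))
```

Try changing the documents and watch what happens to a term's score as it shows up in more or fewer of them.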