TF-IDF with Scikit-Learn

In the previous lesson, we learned about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf. Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. We specifically learned how to calculate tf-idf scores using word frequencies per page—or “extracted features”—made available by the HathiTrust Digital Library.

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called TfidfVectorizer.

In this lesson, we will cover how to:

  • Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn

Dataset

U.S. Inaugural Addresses

This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.

—Barack Obama, Inaugural Presidential Address, January 2009

During Barack Obama’s Inaugural Address in January 2009, he mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we’re going to try to answer with tf-idf.

Breaking Down the TF-IDF Formula

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

tf-idf = term_frequency * inverse_document_frequency

term_frequency = number of times a given term appears in document

inverse_document_frequency = log(total number of documents / number of documents with term) + 1*

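You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). In the version above, you also take the logarithm of the flipped fraction and add 1: the logarithm dampens the boost for very rare words, and the added 1 keeps words that appear in every document from being zeroed out. Then you multiply the two numbers together (term_frequency * inverse_document_frequency).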

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories (14 / 13 → a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 14 stories (total documents) (14 / 2 → a bigger inverse document frequency, a bigger tf-idf boost).
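To make the comparison concrete, here is a minimal sketch that plugs the two document frequencies quoted above into the unsmoothed formula. It assumes the natural logarithm (the formula above does not name a base), and the function name inverse_document_frequency is only an illustrative choice.

```python
import math

def inverse_document_frequency(total_documents, documents_with_term):
    """Unsmoothed idf: log(total documents / documents with term) + 1.

    Assumption: natural log. A different base rescales the log part
    but keeps the same ordering of terms.
    """
    return math.log(total_documents / documents_with_term) + 1

# Document frequencies quoted in the text for the 14 stories
idf_said = inverse_document_frequency(14, 13)
idf_pigeons = inverse_document_frequency(14, 2)

print(f"said:    {idf_said:.4f}")     # 1.0741
print(f"pigeons: {idf_pigeons:.4f}")  # 2.9459
```

The rarer word ends up with the larger multiplier, which is exactly the boost described above.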

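*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing”: it adds 1 to both the numerator and the denominator, as if the collection contained one extra document with every term in it. This prevents division by zero for a term that appears in no documents: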

inverse_document_frequency = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
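As a rough check on what smoothing changes, the sketch below evaluates the smoothed formula on the same “said” and “pigeons” counts from the example above. It assumes the natural logarithm, and smoothed_idf is a hypothetical helper name, not a scikit-learn function.

```python
import math

def smoothed_idf(total_documents, documents_with_term):
    """Smoothed idf: log((1 + N) / (1 + df)) + 1, using the natural log."""
    return math.log((1 + total_documents) / (1 + documents_with_term)) + 1

print(round(smoothed_idf(14, 13), 4))  # 1.069   ("said")
print(round(smoothed_idf(14, 2), 4))   # 2.6094  ("pigeons")

# A term found in every document gets the minimum value of exactly 1.0
print(smoothed_idf(14, 14))            # 1.0

# A term found in no document no longer causes a division by zero
print(smoothed_idf(14, 0) > smoothed_idf(14, 2))  # True
```

Smoothing shrinks the gap between rare and common words a little, but the ordering stays the same.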

TF-IDF with scikit-learn

scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.

Install scikit-learn

!pip install scikit-learn

Import necessary modules and libraries

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
from pathlib import Path  
import glob

We’re also going to import pandas and change its default display setting. And we’re going to import two libraries that will help us work with files and the file system: pathlib and glob.

Set Directory Path

Below we’re setting the directory filepath that contains all the text files that we want to analyze.

directory_path = "../texts/history/US_Inaugural_Addresses/"

Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the address titles (the filenames without the directory or the .txt extension).

# Sort the results so the addresses stay in order (glob makes no ordering guarantee)
text_files = sorted(glob.glob(f"{directory_path}*.txt"))
text_files
['../texts/history/US_Inaugural_Addresses/01_washington_1789.txt',
 '../texts/history/US_Inaugural_Addresses/02_washington_1793.txt',
 '../texts/history/US_Inaugural_Addresses/03_adams_john_1797.txt',
 '../texts/history/US_Inaugural_Addresses/04_jefferson_1801.txt',
 '../texts/history/US_Inaugural_Addresses/05_jefferson_1805.txt',
 '../texts/history/US_Inaugural_Addresses/06_madison_1809.txt',
 '../texts/history/US_Inaugural_Addresses/07_madison_1813.txt',
 '../texts/history/US_Inaugural_Addresses/08_monroe_1817.txt',
 '../texts/history/US_Inaugural_Addresses/09_monroe_1821.txt',
 '../texts/history/US_Inaugural_Addresses/10_adams_john_quincy_1825.txt',
 '../texts/history/US_Inaugural_Addresses/11_jackson_1829.txt',
 '../texts/history/US_Inaugural_Addresses/12_jackson_1833.txt',
 '../texts/history/US_Inaugural_Addresses/13_van_buren_1837.txt',
 '../texts/history/US_Inaugural_Addresses/14_harrison_1841.txt',
 '../texts/history/US_Inaugural_Addresses/15_polk_1845.txt',
 '../texts/history/US_Inaugural_Addresses/16_taylor_1849.txt',
 '../texts/history/US_Inaugural_Addresses/17_pierce_1853.txt',
 '../texts/history/US_Inaugural_Addresses/18_buchanan_1857.txt',
 '../texts/history/US_Inaugural_Addresses/19_lincoln_1861.txt',
 '../texts/history/US_Inaugural_Addresses/20_lincoln_1865.txt',
 '../texts/history/US_Inaugural_Addresses/21_grant_1869.txt',
 '../texts/history/US_Inaugural_Addresses/22_grant_1873.txt',
 '../texts/history/US_Inaugural_Addresses/23_hayes_1877.txt',
 '../texts/history/US_Inaugural_Addresses/24_garfield_1881.txt',
 '../texts/history/US_Inaugural_Addresses/25_cleveland_1885.txt',
 '../texts/history/US_Inaugural_Addresses/26_harrison_1889.txt',
 '../texts/history/US_Inaugural_Addresses/27_cleveland_1893.txt',
 '../texts/history/US_Inaugural_Addresses/28_mckinley_1897.txt',
 '../texts/history/US_Inaugural_Addresses/29_mckinley_1901.txt',
 '../texts/history/US_Inaugural_Addresses/30_roosevelt_theodore_1905.txt',
 '../texts/history/US_Inaugural_Addresses/31_taft_1909.txt',
 '../texts/history/US_Inaugural_Addresses/32_wilson_1913.txt',
 '../texts/history/US_Inaugural_Addresses/33_wilson_1917.txt',
 '../texts/history/US_Inaugural_Addresses/34_harding_1921.txt',
 '../texts/history/US_Inaugural_Addresses/35_coolidge_1925.txt',
 '../texts/history/US_Inaugural_Addresses/36_hoover_1929.txt',
 '../texts/history/US_Inaugural_Addresses/37_roosevelt_franklin_1933.txt',
 '../texts/history/US_Inaugural_Addresses/38_roosevelt_franklin_1937.txt',
 '../texts/history/US_Inaugural_Addresses/39_roosevelt_franklin_1941.txt',
 '../texts/history/US_Inaugural_Addresses/40_roosevelt_franklin_1945.txt',
 '../texts/history/US_Inaugural_Addresses/41_truman_1949.txt',
 '../texts/history/US_Inaugural_Addresses/42_eisenhower_1953.txt',
 '../texts/history/US_Inaugural_Addresses/43_eisenhower_1957.txt',
 '../texts/history/US_Inaugural_Addresses/44_kennedy_1961.txt',
 '../texts/history/US_Inaugural_Addresses/45_johnson_1965.txt',
 '../texts/history/US_Inaugural_Addresses/46_nixon_1969.txt',
 '../texts/history/US_Inaugural_Addresses/47_nixon_1973.txt',
 '../texts/history/US_Inaugural_Addresses/48_carter_1977.txt',
 '../texts/history/US_Inaugural_Addresses/49_reagan_1981.txt',
 '../texts/history/US_Inaugural_Addresses/50_reagan_1985.txt',
 '../texts/history/US_Inaugural_Addresses/51_bush_george_h_w_1989.txt',
 '../texts/history/US_Inaugural_Addresses/52_clinton_1993.txt',
 '../texts/history/US_Inaugural_Addresses/53_clinton_1997.txt',
 '../texts/history/US_Inaugural_Addresses/54_bush_george_w_2001.txt',
 '../texts/history/US_Inaugural_Addresses/55_bush_george_w_2005.txt',
 '../texts/history/US_Inaugural_Addresses/56_obama_2009.txt',
 '../texts/history/US_Inaugural_Addresses/57_obama_2013.txt',
 '../texts/history/US_Inaugural_Addresses/58_trump_2017.txt']
text_titles = [Path(text).stem for text in text_files]
text_titles
['01_washington_1789',
 '02_washington_1793',
 '03_adams_john_1797',
 '04_jefferson_1801',
 '05_jefferson_1805',
 '06_madison_1809',
 '07_madison_1813',
 '08_monroe_1817',
 '09_monroe_1821',
 '10_adams_john_quincy_1825',
 '11_jackson_1829',
 '12_jackson_1833',
 '13_van_buren_1837',
 '14_harrison_1841',
 '15_polk_1845',
 '16_taylor_1849',
 '17_pierce_1853',
 '18_buchanan_1857',
 '19_lincoln_1861',
 '20_lincoln_1865',
 '21_grant_1869',
 '22_grant_1873',
 '23_hayes_1877',
 '24_garfield_1881',
 '25_cleveland_1885',
 '26_harrison_1889',
 '27_cleveland_1893',
 '28_mckinley_1897',
 '29_mckinley_1901',
 '30_roosevelt_theodore_1905',
 '31_taft_1909',
 '32_wilson_1913',
 '33_wilson_1917',
 '34_harding_1921',
 '35_coolidge_1925',
 '36_hoover_1929',
 '37_roosevelt_franklin_1933',
 '38_roosevelt_franklin_1937',
 '39_roosevelt_franklin_1941',
 '40_roosevelt_franklin_1945',
 '41_truman_1949',
 '42_eisenhower_1953',
 '43_eisenhower_1957',
 '44_kennedy_1961',
 '45_johnson_1965',
 '46_nixon_1969',
 '47_nixon_1973',
 '48_carter_1977',
 '49_reagan_1981',
 '50_reagan_1985',
 '51_bush_george_h_w_1989',
 '52_clinton_1993',
 '53_clinton_1997',
 '54_bush_george_w_2001',
 '55_bush_george_w_2005',
 '56_obama_2009',
 '57_obama_2013',
 '58_trump_2017']
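For a closer look at what Path(...).stem does, here is a small sketch on a single filepath taken from the list above. Splitting the title into a number, a name, and a year is an extra step that assumes every filename follows the number_name_year pattern seen in this collection.

```python
from pathlib import Path

filepath = "../texts/history/US_Inaugural_Addresses/56_obama_2009.txt"

# .stem keeps only the filename, without the directory or the ".txt" extension
title = Path(filepath).stem

# Assumption: titles look like "<number>_<name parts>_<year>"
number, *name_parts, year = title.split("_")
president = " ".join(name_parts)

print(title)      # 56_obama_2009
print(president)  # obama
print(year)       # 2009
```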

Calculate tf–idf

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf=True) and normalization (norm='l2') turned on. L2 normalization rescales each document’s scores so that longer texts don’t get higher scores simply for containing more words, and smoothing keeps the idf calculation well behaved. Together they produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
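To see why L2 normalization helps with differences in text length, here is a minimal pure-Python sketch. The two score vectors are hypothetical, and l2_normalize is an illustrative helper rather than the code scikit-learn runs internally.

```python
import math

def l2_normalize(scores):
    """Divide every score by the vector's Euclidean (L2) length."""
    length = math.sqrt(sum(score ** 2 for score in scores))
    if length == 0:
        return list(scores)  # an all-zero document stays all zero
    return [score / length for score in scores]

# Hypothetical raw tf-idf scores: the same word proportions, but the "long"
# document is ten times longer, so its raw scores are ten times larger.
short_document = [3.0, 4.0, 0.0]
long_document = [30.0, 40.0, 0.0]

print(l2_normalize(short_document))  # [0.6, 0.8, 0.0]
print(l2_normalize(long_document))   # [0.6, 0.8, 0.0]
```

After normalization the two documents look identical, so scores can be compared across texts of very different lengths.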

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our text_files

tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
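It can help to try the same two steps on something small enough to read by eye. The sketch below runs TfidfVectorizer on three made-up one-line “documents” passed directly as strings (the default input='content') instead of filenames. The toy texts and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny made-up "documents" (hypothetical, for illustration only)
toy_documents = [
    "war war peace",
    "peace people",
    "people government peace",
]

# With the default input='content', we pass the text itself, not filenames
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_documents)

# vocabulary_ maps each term to its column index in the matrix
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
scores = toy_matrix.toarray()

for document, row in zip(toy_documents, scores):
    rounded = {term: round(float(score), 2) for term, score in zip(terms, row)}
    print(document, "->", rounded)
```

In the first toy document, “war” should score well above “peace”: it occurs twice there and nowhere else, while “peace” occurs in all three documents.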

Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

Add a row for document frequency, aka the number of documents in which each word appears

tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)
government borders people obama war honor foreign men women children
00_Document Frequency 53.00 5.00 56.00 3.00 45.00 32.00 32.00 47.00 15.00 22.00
01_washington_1789 0.11 0.00 0.05 0.00 0.00 0.00 0.00 0.02 0.00 0.00
02_washington_1793 0.06 0.00 0.05 0.00 0.00 0.08 0.00 0.00 0.00 0.00
03_adams_john_1797 0.16 0.00 0.19 0.00 0.01 0.10 0.12 0.04 0.00 0.00
04_jefferson_1801 0.16 0.00 0.01 0.00 0.01 0.04 0.00 0.04 0.00 0.00
05_jefferson_1805 0.03 0.00 0.00 0.00 0.04 0.00 0.06 0.01 0.00 0.02
06_madison_1809 0.00 0.00 0.02 0.00 0.02 0.05 0.05 0.00 0.00 0.00
07_madison_1813 0.04 0.00 0.04 0.00 0.25 0.02 0.02 0.00 0.00 0.00
08_monroe_1817 0.17 0.00 0.11 0.00 0.09 0.01 0.10 0.04 0.00 0.00
09_monroe_1821 0.08 0.00 0.06 0.00 0.11 0.02 0.04 0.01 0.00 0.01
10_adams_john_quincy_1825 0.15 0.00 0.06 0.00 0.05 0.01 0.08 0.03 0.00 0.00
11_jackson_1829 0.10 0.00 0.06 0.00 0.02 0.02 0.07 0.02 0.00 0.00
12_jackson_1833 0.21 0.00 0.14 0.00 0.00 0.00 0.02 0.00 0.00 0.00
13_van_buren_1837 0.12 0.00 0.14 0.00 0.02 0.02 0.06 0.02 0.00 0.01
14_harrison_1841 0.14 0.00 0.14 0.00 0.01 0.02 0.03 0.03 0.00 0.00
15_polk_1845 0.26 0.00 0.08 0.00 0.03 0.01 0.09 0.02 0.00 0.01
16_taylor_1849 0.12 0.00 0.05 0.00 0.00 0.02 0.05 0.00 0.00 0.00
17_pierce_1853 0.08 0.00 0.05 0.00 0.00 0.02 0.04 0.01 0.00 0.03
18_buchanan_1857 0.12 0.00 0.11 0.00 0.08 0.01 0.04 0.03 0.00 0.05
19_lincoln_1861 0.12 0.00 0.13 0.00 0.02 0.00 0.02 0.00 0.00 0.00
20_lincoln_1865 0.02 0.00 0.00 0.00 0.27 0.00 0.00 0.04 0.00 0.00
21_grant_1869 0.05 0.00 0.03 0.00 0.02 0.05 0.05 0.02 0.00 0.00
22_grant_1873 0.06 0.00 0.10 0.00 0.05 0.02 0.00 0.00 0.00 0.00
23_hayes_1877 0.17 0.00 0.08 0.00 0.00 0.00 0.04 0.02 0.00 0.00
24_garfield_1881 0.19 0.00 0.16 0.00 0.05 0.00 0.00 0.01 0.00 0.04
25_cleveland_1885 0.21 0.00 0.21 0.00 0.00 0.00 0.05 0.01 0.00 0.00
26_harrison_1889 0.06 0.00 0.17 0.00 0.02 0.03 0.01 0.04 0.00 0.00
27_cleveland_1893 0.15 0.00 0.22 0.00 0.00 0.00 0.00 0.04 0.00 0.00
28_mckinley_1897 0.16 0.00 0.16 0.00 0.05 0.03 0.06 0.02 0.00 0.00
29_mckinley_1901 0.15 0.00 0.12 0.00 0.08 0.04 0.01 0.01 0.00 0.00
30_roosevelt_theodore_1905 0.05 0.00 0.10 0.00 0.00 0.00 0.00 0.08 0.00 0.10
31_taft_1909 0.12 0.00 0.03 0.00 0.03 0.01 0.03 0.03 0.00 0.00
32_wilson_1913 0.11 0.00 0.02 0.00 0.00 0.00 0.00 0.14 0.10 0.06
33_wilson_1917 0.00 0.00 0.08 0.00 0.07 0.00 0.00 0.05 0.00 0.00
34_harding_1921 0.08 0.00 0.05 0.00 0.12 0.00 0.00 0.01 0.00 0.00
35_coolidge_1925 0.10 0.00 0.10 0.00 0.02 0.01 0.02 0.02 0.02 0.00
36_hoover_1929 0.20 0.04 0.10 0.00 0.01 0.00 0.01 0.03 0.03 0.01
37_roosevelt_franklin_1933 0.03 0.00 0.08 0.00 0.02 0.02 0.03 0.02 0.00 0.00
38_roosevelt_franklin_1937 0.18 0.03 0.12 0.00 0.01 0.00 0.00 0.10 0.07 0.02
39_roosevelt_franklin_1941 0.05 0.00 0.08 0.00 0.00 0.00 0.00 0.07 0.03 0.00
40_roosevelt_franklin_1945 0.00 0.00 0.02 0.00 0.05 0.03 0.00 0.10 0.05 0.04
41_truman_1949 0.03 0.00 0.10 0.00 0.02 0.01 0.01 0.06 0.00 0.00
42_eisenhower_1953 0.01 0.00 0.10 0.00 0.04 0.03 0.00 0.07 0.00 0.00
43_eisenhower_1957 0.00 0.00 0.10 0.00 0.01 0.05 0.00 0.07 0.00 0.00
44_kennedy_1961 0.00 0.00 0.01 0.00 0.06 0.00 0.00 0.01 0.00 0.00
45_johnson_1965 0.01 0.00 0.11 0.00 0.01 0.00 0.02 0.03 0.00 0.05
46_nixon_1969 0.05 0.00 0.13 0.00 0.03 0.03 0.00 0.01 0.00 0.00
47_nixon_1973 0.10 0.00 0.06 0.00 0.03 0.01 0.00 0.00 0.00 0.02
48_carter_1977 0.06 0.00 0.08 0.00 0.02 0.00 0.02 0.00 0.00 0.00
49_reagan_1981 0.16 0.00 0.08 0.00 0.01 0.00 0.00 0.02 0.04 0.09
50_reagan_1985 0.16 0.00 0.14 0.00 0.01 0.01 0.00 0.03 0.04 0.00
51_bush_george_h_w_1989 0.05 0.00 0.06 0.00 0.03 0.00 0.01 0.04 0.06 0.07
52_clinton_1993 0.05 0.00 0.13 0.00 0.03 0.00 0.02 0.01 0.02 0.06
53_clinton_1997 0.09 0.00 0.09 0.00 0.01 0.00 0.00 0.00 0.02 0.10
54_bush_george_w_2001 0.05 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.08
55_bush_george_w_2005 0.03 0.06 0.05 0.00 0.00 0.04 0.00 0.02 0.04 0.00
56_obama_2009 0.03 0.03 0.07 0.03 0.02 0.01 0.00 0.04 0.08 0.05
57_obama_2013 0.04 0.00 0.11 0.04 0.04 0.00 0.00 0.04 0.04 0.06
58_trump_2017 0.04 0.11 0.11 0.12 0.00 0.00 0.05 0.03 0.05 0.04
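The document frequency row works because comparing a DataFrame to 0 yields a table of True/False values, and summing each column counts the True values. A minimal sketch with a hypothetical three-document table:

```python
import pandas as pd

# Hypothetical tf-idf scores for three documents and three terms
toy_df = pd.DataFrame(
    [[0.5, 0.0, 0.1],
     [0.2, 0.0, 0.0],
     [0.3, 0.4, 0.0]],
    index=["doc_a", "doc_b", "doc_c"],
    columns=["people", "borders", "war"],
)

# (toy_df > 0) is a True/False table; summing a column counts its Trues
document_frequency = (toy_df > 0).sum()
print(document_frequency)
```

Here “people” has a nonzero score in all three documents, while “borders” and “war” each appear in only one.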

Let’s drop “00_Document Frequency” since we were just using it for illustration purposes.

tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let’s reorganize the DataFrame so that the words are in rows rather than columns.

tfidf_df.stack().reset_index()
level_0 level_1 0
0 01_washington_1789 000 0.000000
1 01_washington_1789 03 0.000000
2 01_washington_1789 04 0.023259
3 01_washington_1789 05 0.000000
4 01_washington_1789 100 0.000000
... ... ... ...
530936 00_Document Frequency zachary 1.000000
530937 00_Document Frequency zeal 8.000000
530938 00_Document Frequency zealous 5.000000
530939 00_Document Frequency zealously 6.000000
530940 00_Document Frequency zone 1.000000

530941 rows × 3 columns

tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
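The reshaping is easier to follow on a tiny table. This sketch applies the same stack, reset_index, and rename steps to a hypothetical two-document DataFrame whose index has no name, which is why pandas labels the new columns level_0 and level_1.

```python
import pandas as pd

# Hypothetical "wide" table: one row per document, one column per term
wide_df = pd.DataFrame(
    [[0.5, 0.0],
     [0.2, 0.4]],
    index=["doc_a", "doc_b"],
    columns=["people", "war"],
)

# stack() moves the column labels into the index, giving one row per
# (document, term) pair; reset_index() turns that index back into columns
long_df = wide_df.stack().reset_index()
long_df = long_df.rename(columns={0: "tfidf", "level_0": "document", "level_1": "term"})

print(long_df)
```

The result has one row for each document-term pair, which is the shape that sorting and grouping operations expect.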

To find out the top 10 words with the highest tf–idf for every address, we’re going to sort by document and tfidf score, then group by document and take the first 10 rows of each group.
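The sort-then-group idea can be tried on a toy table first. The sketch below uses hypothetical scores and keeps the top 2 terms per document instead of the top 10.

```python
import pandas as pd

# Hypothetical long-format scores, deliberately out of order
toy_long_df = pd.DataFrame({
    "document": ["doc_b", "doc_a", "doc_a", "doc_b", "doc_a", "doc_b"],
    "term":     ["war", "people", "war", "peace", "peace", "people"],
    "tfidf":    [0.1, 0.5, 0.3, 0.6, 0.2, 0.4],
})

top_2 = (
    toy_long_df
    # documents in ascending order, scores in descending order within each
    .sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)  # keep the first 2 rows of every group
)
print(top_2)
```

Because the rows are already sorted when they are grouped, head(2) returns each document’s two highest-scoring terms. The same chain is applied to the full table next.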

tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
document term tfidf
3707 01_washington_1789 government 0.113681
4108 01_washington_1789 immutable 0.103883
4175 01_washington_1789 impressions 0.103883
6337 01_washington_1789 providential 0.103883
5631 01_washington_1789 ought 0.103728
6351 01_washington_1789 public 0.103102
6117 01_washington_1789 present 0.097516
6389 01_washington_1789 qualifications 0.096372
5811 01_washington_1789 peculiarly 0.090546
653 01_washington_1789 article 0.085786
9018 02_washington_1793 1793 0.229350
9643 02_washington_1793 arrive 0.229350
17576 02_washington_1793 upbraidings 0.229350
13250 02_washington_1793 incurring 0.208140
17700 02_washington_1793 violated 0.208140
17872 02_washington_1793 willingly 0.208140
13368 02_washington_1793 injunctions 0.193091
13705 02_washington_1793 knowingly 0.193091
15157 02_washington_1793 previous 0.193091
17910 02_washington_1793 witnesses 0.193091
23821 03_adams_john_1797 people 0.191180
21705 03_adams_john_1797 government 0.160937
23959 03_adams_john_1797 pleasing 0.147066
21462 03_adams_john_1797 foreign 0.116874
23356 03_adams_john_1797 nations 0.114480
26711 03_adams_john_1797 virtuous 0.110813
22016 03_adams_john_1797 houses 0.110300
22799 03_adams_john_1797 legislatures 0.110300
19759 03_adams_john_1797 constitution 0.104525
21983 03_adams_john_1797 honor 0.102265
30704 04_jefferson_1801 government 0.155691
33171 04_jefferson_1801 principle 0.130113
31814 04_jefferson_1801 let 0.117970
34063 04_jefferson_1801 safety 0.108427
32010 04_jefferson_1801 man 0.106841
35095 04_jefferson_1801 thousandth 0.104513
30979 04_jefferson_1801 honest 0.101696
30327 04_jefferson_1801 fellow 0.097240
33893 04_jefferson_1801 retire 0.094848
32574 04_jefferson_1801 opinion 0.092587
42347 05_jefferson_1805 public 0.180456
39257 05_jefferson_1805 false 0.135863
43618 05_jefferson_1805 state 0.121514
44836 05_jefferson_1805 whatsoever 0.116886
40872 05_jefferson_1805 limits 0.107085
37368 05_jefferson_1805 citizens 0.106592
42486 05_jefferson_1805 reason 0.104438
37481 05_jefferson_1805 comforts 0.101880
42131 05_jefferson_1805 press 0.101549
39138 05_jefferson_1805 expenses 0.096524
49179 06_madison_1809 improvements 0.152559
45923 06_madison_1809 belligerent 0.123161
51346 06_madison_1809 public 0.122235
50353 06_madison_1809 nations 0.104588
51728 06_madison_1809 rendered 0.101706
45782 06_madison_1809 authorities 0.089155
45793 06_madison_1809 avail 0.089155
48041 06_madison_1809 examples 0.089155
46898 06_madison_1809 councils 0.085894
50558 06_madison_1809 ones 0.085894
62759 07_madison_1813 war 0.254249
55101 07_madison_1813 british 0.222972
59062 07_madison_1813 massacre 0.119009
55204 07_madison_1813 captives 0.108003
55976 07_madison_1813 cruel 0.108003
60175 07_madison_1813 prisoners 0.108003
61096 07_madison_1813 savage 0.108003
56740 07_madison_1813 element 0.085005
56841 07_madison_1813 enemy 0.085005
57980 07_madison_1813 honorable 0.084762
70620 08_monroe_1817 states 0.184195
66700 08_monroe_1817 government 0.174125
66734 08_monroe_1817 great 0.160658
71462 08_monroe_1817 union 0.117193
68816 08_monroe_1817 people 0.112825
71465 08_monroe_1817 united 0.112076
65026 08_monroe_1817 dangers 0.108567
68360 08_monroe_1817 naval 0.104713
66457 08_monroe_1817 foreign 0.103460
69168 08_monroe_1817 principles 0.097766
75733 09_monroe_1821 great 0.173751
79619 09_monroe_1821 states 0.137384
78909 09_monroe_1821 revenue 0.115018
80757 09_monroe_1821 war 0.113785
77742 09_monroe_1821 parties 0.109318
80464 09_monroe_1821 united 0.108029
73496 09_monroe_1821 commerce 0.105001
75444 09_monroe_1821 force 0.102947
75490 09_monroe_1821 fortifications 0.098741
80026 09_monroe_1821 term 0.094808
89460 10_adams_john_quincy_1825 union 0.257335
84698 10_adams_john_quincy_1825 government 0.147726
84633 10_adams_john_quincy_1825 general 0.109221
87960 10_adams_john_quincy_1825 rights 0.096300
83509 10_adams_john_quincy_1825 dissensions 0.095289
87342 10_adams_john_quincy_1825 public 0.094573
82752 10_adams_john_quincy_1825 constitution 0.090300
86792 10_adams_john_quincy_1825 peace 0.088183
82909 10_adams_john_quincy_1825 country 0.086898
86829 10_adams_john_quincy_1825 performance 0.085565
96341 11_jackson_1829 public 0.160747
93633 11_jackson_1829 generally 0.122711
92371 11_jackson_1829 diffidence 0.112691
92130 11_jackson_1829 defending 0.105878
97272 11_jackson_1829 shall 0.104933
96907 11_jackson_1829 revenue 0.102776
98938 11_jackson_1829 worth 0.100312
93697 11_jackson_1829 government 0.099698
93306 11_jackson_1829 federal 0.093100
96034 11_jackson_1829 power 0.092071
107458 12_jackson_1833 union 0.212766
102696 12_jackson_1833 government 0.207559
106616 12_jackson_1833 states 0.141549
104812 12_jackson_1833 people 0.136557
105112 12_jackson_1833 preservation 0.128319
102631 12_jackson_1833 general 0.125422
102081 12_jackson_1833 exercise 0.119275
103234 12_jackson_1833 inculcate 0.116720
105283 12_jackson_1833 proportion 0.116720
105036 12_jackson_1833 powers 0.113757
112415 13_van_buren_1837 institutions 0.186889
113811 13_van_buren_1837 people 0.138465
111695 13_van_buren_1837 government 0.116561
115860 13_van_buren_1837 supposed 0.109949
109906 13_van_buren_1837 country 0.109276
108231 13_van_buren_1837 actual 0.096382
111132 13_van_buren_1837 experience 0.093444
108255 13_van_buren_1837 adherence 0.083833
109627 13_van_buren_1837 conduct 0.081635
113566 13_van_buren_1837 opinions 0.081597
123031 14_harrison_1841 power 0.204207
118748 14_harrison_1841 constitution 0.183336
120073 14_harrison_1841 executive 0.157153
122810 14_harrison_1841 people 0.141584
120694 14_harrison_1841 government 0.141142
124000 14_harrison_1841 roman 0.110538
124614 14_harrison_1841 states 0.108621
118359 14_harrison_1841 citizens 0.105857
118292 14_harrison_1841 character 0.102640
124609 14_harrison_1841 state 0.094976
134455 15_polk_1845 union 0.259054
129693 15_polk_1845 government 0.256967
133613 15_polk_1845 states 0.218122
134041 15_polk_1845 texas 0.199846
132903 15_polk_1845 revenue 0.146541
132033 15_polk_1845 powers 0.124655
132307 15_polk_1845 protection 0.107385
127747 15_polk_1845 constitution 0.106528
130463 15_polk_1845 interests 0.105054
129169 15_polk_1845 extended 0.090179
142267 16_taylor_1849 shall 0.266204
138692 16_taylor_1849 government 0.118031
137649 16_taylor_1849 duties 0.117893
140457 16_taylor_1849 object 0.104293
136675 16_taylor_1849 congress 0.103865
141353 16_taylor_1849 purity 0.101793
143651 16_taylor_1849 vested 0.101793
140091 16_taylor_1849 measures 0.101637
136903 16_taylor_1849 country 0.101169
135322 16_taylor_1849 affections 0.097017
147825 17_pierce_1853 hardly 0.114001
150028 17_pierce_1853 power 0.102456
149999 17_pierce_1853 position 0.086643
145746 17_pierce_1853 constitutional 0.086105
147111 17_pierce_1853 expect 0.084436
147691 17_pierce_1853 government 0.084048
144526 17_pierce_1853 apparent 0.080332
150609 17_pierce_1853 regarded 0.080332
151266 17_pierce_1853 shall 0.079615
148847 17_pierce_1853 like 0.079229
160610 18_buchanan_1857 states 0.208199
154744 18_buchanan_1857 constitution 0.188573
160265 18_buchanan_1857 shall 0.161784
159381 18_buchanan_1857 question 0.157007
161827 18_buchanan_1857 whilst 0.141119
161028 18_buchanan_1857 territory 0.140852
161452 18_buchanan_1857 union 0.126444
156690 18_buchanan_1857 government 0.119554
154673 18_buchanan_1857 congress 0.118357
158806 18_buchanan_1857 people 0.105501
163743 19_lincoln_1861 constitution 0.214478
170451 19_lincoln_1861 union 0.203738
163215 19_lincoln_1861 case 0.152422
169609 19_lincoln_1861 states 0.144861
167186 19_lincoln_1861 minority 0.131514
167805 19_lincoln_1861 people 0.130763
163377 19_lincoln_1861 clause 0.125738
165689 19_lincoln_1861 government 0.123837
169264 19_lincoln_1861 shall 0.123099
166740 19_lincoln_1861 law 0.122872
179746 20_lincoln_1865 war 0.267217
176516 20_lincoln_1865 offenses 0.234524
179894 20_lincoln_1865 woe 0.234524
174672 20_lincoln_1865 god 0.151269
176515 20_lincoln_1865 offense 0.141890
179856 20_lincoln_1865 wills 0.141890
171492 20_lincoln_1865 answered 0.131631
178396 20_lincoln_1865 slaves 0.123674
179450 20_lincoln_1865 union 0.114955
171426 20_lincoln_1865 altogether 0.111675
182572 21_grant_1869 dollar 0.270439
185776 21_grant_1869 paying 0.162263
182035 21_grant_1869 deal 0.152454
187507 21_grant_1869 specie 0.152454
182050 21_grant_1869 debt 0.135097
181898 21_grant_1869 country 0.127604
180302 21_grant_1869 advisable 0.116606
184745 21_grant_1869 laws 0.115834
185778 21_grant_1869 payments 0.108175
185774 21_grant_1869 pay 0.098658
195281 22_grant_1873 proposition 0.187222
191582 22_grant_1873 domingo 0.177516
196070 22_grant_1873 santo 0.177516
197205 22_grant_1873 transit 0.177516
197024 22_grant_1873 territory 0.121158
192172 22_grant_1873 extermination 0.118344
196629 22_grant_1873 steam 0.118344
196981 22_grant_1873 telegraph 0.118344
190897 22_grant_1873 country 0.117529
192165 22_grant_1873 extension 0.116618
199896 23_hayes_1877 country 0.186357
201685 23_hayes_1877 government 0.167722
198890 23_hayes_1877 behalf 0.128316
204329 23_hayes_1877 public 0.123944
203966 23_hayes_1877 political 0.121034
205605 23_hayes_1877 states 0.113587
203735 23_hayes_1877 party 0.112549
200483 23_hayes_1877 dispute 0.112503
203728 23_hayes_1877 parties 0.109554
204583 23_hayes_1877 reform 0.104365
210684 24_garfield_1881 government 0.186855
212800 24_garfield_1881 people 0.162132
208738 24_garfield_1881 constitution 0.158292
214604 24_garfield_1881 states 0.135047
215446 24_garfield_1881 union 0.132321
214787 24_garfield_1881 suffrage 0.119992
212377 24_garfield_1881 negro 0.118782
207765 24_garfield_1881 authority 0.117232
208667 24_garfield_1881 congress 0.112598
211735 24_garfield_1881 law 0.103639
221799 25_cleveland_1885 people 0.210468
219683 25_cleveland_1885 government 0.209164
221727 25_cleveland_1885 partisan 0.169436
222327 25_cleveland_1885 public 0.163662
223258 25_cleveland_1885 shall 0.129498
217737 25_cleveland_1885 constitution 0.127856
220453 25_cleveland_1885 interests 0.118207
219180 25_cleveland_1885 extravagance 0.111416
217346 25_cleveland_1885 citizen 0.102825
223691 25_cleveland_1885 strife 0.101661
230798 26_harrison_1889 people 0.172358
229740 26_harrison_1889 laws 0.154418
232602 26_harrison_1889 states 0.138614
225821 26_harrison_1889 ballot 0.137159
231326 26_harrison_1889 public 0.128566
230136 26_harrison_1889 methods 0.119162
232257 26_harrison_1889 shall 0.118483
228544 26_harrison_1889 friendly 0.104267
227985 26_harrison_1889 european 0.103349
226736 26_harrison_1889 constitution 0.089360
239797 27_cleveland_1893 people 0.221563
237681 27_cleveland_1893 government 0.148364
237556 27_cleveland_1893 frugality 0.128050
240325 27_cleveland_1893 public 0.102520
241226 27_cleveland_1893 service 0.101813
241840 27_cleveland_1893 support 0.099946
234436 27_cleveland_1893 american 0.097267
234215 27_cleveland_1893 activity 0.095964
237682 27_cleveland_1893 governmental 0.095964
235893 27_cleveland_1893 countrymen 0.088564
244663 28_mckinley_1897 congress 0.188773
249890 28_mckinley_1897 revenue 0.168489
248796 28_mckinley_1897 people 0.161797
246680 28_mckinley_1897 government 0.156633
247872 28_mckinley_1897 loans 0.149356
247770 28_mckinley_1897 legislation 0.126367
249324 28_mckinley_1897 public 0.107057
244127 28_mckinley_1897 business 0.106759
246714 28_mckinley_1897 great 0.105322
249904 28_mckinley_1897 revision 0.099571
256568 29_mckinley_1901 islands 0.216480
253961 29_mckinley_1901 cuba 0.206329
255679 29_mckinley_1901 government 0.153681
255058 29_mckinley_1901 executive 0.147843
256326 29_mckinley_1901 inhabitants 0.147374
253662 29_mckinley_1901 congress 0.141999
257795 29_mckinley_1901 people 0.116839
259599 29_mckinley_1901 states 0.102186
260444 29_mckinley_1901 united 0.100439
258071 29_mckinley_1901 preparation 0.097925
267599 30_roosevelt_theodore_1905 regards 0.199163
267166 30_roosevelt_theodore_1905 problems 0.182463
268947 30_roosevelt_theodore_1905 tasks 0.150068
261589 30_roosevelt_theodore_1905 aright 0.146306
267754 30_roosevelt_theodore_1905 republic 0.121428
265817 30_roosevelt_theodore_1905 life 0.118701
262219 30_roosevelt_theodore_1905 cause 0.116483
264188 30_roosevelt_theodore_1905 faced 0.115730
262608 30_roosevelt_theodore_1905 conditions 0.115373
269866 30_roosevelt_theodore_1905 wish 0.106771
274471 31_taft_1909 interstate 0.206957
271124 31_taft_1909 business 0.201378
277940 31_taft_1909 tariff 0.154802
275370 31_taft_1909 negro 0.153669
277469 31_taft_1909 south 0.129384
273677 31_taft_1909 government 0.121451
276256 31_taft_1909 proper 0.114684
276381 31_taft_1909 race 0.113413
273292 31_taft_1909 feeling 0.111255
271156 31_taft_1909 canal 0.110879
282710 32_wilson_1913 great 0.158659
284100 32_wilson_1913 men 0.142924
282236 32_wilson_1913 familiar 0.141669
286642 32_wilson_1913 stirred 0.141669
286709 32_wilson_1913 studied 0.141669
287046 32_wilson_1913 things 0.123915
283635 32_wilson_1913 justice 0.105759
282676 32_wilson_1913 government 0.105520
283815 32_wilson_1913 life 0.102168
283892 32_wilson_1913 look 0.100095
296864 33_wilson_1917 wished 0.228593
289872 33_wilson_1917 counsel 0.174639
294339 33_wilson_1917 purpose 0.152933
288203 33_wilson_1917 action 0.149960
295250 33_wilson_1917 shall 0.134404
296059 33_wilson_1917 thought 0.126568
295577 33_wilson_1917 stand 0.121313
295227 33_wilson_1917 set 0.111215
293960 33_wilson_1917 politics 0.108510
290605 33_wilson_1917 drawn 0.104783
305905 34_harding_1921 world 0.196268
298346 34_harding_1921 civilization 0.157095
297428 34_harding_1921 america 0.155684
305732 34_harding_1921 war 0.120631
303639 34_harding_1921 relationship 0.118846
303750 34_harding_1921 republic 0.117160
302572 34_harding_1921 order 0.110128
305356 34_harding_1921 understanding 0.109915
302380 34_harding_1921 new 0.097567
297436 34_harding_1921 amid 0.095077
307884 35_coolidge_1925 country 0.120814
311597 35_coolidge_1925 ought 0.116721
312745 35_coolidge_1925 represents 0.114495
313945 35_coolidge_1925 tax 0.112826
309707 35_coolidge_1925 great 0.109908
312254 35_coolidge_1925 property 0.108285
311723 35_coolidge_1925 party 0.107300
313580 35_coolidge_1925 stands 0.107257
311767 35_coolidge_1925 peace 0.104170
311789 35_coolidge_1925 people 0.101306
322807 36_hoover_1929 sup 0.296865
318672 36_hoover_1929 government 0.202690
317821 36_hoover_1929 enforcement 0.194371
315025 36_hoover_1929 18th 0.134706
321207 36_hoover_1929 progress 0.132406
318281 36_hoover_1929 federal 0.126183
319026 36_hoover_1929 ideals 0.113418
316119 36_hoover_1929 business 0.108323
319730 36_hoover_1929 laws 0.107051
320766 36_hoover_1929 peace 0.103883
327891 37_roosevelt_franklin_1933 helped 0.215644
328736 37_roosevelt_franklin_1933 leadership 0.191084
331673 37_roosevelt_franklin_1933 stricken 0.129390
326744 37_roosevelt_franklin_1933 emergency 0.123225
326401 37_roosevelt_franklin_1933 discipline 0.117971
330806 37_roosevelt_franklin_1933 respects 0.117971
329235 37_roosevelt_franklin_1933 money 0.113371
329318 37_roosevelt_franklin_1933 national 0.110570
330522 37_roosevelt_franklin_1933 recovery 0.102349
324199 37_roosevelt_franklin_1933 action 0.097007
335172 38_roosevelt_franklin_1937 democracy 0.178041
336670 38_roosevelt_franklin_1937 government 0.177222
338146 38_roosevelt_franklin_1937 millions 0.140722
338671 38_roosevelt_franklin_1937 paint 0.121461
338786 38_roosevelt_franklin_1937 people 0.115789
335664 38_roosevelt_franklin_1937 economic 0.114184
339954 38_roosevelt_franklin_1937 road 0.112944
339205 38_roosevelt_franklin_1937 progress 0.104463
335255 38_roosevelt_franklin_1937 despair 0.100302
338316 38_roosevelt_franklin_1937 nation 0.099688
344171 39_roosevelt_franklin_1941 democracy 0.244486
346666 39_roosevelt_franklin_1941 know 0.189060
349486 39_roosevelt_franklin_1941 speaks 0.183385
343032 39_roosevelt_franklin_1941 br 0.163241
347315 39_roosevelt_franklin_1941 nation 0.162241
342423 39_roosevelt_franklin_1941 america 0.140133
346808 39_roosevelt_franklin_1941 life 0.117815
349517 39_roosevelt_franklin_1941 spirit 0.114445
345513 39_roosevelt_franklin_1941 freedom 0.109295
342035 39_roosevelt_franklin_1941 1941 0.108911
355738 40_roosevelt_franklin_1945 learned 0.300396
359010 40_roosevelt_franklin_1945 test 0.194731
351035 40_roosevelt_franklin_1945 1945 0.189849
358243 40_roosevelt_franklin_1945 shall 0.173637
359224 40_roosevelt_franklin_1945 trend 0.172292
356197 40_roosevelt_franklin_1945 mistakes 0.159835
356762 40_roosevelt_franklin_1945 peace 0.159442
359109 40_roosevelt_franklin_1945 today 0.154299
359551 40_roosevelt_franklin_1945 upward 0.150172
354582 40_roosevelt_franklin_1945 gain 0.142278
368898 41_truman_1949 world 0.196051
365318 41_truman_1949 nations 0.194029
366200 41_truman_1949 program 0.171656
365784 41_truman_1949 peoples 0.166989
362169 41_truman_1949 democracy 0.154140
363511 41_truman_1949 freedom 0.149297
361488 41_truman_1949 communism 0.147134
365761 41_truman_1949 peace 0.144167
361877 41_truman_1949 countries 0.137013
366518 41_truman_1949 recovery 0.135785
372507 42_eisenhower_1953 free 0.205803
372209 42_eisenhower_1953 faith 0.154561
377897 42_eisenhower_1953 world 0.146449
374783 42_eisenhower_1953 peoples 0.139466
375182 42_eisenhower_1953 productivity 0.133826
376658 42_eisenhower_1953 strength 0.130430
374760 42_eisenhower_1953 peace 0.123845
372510 42_eisenhower_1953 freedom 0.123320
376241 42_eisenhower_1953 shall 0.105970
372929 42_eisenhower_1953 hold 0.105028
386896 43_eisenhower_1957 world 0.193893
381509 43_eisenhower_1957 freedom 0.179599
385154 43_eisenhower_1957 seek 0.176008
383316 43_eisenhower_1957 nations 0.175538
383782 43_eisenhower_1957 peoples 0.158270
385681 43_eisenhower_1957 strives 0.146428
383759 43_eisenhower_1957 peace 0.136639
381884 43_eisenhower_1957 help 0.132688
380526 43_eisenhower_1957 divided 0.115450
383273 43_eisenhower_1957 mr 0.111886
391774 44_kennedy_1961 let 0.267869
394306 44_kennedy_1961 sides 0.262849
392921 44_kennedy_1961 pledge 0.160960
387632 44_kennedy_1961 ask 0.107713
387864 44_kennedy_1961 begin 0.106495
388991 44_kennedy_1961 dare 0.106495
395895 44_kennedy_1961 world 0.103110
390313 44_kennedy_1961 final 0.102311
392370 44_kennedy_1961 new 0.096600
390120 44_kennedy_1961 explore 0.094223
397251 45_johnson_1965 change 0.276090
397887 45_johnson_1965 covenant 0.242891
400969 45_johnson_1965 man 0.174391
401031 45_johnson_1965 mastery 0.153532
401309 45_johnson_1965 nation 0.152475
404425 45_johnson_1965 union 0.150512
401508 45_johnson_1965 old 0.129184
404267 45_johnson_1965 trying 0.109663
401779 45_johnson_1965 people 0.108677
399814 45_johnson_1965 harvest 0.102355
413682 46_nixon_1969 voices 0.208854
410756 46_nixon_1969 peace 0.144624
409772 46_nixon_1969 let 0.140977
407641 46_nixon_1969 earth 0.139513
409659 46_nixon_1969 know 0.137969
409968 46_nixon_1969 man 0.135416
410778 46_nixon_1969 people 0.131270
413893 46_nixon_1969 world 0.128264
411901 46_nixon_1969 rhetoric 0.119219
408467 46_nixon_1969 forward 0.113215
414415 47_nixon_1973 america 0.307074
418771 47_nixon_1973 let 0.282212
419755 47_nixon_1973 peace 0.211567
420963 47_nixon_1973 role 0.190395
422892 47_nixon_1973 world 0.177760
419938 47_nixon_1973 policies 0.176224
420803 47_nixon_1973 responsibility 0.164016
419367 47_nixon_1973 new 0.158606
414102 47_nixon_1973 abroad 0.154815
417932 47_nixon_1973 home 0.126653
424023 48_carter_1977 br 0.222574
428306 48_carter_1977 nation 0.191717
425593 48_carter_1977 dream 0.181515
430652 48_carter_1977 strength 0.147104
428366 48_carter_1977 new 0.142111
428117 48_carter_1977 micah 0.118797
431014 48_carter_1977 thee 0.107811
430508 48_carter_1977 spirit 0.107000
426975 48_carter_1977 human 0.101203
425827 48_carter_1977 enhance 0.100016
435659 49_reagan_1981 government 0.162397
432416 49_reagan_1981 americans 0.156895
435894 49_reagan_1981 heroes 0.137410
432873 49_reagan_1981 believe 0.136126
440605 49_reagan_1981 ve 0.115339
438175 49_reagan_1981 productivity 0.104753
440770 49_reagan_1981 weapon 0.104753
435503 49_reagan_1981 freedom 0.102964
434594 49_reagan_1981 dreams 0.101106
440100 49_reagan_1981 today 0.093813
444658 50_reagan_1985 government 0.161165
444502 50_reagan_1985 freedom 0.159998
446405 50_reagan_1985 nuclear 0.153623
449604 50_reagan_1985 ve 0.153623
449770 50_reagan_1985 weapons 0.140173
446774 50_reagan_1985 people 0.137038
449889 50_reagan_1985 world 0.127236
444916 50_reagan_1985 history 0.104777
444973 50_reagan_1985 human 0.104777
448172 50_reagan_1985 senator 0.102416
452556 51_bush_george_h_w_1989 don 0.186313
451038 51_bush_george_h_w_1989 breeze 0.184416
455363 51_bush_george_h_w_1989 new 0.137266
453520 51_bush_george_h_w_1989 friends 0.136820
452560 51_bush_george_h_w_1989 door 0.133889
458875 51_bush_george_h_w_1989 word 0.131722
455265 51_bush_george_h_w_1989 mr 0.126821
453763 51_bush_george_h_w_1989 hand 0.125086
450968 51_bush_george_h_w_1989 blowing 0.110649
458027 51_bush_george_h_w_1989 things 0.110609
459410 52_clinton_1993 america 0.318908
467887 52_clinton_1993 world 0.226715
459413 52_clinton_1993 americans 0.206865
467097 52_clinton_1993 today 0.185539
460244 52_clinton_1993 change 0.170522
465686 52_clinton_1993 renewal 0.136867
466112 52_clinton_1993 season 0.136867
463005 52_clinton_1993 idea 0.134993
463766 52_clinton_1993 let 0.132521
464772 52_clinton_1993 people 0.129272
469224 53_clinton_1997 century 0.321300
473361 53_clinton_1997 new 0.279600
468409 53_clinton_1997 america 0.199997
474208 53_clinton_1997 promise 0.164327
476886 53_clinton_1997 world 0.135071
472675 53_clinton_1997 land 0.131027
473301 53_clinton_1997 nation 0.117062
468412 53_clinton_1997 americans 0.115029
476082 53_clinton_1997 time 0.108057
472765 53_clinton_1997 let 0.105270
484633 54_bush_george_w_2001 story 0.341166
477408 54_bush_george_w_2001 america 0.193152
478325 54_bush_george_w_2001 civility 0.160853
482300 54_bush_george_w_2001 nation 0.130448
477286 54_bush_george_w_2001 affirm 0.120640
481008 54_bush_george_w_2001 ideals 0.109491
477411 54_bush_george_w_2001 americans 0.108207
483207 54_bush_george_w_2001 promise 0.108207
478491 54_bush_george_w_2001 compassion 0.107388
478319 54_bush_george_w_2001 citizens 0.106730
489497 55_bush_george_w_2005 freedom 0.349948
486407 55_bush_george_w_2005 america 0.284882
490786 55_bush_george_w_2005 liberty 0.174494
486410 55_bush_george_w_2005 americans 0.140443
494272 55_bush_george_w_2005 tyranny 0.127272
493148 55_bush_george_w_2005 seen 0.110386
491299 55_bush_george_w_2005 nation 0.096199
487194 55_bush_george_w_2005 cause 0.092545
489911 55_bush_george_w_2005 history 0.092422
487126 55_bush_george_w_2005 came 0.091988
495406 56_obama_2009 america 0.148351
500298 56_obama_2009 nation 0.120229
500358 56_obama_2009 new 0.118002
503093 56_obama_2009 today 0.114792
498590 56_obama_2009 generation 0.100654
499762 56_obama_2009 let 0.091100
499578 56_obama_2009 jobs 0.090727
496911 56_obama_2009 crisis 0.087235
498779 56_obama_2009 hard 0.084859
503861 56_obama_2009 women 0.084859
508585 57_obama_2013 journey 0.167591
505899 57_obama_2013 creed 0.139659
507589 57_obama_2013 generation 0.127260
504405 57_obama_2013 america 0.125044
505509 57_obama_2013 complete 0.114891
510741 57_obama_2013 requires 0.114891
509767 57_obama_2013 people 0.110351
512078 57_obama_2013 time 0.105563
512092 57_obama_2013 today 0.103668
506970 57_obama_2013 evident 0.100896
513404 58_trump_2017 america 0.350162
515585 58_trump_2017 dreams 0.156436
513405 58_trump_2017 american 0.149226
517576 58_trump_2017 jobs 0.142766
519262 58_trump_2017 protected 0.132439
518409 58_trump_2017 obama 0.120288
518766 58_trump_2017 people 0.112370
521001 58_trump_2017 thank 0.109171
513989 58_trump_2017 borders 0.107075
521596 58_trump_2017 ve 0.107075

# Keep the ten highest-scoring terms for each document
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
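
To see what the chained expression above is doing, here is a minimal sketch on a tiny made-up dataframe. The data and the names `toy_df`, `sorted_df`, and `top2` are assumptions for illustration; only the column names mirror the lesson's `tfidf_df`.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lesson's tfidf_df.
toy_df = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b", "b"],
    "term": ["peace", "war", "nation", "jobs", "america", "today"],
    "tfidf": [0.10, 0.30, 0.20, 0.05, 0.40, 0.15],
})

# Sort documents alphabetically and, within each document, scores high-to-low.
sorted_df = toy_df.sort_values(by=["document", "tfidf"], ascending=[True, False])

# head(n) on a groupby keeps the first n rows of each group,
# so after sorting these are the n highest-scoring terms per document.
top2 = sorted_df.groupby("document").head(2)
print(top2)
```

Because the sort happens before the grouping, `head(2)` here (or `head(10)` in the lesson) simply takes the rows that are already at the top of each document's block.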

We can zoom in on particular words and particular documents.

# Find every top-ten row whose term contains 'women'
top_tfidf[top_tfidf['term'].str.contains('women')]
document term tfidf
503861 56_obama_2009 women 0.084859

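It turns out that the term “women” makes the top ten for only one address: Obama’s 2009 Inaugural Address, where it ties with “hard” for tenth place with a tf-idf score of about 0.085. In that sense it is fairly distinctive, though several other terms in the same address score higher.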
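
One caveat about the `str.contains('women')` filter above: it matches any term that contains those letters, not just the exact word, and it only searches the top-ten table. To compare a single term across every document, one option is an exact comparison against the full dataframe. The sketch below uses made-up scores, and `toy_df` is a hypothetical stand-in for the lesson's `tfidf_df`.

```python
import pandas as pd

# Hypothetical scores for illustration only; not taken from the inaugural corpus.
toy_df = pd.DataFrame({
    "document": ["doc_1", "doc_1", "doc_2", "doc_2", "doc_3"],
    "term": ["women", "womenfolk", "women", "nation", "women"],
    "tfidf": [0.08, 0.02, 0.01, 0.12, 0.00],
})

# Substring match: also picks up "womenfolk".
substring_hits = toy_df[toy_df["term"].str.contains("women")]

# Exact match: only rows whose term is exactly "women".
exact_hits = toy_df[toy_df["term"] == "women"]

# Rank documents by their score for the term, highest first.
ranked = exact_hits.sort_values(by="tfidf", ascending=False)
print(ranked)
```

Running the same exact-match filter on the full `tfidf_df` rather than `top_tfidf` would show how the term scores in every address, including those where it falls outside the top ten.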


# Show the top terms for both of Obama's addresses (2009 and 2013)
top_tfidf[top_tfidf['document'].str.contains('obama')]
document term tfidf
495406 56_obama_2009 america 0.148351
500298 56_obama_2009 nation 0.120229
500358 56_obama_2009 new 0.118002
503093 56_obama_2009 today 0.114792
498590 56_obama_2009 generation 0.100654
499762 56_obama_2009 let 0.091100
499578 56_obama_2009 jobs 0.090727
496911 56_obama_2009 crisis 0.087235
498779 56_obama_2009 hard 0.084859
503861 56_obama_2009 women 0.084859
508585 57_obama_2013 journey 0.167591
505899 57_obama_2013 creed 0.139659
507589 57_obama_2013 generation 0.127260
504405 57_obama_2013 america 0.125044
505509 57_obama_2013 complete 0.114891
510741 57_obama_2013 requires 0.114891
509767 57_obama_2013 people 0.110351
512078 57_obama_2013 time 0.105563
512092 57_obama_2013 today 0.103668
506970 57_obama_2013 evident 0.100896

# Show the top terms for Trump's 2017 address
top_tfidf[top_tfidf['document'].str.contains('trump')]
document term tfidf
513404 58_trump_2017 america 0.350162
515585 58_trump_2017 dreams 0.156436
513405 58_trump_2017 american 0.149226
517576 58_trump_2017 jobs 0.142766
519262 58_trump_2017 protected 0.132439
518409 58_trump_2017 obama 0.120288
518766 58_trump_2017 people 0.112370
521001 58_trump_2017 thank 0.109171
513989 58_trump_2017 borders 0.107075
521596 58_trump_2017 ve 0.107075

# Show the top terms for Kennedy's 1961 address
top_tfidf[top_tfidf['document'].str.contains('kennedy')]
document term tfidf
391774 44_kennedy_1961 let 0.267869
394306 44_kennedy_1961 sides 0.262849
392921 44_kennedy_1961 pledge 0.160960
387632 44_kennedy_1961 ask 0.107713
387864 44_kennedy_1961 begin 0.106495
388991 44_kennedy_1961 dare 0.106495
395895 44_kennedy_1961 world 0.103110
390313 44_kennedy_1961 final 0.102311
392370 44_kennedy_1961 new 0.096600
390120 44_kennedy_1961 explore 0.094223
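
The lookups above all repeat the same pattern, so it can be convenient to wrap it in a small function. The helper below is a sketch, not part of the lesson's code: the name `top_terms_for`, its parameters, and the toy data are all assumptions for illustration.

```python
import pandas as pd

def top_terms_for(df, name_fragment, n=10):
    """Return the n highest-scoring rows per document whose name contains name_fragment."""
    # regex=False treats the fragment as plain text rather than a regular expression.
    mask = df["document"].str.contains(name_fragment, regex=False)
    return (
        df[mask]
        .sort_values(by=["document", "tfidf"], ascending=[True, False])
        .groupby("document")
        .head(n)
    )

# Hypothetical toy data; only the column names mirror the lesson's tfidf_df.
toy_df = pd.DataFrame({
    "document": ["01_alpha_1900", "01_alpha_1900", "02_beta_1904", "02_beta_1904"],
    "term": ["peace", "war", "jobs", "trade"],
    "tfidf": [0.2, 0.3, 0.1, 0.4],
})

result = top_terms_for(toy_df, "beta", n=1)
print(result)
```

As with the `obama` lookup above, a fragment that appears in more than one file name returns the top terms for each matching document.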

Your Turn!

Take a few minutes to explore the dataframe above and then answer the following questions.

1. What is the difference between a tf-idf score and raw word frequency?

Your answer here

2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?

Your answer here

3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?

Your answer here