TF-IDF with Scikit-Learn

In the previous lesson, we learned about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf. Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. We specifically learned how to calculate tf-idf scores using word frequencies per page—or “extracted features”—made available by the HathiTrust Digital Library.

In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty class called TfidfVectorizer.

In this lesson, we will cover how to:

Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn

Dataset

U.S. Inaugural Addresses

This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.

—Barack Obama, Inaugural Presidential Address, January 2009

During Barack Obama’s Inaugural Address in January 2009, he mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we’re going to try to answer with tf-idf.

Breaking Down the TF-IDF Formula

But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.

tf-idf = term_frequency * inverse_document_frequency

term_frequency = number of times a given term appears in document

inverse_document_frequency = log(total number of documents / number of documents with term) + 1*

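You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). In the formula above, you also take the logarithm of the flipped fraction and add 1: the logarithm dampens the boost for very rare words, and the added 1 keeps words that appear in every document from being zeroed out. Then you multiply the two numbers together (term_frequency * inverse_document_frequency).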

The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs. the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories (14 / 13 -> a smaller inverse document frequency), while the term “pigeons” occurs in only 2 (document frequency) of the 14 stories (14 / 2 -> a bigger inverse document frequency, and a bigger tf-idf boost).

*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing”: it adds 1 to both the numerator and the denominator, as if the collection contained one extra document with every term in it, which prevents division by zero:

inverse_document_frequency = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
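To make the two idf formulas concrete, here is a minimal sketch that plugs in the “said” and “pigeons” counts from the example above. The function names `idf_plain` and `idf_smooth` are made up for illustration, and the sketch assumes the natural logarithm, which is what scikit-learn uses.

```python
import math

def idf_plain(n_docs, doc_freq):
    """Unsmoothed idf: log(N / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

def idf_smooth(n_docs, doc_freq):
    """Smoothed idf: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 14  # total number of documents in the example above
for term, doc_freq in [("said", 13), ("pigeons", 2)]:
    plain = idf_plain(n_docs, doc_freq)
    smooth = idf_smooth(n_docs, doc_freq)
    print(f"{term}: plain idf = {plain:.3f}, smoothed idf = {smooth:.3f}")
```

With either formula, the rarer word (“pigeons”) ends up with the larger idf, so each of its occurrences counts for more in the final tf-idf score.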

TF-IDF with scikit-learn

scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.

Install scikit-learn

!pip install scikit-learn

Import necessary modules and libraries

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
from pathlib import Path  
import glob

Pandas Review

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

We’re also going to import pandas and change its default display setting. And we’re going to import two libraries that will help us work with files and the file system: pathlib and glob.

Set Directory Path

Below we’re setting the directory filepath that contains all the text files that we want to analyze.

directory_path = "../texts/history/US_Inaugural_Addresses/"

Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the inaugural address titles.

text_files = glob.glob(f"{directory_path}/*.txt")

text_files

['../texts/history/US_Inaugural_Addresses/13_van_buren_1837.txt',
 '../texts/history/US_Inaugural_Addresses/47_nixon_1973.txt',
 '../texts/history/US_Inaugural_Addresses/50_reagan_1985.txt',
 '../texts/history/US_Inaugural_Addresses/53_clinton_1997.txt',
 '../texts/history/US_Inaugural_Addresses/17_pierce_1853.txt',
 '../texts/history/US_Inaugural_Addresses/14_harrison_1841.txt',
 '../texts/history/US_Inaugural_Addresses/56_obama_2009.txt',
 '../texts/history/US_Inaugural_Addresses/25_cleveland_1885.txt',
 '../texts/history/US_Inaugural_Addresses/03_adams_john_1797.txt',
 '../texts/history/US_Inaugural_Addresses/12_jackson_1833.txt',
 '../texts/history/US_Inaugural_Addresses/11_jackson_1829.txt',
 '../texts/history/US_Inaugural_Addresses/36_hoover_1929.txt',
 '../texts/history/US_Inaugural_Addresses/45_johnson_1965.txt',
 '../texts/history/US_Inaugural_Addresses/51_bush_george_h_w_1989.txt',
 '../texts/history/US_Inaugural_Addresses/21_grant_1869.txt',
 '../texts/history/US_Inaugural_Addresses/41_truman_1949.txt',
 '../texts/history/US_Inaugural_Addresses/33_wilson_1917.txt',
 '../texts/history/US_Inaugural_Addresses/49_reagan_1981.txt',
 '../texts/history/US_Inaugural_Addresses/30_roosevelt_theodore_1905.txt',
 '../texts/history/US_Inaugural_Addresses/07_madison_1813.txt',
 '../texts/history/US_Inaugural_Addresses/09_monroe_1821.txt',
 '../texts/history/US_Inaugural_Addresses/48_carter_1977.txt',
 '../texts/history/US_Inaugural_Addresses/32_wilson_1913.txt',
 '../texts/history/US_Inaugural_Addresses/19_lincoln_1861.txt',
 '../texts/history/US_Inaugural_Addresses/01_washington_1789.txt',
 '../texts/history/US_Inaugural_Addresses/29_mckinley_1901.txt',
 '../texts/history/US_Inaugural_Addresses/04_jefferson_1801.txt',
 '../texts/history/US_Inaugural_Addresses/34_harding_1921.txt',
 '../texts/history/US_Inaugural_Addresses/52_clinton_1993.txt',
 '../texts/history/US_Inaugural_Addresses/35_coolidge_1925.txt',
 '../texts/history/US_Inaugural_Addresses/39_roosevelt_franklin_1941.txt',
 '../texts/history/US_Inaugural_Addresses/28_mckinley_1897.txt',
 '../texts/history/US_Inaugural_Addresses/24_garfield_1881.txt',
 '../texts/history/US_Inaugural_Addresses/22_grant_1873.txt',
 '../texts/history/US_Inaugural_Addresses/15_polk_1845.txt',
 '../texts/history/US_Inaugural_Addresses/54_bush_george_w_2001.txt',
 '../texts/history/US_Inaugural_Addresses/02_washington_1793.txt',
 '../texts/history/US_Inaugural_Addresses/38_roosevelt_franklin_1937.txt',
 '../texts/history/US_Inaugural_Addresses/37_roosevelt_franklin_1933.txt',
 '../texts/history/US_Inaugural_Addresses/18_buchanan_1857.txt',
 '../texts/history/US_Inaugural_Addresses/16_taylor_1849.txt',
 '../texts/history/US_Inaugural_Addresses/05_jefferson_1805.txt',
 '../texts/history/US_Inaugural_Addresses/26_harrison_1889.txt',
 '../texts/history/US_Inaugural_Addresses/44_kennedy_1961.txt',
 '../texts/history/US_Inaugural_Addresses/23_hayes_1877.txt',
 '../texts/history/US_Inaugural_Addresses/20_lincoln_1865.txt',
 '../texts/history/US_Inaugural_Addresses/57_obama_2013.txt',
 '../texts/history/US_Inaugural_Addresses/10_adams_john_quincy_1825.txt',
 '../texts/history/US_Inaugural_Addresses/55_bush_george_w_2005.txt',
 '../texts/history/US_Inaugural_Addresses/27_cleveland_1893.txt',
 '../texts/history/US_Inaugural_Addresses/46_nixon_1969.txt',
 '../texts/history/US_Inaugural_Addresses/42_eisenhower_1953.txt',
 '../texts/history/US_Inaugural_Addresses/40_roosevelt_franklin_1945.txt',
 '../texts/history/US_Inaugural_Addresses/43_eisenhower_1957.txt',
 '../texts/history/US_Inaugural_Addresses/08_monroe_1817.txt',
 '../texts/history/US_Inaugural_Addresses/06_madison_1809.txt',
 '../texts/history/US_Inaugural_Addresses/58_trump_2017.txt',
 '../texts/history/US_Inaugural_Addresses/31_taft_1909.txt']

text_titles = [Path(text).stem for text in text_files]

text_titles

['13_van_buren_1837',
 '47_nixon_1973',
 '50_reagan_1985',
 '53_clinton_1997',
 '17_pierce_1853',
 '14_harrison_1841',
 '56_obama_2009',
 '25_cleveland_1885',
 '03_adams_john_1797',
 '12_jackson_1833',
 '11_jackson_1829',
 '36_hoover_1929',
 '45_johnson_1965',
 '51_bush_george_h_w_1989',
 '21_grant_1869',
 '41_truman_1949',
 '33_wilson_1917',
 '49_reagan_1981',
 '30_roosevelt_theodore_1905',
 '07_madison_1813',
 '09_monroe_1821',
 '48_carter_1977',
 '32_wilson_1913',
 '19_lincoln_1861',
 '01_washington_1789',
 '29_mckinley_1901',
 '04_jefferson_1801',
 '34_harding_1921',
 '52_clinton_1993',
 '35_coolidge_1925',
 '39_roosevelt_franklin_1941',
 '28_mckinley_1897',
 '24_garfield_1881',
 '22_grant_1873',
 '15_polk_1845',
 '54_bush_george_w_2001',
 '02_washington_1793',
 '38_roosevelt_franklin_1937',
 '37_roosevelt_franklin_1933',
 '18_buchanan_1857',
 '16_taylor_1849',
 '05_jefferson_1805',
 '26_harrison_1889',
 '44_kennedy_1961',
 '23_hayes_1877',
 '20_lincoln_1865',
 '57_obama_2013',
 '10_adams_john_quincy_1825',
 '55_bush_george_w_2005',
 '27_cleveland_1893',
 '46_nixon_1969',
 '42_eisenhower_1953',
 '40_roosevelt_franklin_1945',
 '43_eisenhower_1957',
 '08_monroe_1817',
 '06_madison_1809',
 '58_trump_2017',
 '31_taft_1909']
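As the listings above show, glob returns the files in no particular order. If you would rather have them in chronological order from the start, one option is to sort the paths, which works here because every filename begins with a zero-padded number. The sketch below is only an illustration: the helper name `list_text_files` is made up, and it runs on a temporary directory with three empty placeholder files rather than on the real collection.

```python
from pathlib import Path
import tempfile

def list_text_files(directory):
    """Return the .txt files in `directory` as Path objects, sorted by file name."""
    return sorted(Path(directory).glob("*.txt"))

# Demonstrate on a throwaway directory containing three empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["56_obama_2009.txt", "01_washington_1789.txt", "13_van_buren_1837.txt"]:
        (Path(tmp) / name).touch()
    demo_titles = [path.stem for path in list_text_files(tmp)]

print(demo_titles)
```

The function returns Path objects; wrap each one in str() if you need plain strings like the ones glob.glob produces.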

Calculate tf–idf

To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run TfidfVectorizer is with smoothing (smooth_idf = True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
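To see what smoothing and L2 normalization do, here is a from-scratch sketch in plain Python. It is not scikit-learn’s code; it assumes raw counts for term frequency, the smoothed idf formula given earlier, and division of each document’s scores by their Euclidean length. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: score} dict per document (smoothed idf, L2-normalized).

    Assumes every document contains at least one token.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur at least once?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for doc in tokenized_docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        # L2 normalization: divide every score by the length of the document's vector
        length = math.sqrt(sum(score ** 2 for score in raw.values()))
        rows.append({term: score / length for term, score in raw.items()})
    return rows

short_doc = ["people", "government"]
long_doc = short_doc * 5          # same words in the same proportions, five times longer
other_doc = ["war", "people"]
rows = tfidf_rows([short_doc, long_doc, other_doc])

for row in rows:
    print({term: round(score, 3) for term, score in sorted(row.items())})
```

The short and long documents get identical scores even though one is five times longer, which is the sense in which normalization accounts for differences in text length. In the third document, “war” outscores “people” because “war” appears in only one document.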

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our text_files

tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows

tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

Add a row for document frequency, aka the number of documents in which each word appears

tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()

tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)

	government	borders	people	obama	war	honor	foreign	men	women	children
00_Document Frequency	53.00	5.00	56.00	3.00	45.00	32.00	32.00	47.00	15.00	22.00
01_washington_1789	0.11	0.00	0.05	0.00	0.00	0.00	0.00	0.02	0.00	0.00
02_washington_1793	0.06	0.00	0.05	0.00	0.00	0.08	0.00	0.00	0.00	0.00
03_adams_john_1797	0.16	0.00	0.19	0.00	0.01	0.10	0.12	0.04	0.00	0.00
04_jefferson_1801	0.16	0.00	0.01	0.00	0.01	0.04	0.00	0.04	0.00	0.00
05_jefferson_1805	0.03	0.00	0.00	0.00	0.04	0.00	0.06	0.01	0.00	0.02
06_madison_1809	0.00	0.00	0.02	0.00	0.02	0.05	0.05	0.00	0.00	0.00
07_madison_1813	0.04	0.00	0.04	0.00	0.25	0.02	0.02	0.00	0.00	0.00
08_monroe_1817	0.17	0.00	0.11	0.00	0.09	0.01	0.10	0.04	0.00	0.00
09_monroe_1821	0.08	0.00	0.06	0.00	0.11	0.02	0.04	0.01	0.00	0.01
10_adams_john_quincy_1825	0.15	0.00	0.06	0.00	0.05	0.01	0.08	0.03	0.00	0.00
11_jackson_1829	0.10	0.00	0.06	0.00	0.02	0.02	0.07	0.02	0.00	0.00
12_jackson_1833	0.21	0.00	0.14	0.00	0.00	0.00	0.02	0.00	0.00	0.00
13_van_buren_1837	0.12	0.00	0.14	0.00	0.02	0.02	0.06	0.02	0.00	0.01
14_harrison_1841	0.14	0.00	0.14	0.00	0.01	0.02	0.03	0.03	0.00	0.00
15_polk_1845	0.26	0.00	0.08	0.00	0.03	0.01	0.09	0.02	0.00	0.01
16_taylor_1849	0.12	0.00	0.05	0.00	0.00	0.02	0.05	0.00	0.00	0.00
17_pierce_1853	0.08	0.00	0.05	0.00	0.00	0.02	0.04	0.01	0.00	0.03
18_buchanan_1857	0.12	0.00	0.11	0.00	0.08	0.01	0.04	0.03	0.00	0.05
19_lincoln_1861	0.12	0.00	0.13	0.00	0.02	0.00	0.02	0.00	0.00	0.00
20_lincoln_1865	0.02	0.00	0.00	0.00	0.27	0.00	0.00	0.04	0.00	0.00
21_grant_1869	0.05	0.00	0.03	0.00	0.02	0.05	0.05	0.02	0.00	0.00
22_grant_1873	0.06	0.00	0.10	0.00	0.05	0.02	0.00	0.00	0.00	0.00
23_hayes_1877	0.17	0.00	0.08	0.00	0.00	0.00	0.04	0.02	0.00	0.00
24_garfield_1881	0.19	0.00	0.16	0.00	0.05	0.00	0.00	0.01	0.00	0.04
25_cleveland_1885	0.21	0.00	0.21	0.00	0.00	0.00	0.05	0.01	0.00	0.00
26_harrison_1889	0.06	0.00	0.17	0.00	0.02	0.03	0.01	0.04	0.00	0.00
27_cleveland_1893	0.15	0.00	0.22	0.00	0.00	0.00	0.00	0.04	0.00	0.00
28_mckinley_1897	0.16	0.00	0.16	0.00	0.05	0.03	0.06	0.02	0.00	0.00
29_mckinley_1901	0.15	0.00	0.12	0.00	0.08	0.04	0.01	0.01	0.00	0.00
30_roosevelt_theodore_1905	0.05	0.00	0.10	0.00	0.00	0.00	0.00	0.08	0.00	0.10
31_taft_1909	0.12	0.00	0.03	0.00	0.03	0.01	0.03	0.03	0.00	0.00
32_wilson_1913	0.11	0.00	0.02	0.00	0.00	0.00	0.00	0.14	0.10	0.06
33_wilson_1917	0.00	0.00	0.08	0.00	0.07	0.00	0.00	0.05	0.00	0.00
34_harding_1921	0.08	0.00	0.05	0.00	0.12	0.00	0.00	0.01	0.00	0.00
35_coolidge_1925	0.10	0.00	0.10	0.00	0.02	0.01	0.02	0.02	0.02	0.00
36_hoover_1929	0.20	0.04	0.10	0.00	0.01	0.00	0.01	0.03	0.03	0.01
37_roosevelt_franklin_1933	0.03	0.00	0.08	0.00	0.02	0.02	0.03	0.02	0.00	0.00
38_roosevelt_franklin_1937	0.18	0.03	0.12	0.00	0.01	0.00	0.00	0.10	0.07	0.02
39_roosevelt_franklin_1941	0.05	0.00	0.08	0.00	0.00	0.00	0.00	0.07	0.03	0.00
40_roosevelt_franklin_1945	0.00	0.00	0.02	0.00	0.05	0.03	0.00	0.10	0.05	0.04
41_truman_1949	0.03	0.00	0.10	0.00	0.02	0.01	0.01	0.06	0.00	0.00
42_eisenhower_1953	0.01	0.00	0.10	0.00	0.04	0.03	0.00	0.07	0.00	0.00
43_eisenhower_1957	0.00	0.00	0.10	0.00	0.01	0.05	0.00	0.07	0.00	0.00
44_kennedy_1961	0.00	0.00	0.01	0.00	0.06	0.00	0.00	0.01	0.00	0.00
45_johnson_1965	0.01	0.00	0.11	0.00	0.01	0.00	0.02	0.03	0.00	0.05
46_nixon_1969	0.05	0.00	0.13	0.00	0.03	0.03	0.00	0.01	0.00	0.00
47_nixon_1973	0.10	0.00	0.06	0.00	0.03	0.01	0.00	0.00	0.00	0.02
48_carter_1977	0.06	0.00	0.08	0.00	0.02	0.00	0.02	0.00	0.00	0.00
49_reagan_1981	0.16	0.00	0.08	0.00	0.01	0.00	0.00	0.02	0.04	0.09
50_reagan_1985	0.16	0.00	0.14	0.00	0.01	0.01	0.00	0.03	0.04	0.00
51_bush_george_h_w_1989	0.05	0.00	0.06	0.00	0.03	0.00	0.01	0.04	0.06	0.07
52_clinton_1993	0.05	0.00	0.13	0.00	0.03	0.00	0.02	0.01	0.02	0.06
53_clinton_1997	0.09	0.00	0.09	0.00	0.01	0.00	0.00	0.00	0.02	0.10
54_bush_george_w_2001	0.05	0.00	0.01	0.00	0.01	0.00	0.00	0.00	0.00	0.08
55_bush_george_w_2005	0.03	0.06	0.05	0.00	0.00	0.04	0.00	0.02	0.04	0.00
56_obama_2009	0.03	0.03	0.07	0.03	0.02	0.01	0.00	0.04	0.08	0.05
57_obama_2013	0.04	0.00	0.11	0.04	0.04	0.00	0.00	0.04	0.04	0.06
58_trump_2017	0.04	0.11	0.11	0.12	0.00	0.00	0.05	0.03	0.05	0.04

Let’s drop “00_Document Frequency” since we were just using it for illustration purposes.

tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')

Let’s reorganize the DataFrame so that the words are in rows rather than columns.

tfidf_df.stack().reset_index()

	level_0	level_1	0
0	13_van_buren_1837	000	0.000000
1	13_van_buren_1837	03	0.011681
2	13_van_buren_1837	04	0.011924
3	13_van_buren_1837	05	0.000000
4	13_van_buren_1837	100	0.000000
...	...	...	...
521937	31_taft_1909	zachary	0.000000
521938	31_taft_1909	zeal	0.000000
521939	31_taft_1909	zealous	0.000000
521940	31_taft_1909	zealously	0.000000
521941	31_taft_1909	zone	0.000000

521942 rows × 3 columns

tfidf_df = tfidf_df.stack().reset_index()

tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})

To find out the top 10 words with the highest tf–idf for every address, we’re going to sort by document and tfidf score and then groupby document and take the first 10 values.
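The sort-then-group logic can be sketched without pandas on a handful of records. Everything here is illustrative: the helper `top_terms`, the document names, and the scores are made up and are not taken from the real results.

```python
from collections import defaultdict

def top_terms(records, n=10):
    """Group (document, term, score) records by document and keep the n highest scores."""
    by_document = defaultdict(list)
    for document, term, score in records:
        by_document[document].append((term, score))
    # Documents in alphabetical order; within each, terms from highest to lowest score
    return {
        document: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for document, pairs in sorted(by_document.items())
    }

# Made-up scores for illustration only
toy_records = [
    ("doc_b", "war", 0.25), ("doc_b", "people", 0.04), ("doc_b", "british", 0.22),
    ("doc_a", "government", 0.11), ("doc_a", "public", 0.10), ("doc_a", "people", 0.05),
]
top_two = top_terms(toy_records, n=2)
print(top_two)
```

The pandas version does the same thing in one chained expression: sorting puts each document’s highest scores first, and head(10) after the groupby keeps the first 10 rows of each group.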

tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

	document	term	tfidf
219683	01_washington_1789	government	0.113681
220084	01_washington_1789	immutable	0.103883
220151	01_washington_1789	impressions	0.103883
222313	01_washington_1789	providential	0.103883
221607	01_washington_1789	ought	0.103728
222327	01_washington_1789	public	0.103102
222093	01_washington_1789	present	0.097516
222365	01_washington_1789	qualifications	0.096372
221787	01_washington_1789	peculiarly	0.090546
216629	01_washington_1789	article	0.085786
323983	02_washington_1793	1793	0.229350
324608	02_washington_1793	arrive	0.229350
332541	02_washington_1793	upbraidings	0.229350
328215	02_washington_1793	incurring	0.208140
332665	02_washington_1793	violated	0.208140
332837	02_washington_1793	willingly	0.208140
328333	02_washington_1793	injunctions	0.193091
328670	02_washington_1793	knowingly	0.193091
330122	02_washington_1793	previous	0.193091
332875	02_washington_1793	witnesses	0.193091
77815	03_adams_john_1797	people	0.191180
75699	03_adams_john_1797	government	0.160937
77953	03_adams_john_1797	pleasing	0.147066
75456	03_adams_john_1797	foreign	0.116874
77350	03_adams_john_1797	nations	0.114480
80705	03_adams_john_1797	virtuous	0.110813
76010	03_adams_john_1797	houses	0.110300
76793	03_adams_john_1797	legislatures	0.110300
73753	03_adams_john_1797	constitution	0.104525
75977	03_adams_john_1797	honor	0.102265
237681	04_jefferson_1801	government	0.155691
240148	04_jefferson_1801	principle	0.130113
238791	04_jefferson_1801	let	0.117970
241040	04_jefferson_1801	safety	0.108427
238987	04_jefferson_1801	man	0.106841
242072	04_jefferson_1801	thousandth	0.104513
237956	04_jefferson_1801	honest	0.101696
237304	04_jefferson_1801	fellow	0.097240
240870	04_jefferson_1801	retire	0.094848
239551	04_jefferson_1801	opinion	0.092587
375310	05_jefferson_1805	public	0.180456
372220	05_jefferson_1805	false	0.135863
376581	05_jefferson_1805	state	0.121514
377799	05_jefferson_1805	whatsoever	0.116886
373835	05_jefferson_1805	limits	0.107085
370331	05_jefferson_1805	citizens	0.106592
375449	05_jefferson_1805	reason	0.104438
370444	05_jefferson_1805	comforts	0.101880
375094	05_jefferson_1805	press	0.101549
372101	05_jefferson_1805	expenses	0.096524
499129	06_madison_1809	improvements	0.152559
495873	06_madison_1809	belligerent	0.123161
501296	06_madison_1809	public	0.122235
500303	06_madison_1809	nations	0.104588
501678	06_madison_1809	rendered	0.101706
495732	06_madison_1809	authorities	0.089155
495743	06_madison_1809	avail	0.089155
497991	06_madison_1809	examples	0.089155
496848	06_madison_1809	councils	0.085894
500508	06_madison_1809	ones	0.085894
179746	07_madison_1813	war	0.254249
172088	07_madison_1813	british	0.222972
176049	07_madison_1813	massacre	0.119009
172191	07_madison_1813	captives	0.108003
172963	07_madison_1813	cruel	0.108003
177162	07_madison_1813	prisoners	0.108003
178083	07_madison_1813	savage	0.108003
173727	07_madison_1813	element	0.085005
173828	07_madison_1813	enemy	0.085005
174967	07_madison_1813	honorable	0.084762
493573	08_monroe_1817	states	0.184195
489653	08_monroe_1817	government	0.174125
489687	08_monroe_1817	great	0.160658
494415	08_monroe_1817	union	0.117193
491769	08_monroe_1817	people	0.112825
494418	08_monroe_1817	united	0.112076
487979	08_monroe_1817	dangers	0.108567
491313	08_monroe_1817	naval	0.104713
489410	08_monroe_1817	foreign	0.103460
492121	08_monroe_1817	principles	0.097766
183721	09_monroe_1821	great	0.173751
187607	09_monroe_1821	states	0.137384
186897	09_monroe_1821	revenue	0.115018
188745	09_monroe_1821	war	0.113785
185730	09_monroe_1821	parties	0.109318
188452	09_monroe_1821	united	0.108029
181484	09_monroe_1821	commerce	0.105001
183432	09_monroe_1821	force	0.102947
183478	09_monroe_1821	fortifications	0.098741
188014	09_monroe_1821	term	0.094808
431422	10_adams_john_quincy_1825	union	0.257335
426660	10_adams_john_quincy_1825	government	0.147726
426595	10_adams_john_quincy_1825	general	0.109221
429922	10_adams_john_quincy_1825	rights	0.096300
425471	10_adams_john_quincy_1825	dissensions	0.095289
429304	10_adams_john_quincy_1825	public	0.094573
424714	10_adams_john_quincy_1825	constitution	0.090300
428754	10_adams_john_quincy_1825	peace	0.088183
424871	10_adams_john_quincy_1825	country	0.086898
428791	10_adams_john_quincy_1825	performance	0.085565
96341	11_jackson_1829	public	0.160747
93633	11_jackson_1829	generally	0.122711
92371	11_jackson_1829	diffidence	0.112691
92130	11_jackson_1829	defending	0.105878
97272	11_jackson_1829	shall	0.104933
96907	11_jackson_1829	revenue	0.102776
98938	11_jackson_1829	worth	0.100312
93697	11_jackson_1829	government	0.099698
93306	11_jackson_1829	federal	0.093100
96034	11_jackson_1829	power	0.092071
89460	12_jackson_1833	union	0.212766
84698	12_jackson_1833	government	0.207559
88618	12_jackson_1833	states	0.141549
86814	12_jackson_1833	people	0.136557
87114	12_jackson_1833	preservation	0.128319
84633	12_jackson_1833	general	0.125422
84083	12_jackson_1833	exercise	0.119275
85236	12_jackson_1833	inculcate	0.116720
87285	12_jackson_1833	proportion	0.116720
87038	12_jackson_1833	powers	0.113757
4427	13_van_buren_1837	institutions	0.186889
5823	13_van_buren_1837	people	0.138465
3707	13_van_buren_1837	government	0.116561
7872	13_van_buren_1837	supposed	0.109949
1918	13_van_buren_1837	country	0.109276
243	13_van_buren_1837	actual	0.096382
3144	13_van_buren_1837	experience	0.093444
267	13_van_buren_1837	adherence	0.083833
1639	13_van_buren_1837	conduct	0.081635
5578	13_van_buren_1837	opinions	0.081597
51039	14_harrison_1841	power	0.204207
46756	14_harrison_1841	constitution	0.183336
48081	14_harrison_1841	executive	0.157153
50818	14_harrison_1841	people	0.141584
48702	14_harrison_1841	government	0.141142
52008	14_harrison_1841	roman	0.110538
52622	14_harrison_1841	states	0.108621
46367	14_harrison_1841	citizens	0.105857
46300	14_harrison_1841	character	0.102640
52617	14_harrison_1841	state	0.094976
314435	15_polk_1845	union	0.259054
309673	15_polk_1845	government	0.256967
313593	15_polk_1845	states	0.218122
314021	15_polk_1845	texas	0.199846
312883	15_polk_1845	revenue	0.146541
312013	15_polk_1845	powers	0.124655
312287	15_polk_1845	protection	0.107385
307727	15_polk_1845	constitution	0.106528
310443	15_polk_1845	interests	0.105054
309149	15_polk_1845	extended	0.090179
367242	16_taylor_1849	shall	0.266204
363667	16_taylor_1849	government	0.118031
362624	16_taylor_1849	duties	0.117893
365432	16_taylor_1849	object	0.104293
361650	16_taylor_1849	congress	0.103865
366328	16_taylor_1849	purity	0.101793
368626	16_taylor_1849	vested	0.101793
365066	16_taylor_1849	measures	0.101637
361878	16_taylor_1849	country	0.101169
360297	16_taylor_1849	affections	0.097017
39837	17_pierce_1853	hardly	0.114001
42040	17_pierce_1853	power	0.102456
42011	17_pierce_1853	position	0.086643
37758	17_pierce_1853	constitutional	0.086105
39123	17_pierce_1853	expect	0.084436
39703	17_pierce_1853	government	0.084048
36538	17_pierce_1853	apparent	0.080332
42621	17_pierce_1853	regarded	0.080332
43278	17_pierce_1853	shall	0.079615
40859	17_pierce_1853	like	0.079229
358588	18_buchanan_1857	states	0.208199
352722	18_buchanan_1857	constitution	0.188573
358243	18_buchanan_1857	shall	0.161784
357359	18_buchanan_1857	question	0.157007
359805	18_buchanan_1857	whilst	0.141119
359006	18_buchanan_1857	territory	0.140852
359430	18_buchanan_1857	union	0.126444
354668	18_buchanan_1857	government	0.119554
352651	18_buchanan_1857	congress	0.118357
356784	18_buchanan_1857	people	0.105501
208738	19_lincoln_1861	constitution	0.214478
215446	19_lincoln_1861	union	0.203738
208210	19_lincoln_1861	case	0.152422
214604	19_lincoln_1861	states	0.144861
212181	19_lincoln_1861	minority	0.131514
212800	19_lincoln_1861	people	0.130763
208372	19_lincoln_1861	clause	0.125738
210684	19_lincoln_1861	government	0.123837
214259	19_lincoln_1861	shall	0.123099
211735	19_lincoln_1861	law	0.122872
413720	20_lincoln_1865	war	0.267217
410490	20_lincoln_1865	offenses	0.234524
413868	20_lincoln_1865	woe	0.234524
408646	20_lincoln_1865	god	0.151269
410489	20_lincoln_1865	offense	0.141890
413830	20_lincoln_1865	wills	0.141890
405466	20_lincoln_1865	answered	0.131631
412370	20_lincoln_1865	slaves	0.123674
413424	20_lincoln_1865	union	0.114955
405400	20_lincoln_1865	altogether	0.111675
128578	21_grant_1869	dollar	0.270439
131782	21_grant_1869	paying	0.162263
128041	21_grant_1869	deal	0.152454
133513	21_grant_1869	specie	0.152454
128056	21_grant_1869	debt	0.135097
127904	21_grant_1869	country	0.127604
126308	21_grant_1869	advisable	0.116606
130751	21_grant_1869	laws	0.115834
131784	21_grant_1869	payments	0.108175
131780	21_grant_1869	pay	0.098658
303269	22_grant_1873	proposition	0.187222
299570	22_grant_1873	domingo	0.177516
304058	22_grant_1873	santo	0.177516
305193	22_grant_1873	transit	0.177516
305012	22_grant_1873	territory	0.121158
300160	22_grant_1873	extermination	0.118344
304617	22_grant_1873	steam	0.118344
304969	22_grant_1873	telegraph	0.118344
298885	22_grant_1873	country	0.117529
300153	22_grant_1873	extension	0.116618
397874	23_hayes_1877	country	0.186357
399663	23_hayes_1877	government	0.167722
396868	23_hayes_1877	behalf	0.128316
402307	23_hayes_1877	public	0.123944
401944	23_hayes_1877	political	0.121034
403583	23_hayes_1877	states	0.113587
401713	23_hayes_1877	party	0.112549
398461	23_hayes_1877	dispute	0.112503
401706	23_hayes_1877	parties	0.109554
402561	23_hayes_1877	reform	0.104365
291675	24_garfield_1881	government	0.186855
293791	24_garfield_1881	people	0.162132
289729	24_garfield_1881	constitution	0.158292
295595	24_garfield_1881	states	0.135047
296437	24_garfield_1881	union	0.132321
295778	24_garfield_1881	suffrage	0.119992
293368	24_garfield_1881	negro	0.118782
288756	24_garfield_1881	authority	0.117232
289658	24_garfield_1881	congress	0.112598
292726	24_garfield_1881	law	0.103639
68816	25_cleveland_1885	people	0.210468
66700	25_cleveland_1885	government	0.209164
68744	25_cleveland_1885	partisan	0.169436
69344	25_cleveland_1885	public	0.163662
70275	25_cleveland_1885	shall	0.129498
64754	25_cleveland_1885	constitution	0.127856
67470	25_cleveland_1885	interests	0.118207
66197	25_cleveland_1885	extravagance	0.111416
64363	25_cleveland_1885	citizen	0.102825
70708	25_cleveland_1885	strife	0.101661
383781	26_harrison_1889	people	0.172358
382723	26_harrison_1889	laws	0.154418
385585	26_harrison_1889	states	0.138614
378804	26_harrison_1889	ballot	0.137159
384309	26_harrison_1889	public	0.128566
383119	26_harrison_1889	methods	0.119162
385240	26_harrison_1889	shall	0.118483
381527	26_harrison_1889	friendly	0.104267
380968	26_harrison_1889	european	0.103349
379719	26_harrison_1889	constitution	0.089360
446774	27_cleveland_1893	people	0.221563
444658	27_cleveland_1893	government	0.148364
444533	27_cleveland_1893	frugality	0.128050
447302	27_cleveland_1893	public	0.102520
448203	27_cleveland_1893	service	0.101813
448817	27_cleveland_1893	support	0.099946
441413	27_cleveland_1893	american	0.097267
441192	27_cleveland_1893	activity	0.095964
444659	27_cleveland_1893	governmental	0.095964
442870	27_cleveland_1893	countrymen	0.088564
280659	28_mckinley_1897	congress	0.188773
285886	28_mckinley_1897	revenue	0.168489
284792	28_mckinley_1897	people	0.161797
282676	28_mckinley_1897	government	0.156633
283868	28_mckinley_1897	loans	0.149356
283766	28_mckinley_1897	legislation	0.126367
285320	28_mckinley_1897	public	0.107057
280123	28_mckinley_1897	business	0.106759
282710	28_mckinley_1897	great	0.105322
285900	28_mckinley_1897	revision	0.099571
229571	29_mckinley_1901	islands	0.216480
226964	29_mckinley_1901	cuba	0.206329
228682	29_mckinley_1901	government	0.153681
228061	29_mckinley_1901	executive	0.147843
229329	29_mckinley_1901	inhabitants	0.147374
226665	29_mckinley_1901	congress	0.141999
230798	29_mckinley_1901	people	0.116839
232602	29_mckinley_1901	states	0.102186
233447	29_mckinley_1901	united	0.100439
231074	29_mckinley_1901	preparation	0.097925
168610	30_roosevelt_theodore_1905	regards	0.199163
168177	30_roosevelt_theodore_1905	problems	0.182463
169958	30_roosevelt_theodore_1905	tasks	0.150068
162600	30_roosevelt_theodore_1905	aright	0.146306
168765	30_roosevelt_theodore_1905	republic	0.121428
166828	30_roosevelt_theodore_1905	life	0.118701
163230	30_roosevelt_theodore_1905	cause	0.116483
165199	30_roosevelt_theodore_1905	faced	0.115730
163619	30_roosevelt_theodore_1905	conditions	0.115373
170877	30_roosevelt_theodore_1905	wish	0.106771
517444	31_taft_1909	interstate	0.206957
514097	31_taft_1909	business	0.201378
520913	31_taft_1909	tariff	0.154802
518343	31_taft_1909	negro	0.153669
520442	31_taft_1909	south	0.129384
516650	31_taft_1909	government	0.121451
519229	31_taft_1909	proper	0.114684
519354	31_taft_1909	race	0.113413
516265	31_taft_1909	feeling	0.111255
514129	31_taft_1909	canal	0.110879
201719	32_wilson_1913	great	0.158659
203109	32_wilson_1913	men	0.142924
201245	32_wilson_1913	familiar	0.141669
205651	32_wilson_1913	stirred	0.141669
205718	32_wilson_1913	studied	0.141669
206055	32_wilson_1913	things	0.123915
202644	32_wilson_1913	justice	0.105759
201685	32_wilson_1913	government	0.105520
202824	32_wilson_1913	life	0.102168
202901	32_wilson_1913	look	0.100095
152880	33_wilson_1917	wished	0.228593
145888	33_wilson_1917	counsel	0.174639
150355	33_wilson_1917	purpose	0.152933
144219	33_wilson_1917	action	0.149960
151266	33_wilson_1917	shall	0.134404
152075	33_wilson_1917	thought	0.126568
151593	33_wilson_1917	stand	0.121313
151243	33_wilson_1917	set	0.111215
149976	33_wilson_1917	politics	0.108510
146621	33_wilson_1917	drawn	0.104783
251911	34_harding_1921	world	0.196268
244352	34_harding_1921	civilization	0.157095
243434	34_harding_1921	america	0.155684
251738	34_harding_1921	war	0.120631
249645	34_harding_1921	relationship	0.118846
249756	34_harding_1921	republic	0.117160
248578	34_harding_1921	order	0.110128
251362	34_harding_1921	understanding	0.109915
248386	34_harding_1921	new	0.097567
243442	34_harding_1921	amid	0.095077
262889	35_coolidge_1925	country	0.120814
266602	35_coolidge_1925	ought	0.116721
267750	35_coolidge_1925	represents	0.114495
268950	35_coolidge_1925	tax	0.112826
264712	35_coolidge_1925	great	0.109908
267259	35_coolidge_1925	property	0.108285
266728	35_coolidge_1925	party	0.107300
268585	35_coolidge_1925	stands	0.107257
266772	35_coolidge_1925	peace	0.104170
266794	35_coolidge_1925	people	0.101306
106831	36_hoover_1929	sup	0.296865
102696	36_hoover_1929	government	0.202690
101845	36_hoover_1929	enforcement	0.194371
99049	36_hoover_1929	18th	0.134706
105231	36_hoover_1929	progress	0.132406
102305	36_hoover_1929	federal	0.126183
103050	36_hoover_1929	ideals	0.113418
100143	36_hoover_1929	business	0.108323
103754	36_hoover_1929	laws	0.107051
104790	36_hoover_1929	peace	0.103883
345889	37_roosevelt_franklin_1933	helped	0.215644
346734	37_roosevelt_franklin_1933	leadership	0.191084
349671	37_roosevelt_franklin_1933	stricken	0.129390
344742	37_roosevelt_franklin_1933	emergency	0.123225
344399	37_roosevelt_franklin_1933	discipline	0.117971
348804	37_roosevelt_franklin_1933	respects	0.117971
347233	37_roosevelt_franklin_1933	money	0.113371
347316	37_roosevelt_franklin_1933	national	0.110570
348520	37_roosevelt_franklin_1933	recovery	0.102349
342197	37_roosevelt_franklin_1933	action	0.097007
335172	38_roosevelt_franklin_1937	democracy	0.178041
336670	38_roosevelt_franklin_1937	government	0.177222
338146	38_roosevelt_franklin_1937	millions	0.140722
338671	38_roosevelt_franklin_1937	paint	0.121461
338786	38_roosevelt_franklin_1937	people	0.115789
335664	38_roosevelt_franklin_1937	economic	0.114184
339954	38_roosevelt_franklin_1937	road	0.112944
339205	38_roosevelt_franklin_1937	progress	0.104463
335255	38_roosevelt_franklin_1937	despair	0.100302
338316	38_roosevelt_franklin_1937	nation	0.099688
272179	39_roosevelt_franklin_1941	democracy	0.244486
274674	39_roosevelt_franklin_1941	know	0.189060
277494	39_roosevelt_franklin_1941	speaks	0.183385
271040	39_roosevelt_franklin_1941	br	0.163241
275323	39_roosevelt_franklin_1941	nation	0.162241
270431	39_roosevelt_franklin_1941	america	0.140133
274816	39_roosevelt_franklin_1941	life	0.117815
277525	39_roosevelt_franklin_1941	spirit	0.114445
273521	39_roosevelt_franklin_1941	freedom	0.109295
270043	39_roosevelt_franklin_1941	1941	0.108911
472725	40_roosevelt_franklin_1945	learned	0.300396
475997	40_roosevelt_franklin_1945	test	0.194731
468022	40_roosevelt_franklin_1945	1945	0.189849
475230	40_roosevelt_franklin_1945	shall	0.173637
476211	40_roosevelt_franklin_1945	trend	0.172292
473184	40_roosevelt_franklin_1945	mistakes	0.159835
473749	40_roosevelt_franklin_1945	peace	0.159442
476096	40_roosevelt_franklin_1945	today	0.154299
476538	40_roosevelt_franklin_1945	upward	0.150172
471569	40_roosevelt_franklin_1945	gain	0.142278
143923	41_truman_1949	world	0.196051
140343	41_truman_1949	nations	0.194029
141225	41_truman_1949	program	0.171656
140809	41_truman_1949	peoples	0.166989
137194	41_truman_1949	democracy	0.154140
138536	41_truman_1949	freedom	0.149297
136513	41_truman_1949	communism	0.147134
140786	41_truman_1949	peace	0.144167
136902	41_truman_1949	countries	0.137013
141543	41_truman_1949	recovery	0.135785
462497	42_eisenhower_1953	free	0.205803
462199	42_eisenhower_1953	faith	0.154561
467887	42_eisenhower_1953	world	0.146449
464773	42_eisenhower_1953	peoples	0.139466
465172	42_eisenhower_1953	productivity	0.133826
466648	42_eisenhower_1953	strength	0.130430
464750	42_eisenhower_1953	peace	0.123845
462500	42_eisenhower_1953	freedom	0.123320
466231	42_eisenhower_1953	shall	0.105970
462919	42_eisenhower_1953	hold	0.105028
485885	43_eisenhower_1957	world	0.193893
480498	43_eisenhower_1957	freedom	0.179599
484143	43_eisenhower_1957	seek	0.176008
482305	43_eisenhower_1957	nations	0.175538
482771	43_eisenhower_1957	peoples	0.158270
484670	43_eisenhower_1957	strives	0.146428
482748	43_eisenhower_1957	peace	0.136639
480873	43_eisenhower_1957	help	0.132688
479515	43_eisenhower_1957	divided	0.115450
482262	43_eisenhower_1957	mr	0.111886
391774	44_kennedy_1961	let	0.267869
394306	44_kennedy_1961	sides	0.262849
392921	44_kennedy_1961	pledge	0.160960
387632	44_kennedy_1961	ask	0.107713
387864	44_kennedy_1961	begin	0.106495
388991	44_kennedy_1961	dare	0.106495
395895	44_kennedy_1961	world	0.103110
390313	44_kennedy_1961	final	0.102311
392370	44_kennedy_1961	new	0.096600
390120	44_kennedy_1961	explore	0.094223
109283	45_johnson_1965	change	0.276090
109919	45_johnson_1965	covenant	0.242891
113001	45_johnson_1965	man	0.174391
113063	45_johnson_1965	mastery	0.153532
113341	45_johnson_1965	nation	0.152475
116457	45_johnson_1965	union	0.150512
113540	45_johnson_1965	old	0.129184
116299	45_johnson_1965	trying	0.109663
113811	45_johnson_1965	people	0.108677
111846	45_johnson_1965	harvest	0.102355
458677	46_nixon_1969	voices	0.208854
455751	46_nixon_1969	peace	0.144624
454767	46_nixon_1969	let	0.140977
452636	46_nixon_1969	earth	0.139513
454654	46_nixon_1969	know	0.137969
454963	46_nixon_1969	man	0.135416
455773	46_nixon_1969	people	0.131270
458888	46_nixon_1969	world	0.128264
456896	46_nixon_1969	rhetoric	0.119219
453462	46_nixon_1969	forward	0.113215
9460	47_nixon_1973	america	0.307074
13816	47_nixon_1973	let	0.282212
14800	47_nixon_1973	peace	0.211567
16008	47_nixon_1973	role	0.190395
17937	47_nixon_1973	world	0.177760
14983	47_nixon_1973	policies	0.176224
15848	47_nixon_1973	responsibility	0.164016
14412	47_nixon_1973	new	0.158606
9147	47_nixon_1973	abroad	0.154815
12977	47_nixon_1973	home	0.126653
190049	48_carter_1977	br	0.222574
194332	48_carter_1977	nation	0.191717
191619	48_carter_1977	dream	0.181515
196678	48_carter_1977	strength	0.147104
194392	48_carter_1977	new	0.142111
194143	48_carter_1977	micah	0.118797
197040	48_carter_1977	thee	0.107811
196534	48_carter_1977	spirit	0.107000
193001	48_carter_1977	human	0.101203
191853	48_carter_1977	enhance	0.100016
156690	49_reagan_1981	government	0.162397
153447	49_reagan_1981	americans	0.156895
156925	49_reagan_1981	heroes	0.137410
153904	49_reagan_1981	believe	0.136126
161636	49_reagan_1981	ve	0.115339
159206	49_reagan_1981	productivity	0.104753
161801	49_reagan_1981	weapon	0.104753
156534	49_reagan_1981	freedom	0.102964
155625	49_reagan_1981	dreams	0.101106
161131	49_reagan_1981	today	0.093813
21705	50_reagan_1985	government	0.161165
21549	50_reagan_1985	freedom	0.159998
23452	50_reagan_1985	nuclear	0.153623
26651	50_reagan_1985	ve	0.153623
26817	50_reagan_1985	weapons	0.140173
23821	50_reagan_1985	people	0.137038
26936	50_reagan_1985	world	0.127236
21963	50_reagan_1985	history	0.104777
22020	50_reagan_1985	human	0.104777
25219	50_reagan_1985	senator	0.102416
119593	51_bush_george_h_w_1989	don	0.186313
118075	51_bush_george_h_w_1989	breeze	0.184416
122400	51_bush_george_h_w_1989	new	0.137266
120557	51_bush_george_h_w_1989	friends	0.136820
119597	51_bush_george_h_w_1989	door	0.133889
125912	51_bush_george_h_w_1989	word	0.131722
122302	51_bush_george_h_w_1989	mr	0.126821
120800	51_bush_george_h_w_1989	hand	0.125086
118005	51_bush_george_h_w_1989	blowing	0.110649
125064	51_bush_george_h_w_1989	things	0.110609
252433	52_clinton_1993	america	0.318908
260910	52_clinton_1993	world	0.226715
252436	52_clinton_1993	americans	0.206865
260120	52_clinton_1993	today	0.185539
253267	52_clinton_1993	change	0.170522
258709	52_clinton_1993	renewal	0.136867
259135	52_clinton_1993	season	0.136867
256028	52_clinton_1993	idea	0.134993
256789	52_clinton_1993	let	0.132521
257795	52_clinton_1993	people	0.129272
28273	53_clinton_1997	century	0.321300
32410	53_clinton_1997	new	0.279600
27458	53_clinton_1997	america	0.199997
33257	53_clinton_1997	promise	0.164327
35935	53_clinton_1997	world	0.135071
31724	53_clinton_1997	land	0.131027
32350	53_clinton_1997	nation	0.117062
27461	53_clinton_1997	americans	0.115029
35131	53_clinton_1997	time	0.108057
31814	53_clinton_1997	let	0.105270
322651	54_bush_george_w_2001	story	0.341166
315426	54_bush_george_w_2001	america	0.193152
316343	54_bush_george_w_2001	civility	0.160853
320318	54_bush_george_w_2001	nation	0.130448
315304	54_bush_george_w_2001	affirm	0.120640
319026	54_bush_george_w_2001	ideals	0.109491
315429	54_bush_george_w_2001	americans	0.108207
321225	54_bush_george_w_2001	promise	0.108207
316509	54_bush_george_w_2001	compassion	0.107388
316337	54_bush_george_w_2001	citizens	0.106730
435503	55_bush_george_w_2005	freedom	0.349948
432413	55_bush_george_w_2005	america	0.284882
436792	55_bush_george_w_2005	liberty	0.174494
432416	55_bush_george_w_2005	americans	0.140443
440278	55_bush_george_w_2005	tyranny	0.127272
439154	55_bush_george_w_2005	seen	0.110386
437305	55_bush_george_w_2005	nation	0.096199
433200	55_bush_george_w_2005	cause	0.092545
435917	55_bush_george_w_2005	history	0.092422
433132	55_bush_george_w_2005	came	0.091988
54455	56_obama_2009	america	0.148351
59347	56_obama_2009	nation	0.120229
59407	56_obama_2009	new	0.118002
62142	56_obama_2009	today	0.114792
57639	56_obama_2009	generation	0.100654
58811	56_obama_2009	let	0.091100
58627	56_obama_2009	jobs	0.090727
55960	56_obama_2009	crisis	0.087235
57828	56_obama_2009	hard	0.084859
62910	56_obama_2009	women	0.084859
418595	57_obama_2013	journey	0.167591
415909	57_obama_2013	creed	0.139659
417599	57_obama_2013	generation	0.127260
414415	57_obama_2013	america	0.125044
415519	57_obama_2013	complete	0.114891
420751	57_obama_2013	requires	0.114891
419777	57_obama_2013	people	0.110351
422088	57_obama_2013	time	0.105563
422102	57_obama_2013	today	0.103668
416980	57_obama_2013	evident	0.100896
504405	58_trump_2017	america	0.350162
506586	58_trump_2017	dreams	0.156436
504406	58_trump_2017	american	0.149226
508577	58_trump_2017	jobs	0.142766
510263	58_trump_2017	protected	0.132439
509410	58_trump_2017	obama	0.120288
509767	58_trump_2017	people	0.112370
512002	58_trump_2017	thank	0.109171
504990	58_trump_2017	borders	0.107075
512597	58_trump_2017	ve	0.107075

top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
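To see what this line does, here is a minimal sketch on a toy dataframe. The names `toy_df` and `toy_top` and all of the scores are made up for illustration; only the column names match `tfidf_df`. Sorting puts each document's rows in descending score order, and `.groupby(...).head(n)` then keeps the first `n` rows of each group.

```python
import pandas as pd

# Toy stand-in for tfidf_df (hypothetical values, not from the inaugural corpus)
toy_df = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b", "b"],
    "term": ["peace", "war", "nation", "peace", "jobs", "world"],
    "tfidf": [0.10, 0.30, 0.20, 0.50, 0.40, 0.60],
})

# Sort by document (A to Z), then by score (highest first),
# and keep only the first 2 rows of each document group
toy_top = (
    toy_df.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)
)

print(toy_top)
```

Because the rows are already sorted when `head` is applied, the rows kept for each document are its highest-scoring terms.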

We can zoom in on particular words and particular documents.

top_tfidf[top_tfidf['term'].str.contains('women')]

	document	term	tfidf
62910	56_obama_2009	women	0.084859

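It turns out that “women” is one of the ten highest-scoring terms in Obama’s 2009 Inaugural Address, tied with “hard” at 0.084859, and it does not appear in the top ten of any other address.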
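One caveat: `str.contains('women')` matches any term that contains those letters, not just the exact word. The sketch below shows the difference between a substring match and an exact match, and adds a per-document rank so you can see where a term falls. It uses a toy dataframe: the `99_example` document, the term `womens`, and all scores are hypothetical.

```python
import pandas as pd

# Toy stand-in for top_tfidf (hypothetical rows and scores)
toy_top = pd.DataFrame({
    "document": ["56_obama_2009", "56_obama_2009", "99_example", "99_example"],
    "term": ["hard", "women", "womens", "peace"],
    "tfidf": [0.08, 0.08, 0.20, 0.10],
})

# Substring match: also catches "womens"
substring_hits = toy_top[toy_top["term"].str.contains("women")]

# Exact match: only the term "women" itself
exact_hits = toy_top[toy_top["term"] == "women"]

# Rank each term within its document (1 = highest score);
# method="min" gives tied scores the same rank
ranked = toy_top.assign(
    rank=toy_top.groupby("document")["tfidf"]
    .rank(method="min", ascending=False)
    .astype(int)
)

print(substring_hits)
print(exact_hits)
print(ranked)
```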

top_tfidf[top_tfidf['document'].str.contains('obama')]

	document	term	tfidf
54455	56_obama_2009	america	0.148351
59347	56_obama_2009	nation	0.120229
59407	56_obama_2009	new	0.118002
62142	56_obama_2009	today	0.114792
57639	56_obama_2009	generation	0.100654
58811	56_obama_2009	let	0.091100
58627	56_obama_2009	jobs	0.090727
55960	56_obama_2009	crisis	0.087235
57828	56_obama_2009	hard	0.084859
62910	56_obama_2009	women	0.084859
418595	57_obama_2013	journey	0.167591
415909	57_obama_2013	creed	0.139659
417599	57_obama_2013	generation	0.127260
414415	57_obama_2013	america	0.125044
415519	57_obama_2013	complete	0.114891
420751	57_obama_2013	requires	0.114891
419777	57_obama_2013	people	0.110351
422088	57_obama_2013	time	0.105563
422102	57_obama_2013	today	0.103668
416980	57_obama_2013	evident	0.100896

top_tfidf[top_tfidf['document'].str.contains('trump')]

	document	term	tfidf
504405	58_trump_2017	america	0.350162
506586	58_trump_2017	dreams	0.156436
504406	58_trump_2017	american	0.149226
508577	58_trump_2017	jobs	0.142766
510263	58_trump_2017	protected	0.132439
509410	58_trump_2017	obama	0.120288
509767	58_trump_2017	people	0.112370
512002	58_trump_2017	thank	0.109171
504990	58_trump_2017	borders	0.107075
512597	58_trump_2017	ve	0.107075

top_tfidf[top_tfidf['document'].str.contains('kennedy')]

	document	term	tfidf
391774	44_kennedy_1961	let	0.267869
394306	44_kennedy_1961	sides	0.262849
392921	44_kennedy_1961	pledge	0.160960
387632	44_kennedy_1961	ask	0.107713
387864	44_kennedy_1961	begin	0.106495
388991	44_kennedy_1961	dare	0.106495
395895	44_kennedy_1961	world	0.103110
390313	44_kennedy_1961	final	0.102311
392370	44_kennedy_1961	new	0.096600
390120	44_kennedy_1961	explore	0.094223
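By default, pandas treats the pattern passed to `str.contains` as a regular expression, so a single filter can select several documents at once. The following is a small sketch on a toy dataframe (one row per address, with scores rounded for illustration) rather than on the full `top_tfidf`.

```python
import pandas as pd

# Toy stand-in for top_tfidf: one row per address, scores rounded for illustration
toy_top = pd.DataFrame({
    "document": ["44_kennedy_1961", "56_obama_2009", "57_obama_2013", "58_trump_2017"],
    "term": ["let", "america", "journey", "america"],
    "tfidf": [0.27, 0.15, 0.17, 0.35],
})

# "|" means "or" in a regular expression: keep Kennedy's and Trump's rows
kennedy_or_trump = toy_top[toy_top["document"].str.contains("kennedy|trump")]

# regex=False treats the pattern as plain text, which is safer
# when the search string might contain special characters
obama_2013 = toy_top[toy_top["document"].str.contains("obama_2013", regex=False)]

print(kennedy_or_trump)
print(obama_2013)
```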

Visualize TF-IDF

We can also visualize our TF-IDF results with the data visualization library Altair.

!pip install altair

Let’s make a heatmap that shows the highest TF-IDF scoring words for each president, and let’s put a red dot next to two terms of interest: “war” and “peace”:

The code below was contributed by Eric Monson. Thanks, Eric!
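The contributed code cell is not reproduced in this copy of the lesson. As a stand-in, here is a minimal sketch of one way to build such a heatmap. It is not Eric Monson's code. The helper names `prepare_heatmap_data` and `build_heatmap`, the `rank` and `of_interest` columns, and the toy scores are all assumptions made for illustration.

```python
import pandas as pd


def prepare_heatmap_data(top_tfidf, terms_of_interest=("war", "peace")):
    """Add a per-document rank and a flag for the terms we want to highlight."""
    df = top_tfidf.copy()
    # method="first" breaks ties by row order, so every cell gets a unique rank
    df["rank"] = (
        df.groupby("document")["tfidf"]
        .rank(method="first", ascending=False)
        .astype(int)
    )
    df["of_interest"] = df["term"].isin(terms_of_interest)
    return df


def build_heatmap(heatmap_df):
    """Layer a heatmap, red dots for flagged terms, and the term labels."""
    # Imported here so the data preparation above runs even without Altair installed
    import altair as alt

    x = alt.X("rank:O", title="rank within address")
    y = alt.Y("document:N", title="address")

    # One colored cell per (address, rank) pair
    heatmap = alt.Chart(heatmap_df).mark_rect().encode(
        x=x, y=y, color=alt.Color("tfidf:Q", title="tf-idf")
    )
    # Red dots only on the rows flagged as terms of interest
    dots = (
        alt.Chart(heatmap_df[heatmap_df["of_interest"]])
        .mark_circle(color="red", size=100)
        .encode(x=x, y=y)
    )
    # The term itself, drawn on top of each cell
    labels = alt.Chart(heatmap_df).mark_text(baseline="middle").encode(
        x=x, y=y, text="term:N"
    )
    return (heatmap + dots + labels).properties(width=600)


# Toy stand-in for top_tfidf (hypothetical scores)
toy_top = pd.DataFrame({
    "document": ["a", "a", "b", "b"],
    "term": ["war", "nation", "peace", "world"],
    "tfidf": [0.30, 0.20, 0.50, 0.60],
})
heatmap_df = prepare_heatmap_data(toy_top)
print(heatmap_df)

# In a notebook with Altair installed, display the chart with:
# build_heatmap(heatmap_df)
```

To try it on the real results, pass `top_tfidf` to `prepare_heatmap_data` in place of `toy_top`.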

Your Turn!

Take a few minutes to explore the dataframe above and then answer the following questions.

1. What is the difference between a tf-idf score and raw word frequency?

2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?

3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?