TF-IDF with Scikit-Learn#
In the previous lesson, we learned about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf. Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. We specifically learned how to calculate tf-idf scores using word frequencies per page—or “extracted features”—made available by the HathiTrust Digital Library.
In this lesson, we’re going to learn how to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty class called TfidfVectorizer.
In this lesson, we will cover how to:
Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn
Dataset#
U.S. Inaugural Addresses#
This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath. So let us mark this day with remembrance of who we are and how far we have traveled.
—Barack Obama, Inaugural Presidential Address, January 2009
During Barack Obama’s Inaugural Address in January 2009, he mentioned “women” four different times, including in the passage quoted above. How distinctive is Obama’s inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we’re going to try to answer with tf-idf.
Breaking Down the TF-IDF Formula#
But first, let’s quickly discuss the tf-idf formula. The idea is pretty simple.
tf-idf = term_frequency * inverse_document_frequency
term_frequency = number of times a given term appears in document
inverse_document_frequency = log(total number of documents / number of documents with term) + 1*
You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), flip that fraction on its head, take its logarithm, and add 1 (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).
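To make the arithmetic concrete, here is a minimal sketch of the basic formula above in plain Python. The toy documents and the function name `basic_tfidf` are made up for illustration; this is not how scikit-learn computes its scores internally, since its default adds smoothing and normalization.

```python
import math

# Hypothetical toy corpus: three tiny "documents", already split into words.
toy_docs = {
    "doc_a": "the pigeons flew over the city".split(),
    "doc_b": "she said the city was quiet".split(),
    "doc_c": "he said she said nothing".split(),
}

def basic_tfidf(term, doc_name, docs):
    """tf-idf = term_frequency * (log(N / document_frequency) + 1)."""
    term_frequency = docs[doc_name].count(term)
    document_frequency = sum(1 for words in docs.values() if term in words)
    if document_frequency == 0:
        return 0.0  # the term never occurs, so there is nothing to score
    inverse_document_frequency = math.log(len(docs) / document_frequency) + 1
    return term_frequency * inverse_document_frequency

# Both words occur once in doc_a, but "pigeons" is in 1 document and "city" in 2.
pigeons_score = basic_tfidf("pigeons", "doc_a", toy_docs)
city_score = basic_tfidf("city", "doc_a", toy_docs)
print(round(pigeons_score, 3), round(city_score, 3))
```

Because the two words have the same term frequency in `doc_a`, the difference between their scores comes entirely from the inverse document frequency: the rarer word gets the bigger score.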
The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word “said” vs the word “pigeons.” The term “said” appears in 13 (document frequency) of the 14 (total documents) Lost in the City stories (14 / 13 –> a smaller inverse document frequency), while the term “pigeons” only occurs in 2 (document frequency) of the 14 stories (total documents) (14 / 2 –> a bigger inverse document frequency, a bigger tf-idf boost).
*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we’re going to use is the scikit-learn default, which uses “smoothing” aka it adds a “1” to the numerator and denominator:
inverse_document_frequency = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1
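As a quick check on the smoothed formula, the sketch below plugs in the “said” and “pigeons” counts from the example above. The helper name `smoothed_idf` is an assumption for illustration, and the logarithm is taken to be the natural log.

```python
import math

def smoothed_idf(total_documents, documents_with_term):
    """Smoothed idf as written above: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + total_documents) / (1 + documents_with_term)) + 1

said_idf = smoothed_idf(14, 13)     # "said" is in 13 of 14 stories
pigeons_idf = smoothed_idf(14, 2)   # "pigeons" is in 2 of 14 stories
print(round(said_idf, 3), round(pigeons_idf, 3))
```

The values come out to roughly 1.07 for “said” and 2.61 for “pigeons,” so each occurrence of the rarer word counts for well over twice as much.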
TF-IDF with scikit-learn#
scikit-learn, imported as sklearn, is a popular Python library for machine learning approaches such as clustering, classification, and regression. Though we’re not doing any machine learning in this lesson, we’re nevertheless going to use scikit-learn’s TfidfVectorizer and CountVectorizer.
Install scikit-learn
!pip install scikit-learn
Import necessary modules and libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
from pathlib import Path
import glob
Pandas Review
Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!
We’re also going to import pandas and change its default display setting. And we’re going to import two libraries that will help us work with files and the file system: pathlib and glob.
Set Directory Path#
Below we’re setting the directory filepath that contains all the text files that we want to analyze.
directory_path = "../texts/history/US_Inaugural_Addresses/"
Then we’re going to use glob and Path to make a list of all the filepaths in that directory and a list of all the address titles (the filenames without the .txt extension).
text_files = glob.glob(f"{directory_path}/*.txt")
text_files
['../texts/history/US_Inaugural_Addresses/13_van_buren_1837.txt',
'../texts/history/US_Inaugural_Addresses/47_nixon_1973.txt',
'../texts/history/US_Inaugural_Addresses/50_reagan_1985.txt',
'../texts/history/US_Inaugural_Addresses/53_clinton_1997.txt',
'../texts/history/US_Inaugural_Addresses/17_pierce_1853.txt',
'../texts/history/US_Inaugural_Addresses/14_harrison_1841.txt',
'../texts/history/US_Inaugural_Addresses/56_obama_2009.txt',
'../texts/history/US_Inaugural_Addresses/25_cleveland_1885.txt',
'../texts/history/US_Inaugural_Addresses/03_adams_john_1797.txt',
'../texts/history/US_Inaugural_Addresses/12_jackson_1833.txt',
'../texts/history/US_Inaugural_Addresses/11_jackson_1829.txt',
'../texts/history/US_Inaugural_Addresses/36_hoover_1929.txt',
'../texts/history/US_Inaugural_Addresses/45_johnson_1965.txt',
'../texts/history/US_Inaugural_Addresses/51_bush_george_h_w_1989.txt',
'../texts/history/US_Inaugural_Addresses/21_grant_1869.txt',
'../texts/history/US_Inaugural_Addresses/41_truman_1949.txt',
'../texts/history/US_Inaugural_Addresses/33_wilson_1917.txt',
'../texts/history/US_Inaugural_Addresses/49_reagan_1981.txt',
'../texts/history/US_Inaugural_Addresses/30_roosevelt_theodore_1905.txt',
'../texts/history/US_Inaugural_Addresses/07_madison_1813.txt',
'../texts/history/US_Inaugural_Addresses/09_monroe_1821.txt',
'../texts/history/US_Inaugural_Addresses/48_carter_1977.txt',
'../texts/history/US_Inaugural_Addresses/32_wilson_1913.txt',
'../texts/history/US_Inaugural_Addresses/19_lincoln_1861.txt',
'../texts/history/US_Inaugural_Addresses/01_washington_1789.txt',
'../texts/history/US_Inaugural_Addresses/29_mckinley_1901.txt',
'../texts/history/US_Inaugural_Addresses/04_jefferson_1801.txt',
'../texts/history/US_Inaugural_Addresses/34_harding_1921.txt',
'../texts/history/US_Inaugural_Addresses/52_clinton_1993.txt',
'../texts/history/US_Inaugural_Addresses/35_coolidge_1925.txt',
'../texts/history/US_Inaugural_Addresses/39_roosevelt_franklin_1941.txt',
'../texts/history/US_Inaugural_Addresses/28_mckinley_1897.txt',
'../texts/history/US_Inaugural_Addresses/24_garfield_1881.txt',
'../texts/history/US_Inaugural_Addresses/22_grant_1873.txt',
'../texts/history/US_Inaugural_Addresses/15_polk_1845.txt',
'../texts/history/US_Inaugural_Addresses/54_bush_george_w_2001.txt',
'../texts/history/US_Inaugural_Addresses/02_washington_1793.txt',
'../texts/history/US_Inaugural_Addresses/38_roosevelt_franklin_1937.txt',
'../texts/history/US_Inaugural_Addresses/37_roosevelt_franklin_1933.txt',
'../texts/history/US_Inaugural_Addresses/18_buchanan_1857.txt',
'../texts/history/US_Inaugural_Addresses/16_taylor_1849.txt',
'../texts/history/US_Inaugural_Addresses/05_jefferson_1805.txt',
'../texts/history/US_Inaugural_Addresses/26_harrison_1889.txt',
'../texts/history/US_Inaugural_Addresses/44_kennedy_1961.txt',
'../texts/history/US_Inaugural_Addresses/23_hayes_1877.txt',
'../texts/history/US_Inaugural_Addresses/20_lincoln_1865.txt',
'../texts/history/US_Inaugural_Addresses/57_obama_2013.txt',
'../texts/history/US_Inaugural_Addresses/10_adams_john_quincy_1825.txt',
'../texts/history/US_Inaugural_Addresses/55_bush_george_w_2005.txt',
'../texts/history/US_Inaugural_Addresses/27_cleveland_1893.txt',
'../texts/history/US_Inaugural_Addresses/46_nixon_1969.txt',
'../texts/history/US_Inaugural_Addresses/42_eisenhower_1953.txt',
'../texts/history/US_Inaugural_Addresses/40_roosevelt_franklin_1945.txt',
'../texts/history/US_Inaugural_Addresses/43_eisenhower_1957.txt',
'../texts/history/US_Inaugural_Addresses/08_monroe_1817.txt',
'../texts/history/US_Inaugural_Addresses/06_madison_1809.txt',
'../texts/history/US_Inaugural_Addresses/58_trump_2017.txt',
'../texts/history/US_Inaugural_Addresses/31_taft_1909.txt']
text_titles = [Path(text).stem for text in text_files]
text_titles
['13_van_buren_1837',
'47_nixon_1973',
'50_reagan_1985',
'53_clinton_1997',
'17_pierce_1853',
'14_harrison_1841',
'56_obama_2009',
'25_cleveland_1885',
'03_adams_john_1797',
'12_jackson_1833',
'11_jackson_1829',
'36_hoover_1929',
'45_johnson_1965',
'51_bush_george_h_w_1989',
'21_grant_1869',
'41_truman_1949',
'33_wilson_1917',
'49_reagan_1981',
'30_roosevelt_theodore_1905',
'07_madison_1813',
'09_monroe_1821',
'48_carter_1977',
'32_wilson_1913',
'19_lincoln_1861',
'01_washington_1789',
'29_mckinley_1901',
'04_jefferson_1801',
'34_harding_1921',
'52_clinton_1993',
'35_coolidge_1925',
'39_roosevelt_franklin_1941',
'28_mckinley_1897',
'24_garfield_1881',
'22_grant_1873',
'15_polk_1845',
'54_bush_george_w_2001',
'02_washington_1793',
'38_roosevelt_franklin_1937',
'37_roosevelt_franklin_1933',
'18_buchanan_1857',
'16_taylor_1849',
'05_jefferson_1805',
'26_harrison_1889',
'44_kennedy_1961',
'23_hayes_1877',
'20_lincoln_1865',
'57_obama_2013',
'10_adams_john_quincy_1825',
'55_bush_george_w_2005',
'27_cleveland_1893',
'46_nixon_1969',
'42_eisenhower_1953',
'40_roosevelt_franklin_1945',
'43_eisenhower_1957',
'08_monroe_1817',
'06_madison_1809',
'58_trump_2017',
'31_taft_1909']
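Notice that the lists above are not in filename order, because glob makes no promise about ordering. As an optional variation (not what this lesson does), the sketch below uses pathlib alone and sorts the result. It builds a throwaway directory with made-up stand-in files so that it can run anywhere.

```python
from pathlib import Path
import tempfile

# Stand-in directory with a few empty files, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["13_van_buren_1837.txt", "01_washington_1789.txt", "notes.md"]:
        (Path(tmp_dir) / name).touch()

    # Path.glob yields Path objects; sorted() puts them in filename order.
    sorted_files = sorted(Path(tmp_dir).glob("*.txt"))
    # .stem is the filename without its directory or extension.
    sorted_titles = [path.stem for path in sorted_files]

print(sorted_titles)
```

This prints `['01_washington_1789', '13_van_buren_1837']`. Whichever approach you use, the titles line up with the file list because they are built from it in the same order.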
Calculate tf–idf#
To calculate tf–idf scores for every word, we’re going to use scikit-learn’s TfidfVectorizer.
When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.
The recommended way to run TfidfVectorizer is with smoothing (smooth_idf=True) and normalization (norm='l2') turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for TfidfVectorizer, so to turn them on, you don’t need to include any extra code at all.
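If you want to see that for yourself, the sketch below compares a TfidfVectorizer with no arguments to one with smooth_idf=True and norm='l2' spelled out. The three toy sentences are made up for illustration. It also checks what L2 normalization means in practice: every document’s row of scores has a length (Euclidean norm) of 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, passed as strings rather than filenames.
toy_docs = [
    "the people and the government",
    "the government of the people",
    "war and peace",
]

default_vectorizer = TfidfVectorizer()
explicit_vectorizer = TfidfVectorizer(smooth_idf=True, norm="l2")

default_matrix = default_vectorizer.fit_transform(toy_docs).toarray()
explicit_matrix = explicit_vectorizer.fit_transform(toy_docs).toarray()

# True if the defaults and the explicit settings give the same scores.
same_scores = np.allclose(default_matrix, explicit_matrix)
# Euclidean length of each document's row of tf-idf scores.
row_lengths = np.linalg.norm(default_matrix, axis=1)
print(same_scores, row_lengths)
```

Scaling every row to the same length is what keeps a long address from outscoring a short one simply because it contains more words.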
Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
Run TfidfVectorizer on our text_files
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
Make a DataFrame out of the resulting tf–idf vector, setting the “feature names” or words as columns and the titles as rows
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
Add a row for document frequency, aka the number of documents in which each word appears at least once
tfidf_df.loc['00_Document Frequency'] = (tfidf_df > 0).sum()
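The counting trick in that line is easier to see on a tiny table. In this sketch the scores and document names are invented; the point is only that a word has a tf–idf score above 0 exactly when it occurs in a document, so counting the positive scores in a column counts the documents.

```python
import pandas as pd

# Hypothetical tf-idf scores for two words across three documents.
toy_scores = pd.DataFrame(
    {"government": [0.11, 0.06, 0.00], "women": [0.00, 0.00, 0.08]},
    index=["doc_1", "doc_2", "doc_3"],
)

# (toy_scores > 0) is a True/False table; summing each column counts the Trues.
document_frequency = (toy_scores > 0).sum()
print(document_frequency)
```

Here “government” has a document frequency of 2 and “women” has a document frequency of 1.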
tfidf_slice = tfidf_df[['government', 'borders', 'people', 'obama', 'war', 'honor','foreign', 'men', 'women', 'children']]
tfidf_slice.sort_index().round(decimals=2)
| | government | borders | people | obama | war | honor | foreign | men | women | children |
---|---|---|---|---|---|---|---|---|---|---|
00_Document Frequency | 53.00 | 5.00 | 56.00 | 3.00 | 45.00 | 32.00 | 32.00 | 47.00 | 15.00 | 22.00 |
01_washington_1789 | 0.11 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 |
02_washington_1793 | 0.06 | 0.00 | 0.05 | 0.00 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 |
03_adams_john_1797 | 0.16 | 0.00 | 0.19 | 0.00 | 0.01 | 0.10 | 0.12 | 0.04 | 0.00 | 0.00 |
04_jefferson_1801 | 0.16 | 0.00 | 0.01 | 0.00 | 0.01 | 0.04 | 0.00 | 0.04 | 0.00 | 0.00 |
05_jefferson_1805 | 0.03 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.06 | 0.01 | 0.00 | 0.02 |
06_madison_1809 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 |
07_madison_1813 | 0.04 | 0.00 | 0.04 | 0.00 | 0.25 | 0.02 | 0.02 | 0.00 | 0.00 | 0.00 |
08_monroe_1817 | 0.17 | 0.00 | 0.11 | 0.00 | 0.09 | 0.01 | 0.10 | 0.04 | 0.00 | 0.00 |
09_monroe_1821 | 0.08 | 0.00 | 0.06 | 0.00 | 0.11 | 0.02 | 0.04 | 0.01 | 0.00 | 0.01 |
10_adams_john_quincy_1825 | 0.15 | 0.00 | 0.06 | 0.00 | 0.05 | 0.01 | 0.08 | 0.03 | 0.00 | 0.00 |
11_jackson_1829 | 0.10 | 0.00 | 0.06 | 0.00 | 0.02 | 0.02 | 0.07 | 0.02 | 0.00 | 0.00 |
12_jackson_1833 | 0.21 | 0.00 | 0.14 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
13_van_buren_1837 | 0.12 | 0.00 | 0.14 | 0.00 | 0.02 | 0.02 | 0.06 | 0.02 | 0.00 | 0.01 |
14_harrison_1841 | 0.14 | 0.00 | 0.14 | 0.00 | 0.01 | 0.02 | 0.03 | 0.03 | 0.00 | 0.00 |
15_polk_1845 | 0.26 | 0.00 | 0.08 | 0.00 | 0.03 | 0.01 | 0.09 | 0.02 | 0.00 | 0.01 |
16_taylor_1849 | 0.12 | 0.00 | 0.05 | 0.00 | 0.00 | 0.02 | 0.05 | 0.00 | 0.00 | 0.00 |
17_pierce_1853 | 0.08 | 0.00 | 0.05 | 0.00 | 0.00 | 0.02 | 0.04 | 0.01 | 0.00 | 0.03 |
18_buchanan_1857 | 0.12 | 0.00 | 0.11 | 0.00 | 0.08 | 0.01 | 0.04 | 0.03 | 0.00 | 0.05 |
19_lincoln_1861 | 0.12 | 0.00 | 0.13 | 0.00 | 0.02 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
20_lincoln_1865 | 0.02 | 0.00 | 0.00 | 0.00 | 0.27 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 |
21_grant_1869 | 0.05 | 0.00 | 0.03 | 0.00 | 0.02 | 0.05 | 0.05 | 0.02 | 0.00 | 0.00 |
22_grant_1873 | 0.06 | 0.00 | 0.10 | 0.00 | 0.05 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
23_hayes_1877 | 0.17 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.04 | 0.02 | 0.00 | 0.00 |
24_garfield_1881 | 0.19 | 0.00 | 0.16 | 0.00 | 0.05 | 0.00 | 0.00 | 0.01 | 0.00 | 0.04 |
25_cleveland_1885 | 0.21 | 0.00 | 0.21 | 0.00 | 0.00 | 0.00 | 0.05 | 0.01 | 0.00 | 0.00 |
26_harrison_1889 | 0.06 | 0.00 | 0.17 | 0.00 | 0.02 | 0.03 | 0.01 | 0.04 | 0.00 | 0.00 |
27_cleveland_1893 | 0.15 | 0.00 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 |
28_mckinley_1897 | 0.16 | 0.00 | 0.16 | 0.00 | 0.05 | 0.03 | 0.06 | 0.02 | 0.00 | 0.00 |
29_mckinley_1901 | 0.15 | 0.00 | 0.12 | 0.00 | 0.08 | 0.04 | 0.01 | 0.01 | 0.00 | 0.00 |
30_roosevelt_theodore_1905 | 0.05 | 0.00 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 0.10 |
31_taft_1909 | 0.12 | 0.00 | 0.03 | 0.00 | 0.03 | 0.01 | 0.03 | 0.03 | 0.00 | 0.00 |
32_wilson_1913 | 0.11 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | 0.10 | 0.06 |
33_wilson_1917 | 0.00 | 0.00 | 0.08 | 0.00 | 0.07 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 |
34_harding_1921 | 0.08 | 0.00 | 0.05 | 0.00 | 0.12 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 |
35_coolidge_1925 | 0.10 | 0.00 | 0.10 | 0.00 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 | 0.00 |
36_hoover_1929 | 0.20 | 0.04 | 0.10 | 0.00 | 0.01 | 0.00 | 0.01 | 0.03 | 0.03 | 0.01 |
37_roosevelt_franklin_1933 | 0.03 | 0.00 | 0.08 | 0.00 | 0.02 | 0.02 | 0.03 | 0.02 | 0.00 | 0.00 |
38_roosevelt_franklin_1937 | 0.18 | 0.03 | 0.12 | 0.00 | 0.01 | 0.00 | 0.00 | 0.10 | 0.07 | 0.02 |
39_roosevelt_franklin_1941 | 0.05 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.03 | 0.00 |
40_roosevelt_franklin_1945 | 0.00 | 0.00 | 0.02 | 0.00 | 0.05 | 0.03 | 0.00 | 0.10 | 0.05 | 0.04 |
41_truman_1949 | 0.03 | 0.00 | 0.10 | 0.00 | 0.02 | 0.01 | 0.01 | 0.06 | 0.00 | 0.00 |
42_eisenhower_1953 | 0.01 | 0.00 | 0.10 | 0.00 | 0.04 | 0.03 | 0.00 | 0.07 | 0.00 | 0.00 |
43_eisenhower_1957 | 0.00 | 0.00 | 0.10 | 0.00 | 0.01 | 0.05 | 0.00 | 0.07 | 0.00 | 0.00 |
44_kennedy_1961 | 0.00 | 0.00 | 0.01 | 0.00 | 0.06 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 |
45_johnson_1965 | 0.01 | 0.00 | 0.11 | 0.00 | 0.01 | 0.00 | 0.02 | 0.03 | 0.00 | 0.05 |
46_nixon_1969 | 0.05 | 0.00 | 0.13 | 0.00 | 0.03 | 0.03 | 0.00 | 0.01 | 0.00 | 0.00 |
47_nixon_1973 | 0.10 | 0.00 | 0.06 | 0.00 | 0.03 | 0.01 | 0.00 | 0.00 | 0.00 | 0.02 |
48_carter_1977 | 0.06 | 0.00 | 0.08 | 0.00 | 0.02 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
49_reagan_1981 | 0.16 | 0.00 | 0.08 | 0.00 | 0.01 | 0.00 | 0.00 | 0.02 | 0.04 | 0.09 |
50_reagan_1985 | 0.16 | 0.00 | 0.14 | 0.00 | 0.01 | 0.01 | 0.00 | 0.03 | 0.04 | 0.00 |
51_bush_george_h_w_1989 | 0.05 | 0.00 | 0.06 | 0.00 | 0.03 | 0.00 | 0.01 | 0.04 | 0.06 | 0.07 |
52_clinton_1993 | 0.05 | 0.00 | 0.13 | 0.00 | 0.03 | 0.00 | 0.02 | 0.01 | 0.02 | 0.06 |
53_clinton_1997 | 0.09 | 0.00 | 0.09 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.02 | 0.10 |
54_bush_george_w_2001 | 0.05 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 |
55_bush_george_w_2005 | 0.03 | 0.06 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.02 | 0.04 | 0.00 |
56_obama_2009 | 0.03 | 0.03 | 0.07 | 0.03 | 0.02 | 0.01 | 0.00 | 0.04 | 0.08 | 0.05 |
57_obama_2013 | 0.04 | 0.00 | 0.11 | 0.04 | 0.04 | 0.00 | 0.00 | 0.04 | 0.04 | 0.06 |
58_trump_2017 | 0.04 | 0.11 | 0.11 | 0.12 | 0.00 | 0.00 | 0.05 | 0.03 | 0.05 | 0.04 |
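To come back to the question about “women,” one way to read a column of this table is to sort it from highest to lowest score. The sketch below does that for four of the rows, with the rounded values copied from the table above; on the full DataFrame the equivalent would be something like `tfidf_slice['women'].sort_values(ascending=False)`, ideally after leaving out the “00_Document Frequency” row.

```python
import pandas as pd

# "women" scores copied (already rounded) from four rows of the table above.
women_scores = pd.Series(
    {
        "32_wilson_1913": 0.10,
        "38_roosevelt_franklin_1937": 0.07,
        "51_bush_george_h_w_1989": 0.06,
        "56_obama_2009": 0.08,
    },
    name="women",
)

# Highest tf-idf score first.
top_women = women_scores.sort_values(ascending=False)
print(top_women)
```

Among these four rows, Obama’s 2009 address comes second, just behind Wilson’s 1913 address.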
Let’s drop “00_Document Frequency” since we were just using it for illustration purposes.
tfidf_df = tfidf_df.drop('00_Document Frequency', errors='ignore')
Let’s reorganize the DataFrame so that the words are in rows rather than columns.
tfidf_df.stack().reset_index()
| | level_0 | level_1 | 0 |
---|---|---|---|
0 | 13_van_buren_1837 | 000 | 0.000000 |
1 | 13_van_buren_1837 | 03 | 0.011681 |
2 | 13_van_buren_1837 | 04 | 0.011924 |
3 | 13_van_buren_1837 | 05 | 0.000000 |
4 | 13_van_buren_1837 | 100 | 0.000000 |
... | ... | ... | ... |
521937 | 31_taft_1909 | zachary | 0.000000 |
521938 | 31_taft_1909 | zeal | 0.000000 |
521939 | 31_taft_1909 | zealous | 0.000000 |
521940 | 31_taft_1909 | zealously | 0.000000 |
521941 | 31_taft_1909 | zone | 0.000000 |
521942 rows × 3 columns
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
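Here is the same wide-to-long reshaping on a table small enough to read at a glance. The four scores are rounded values taken from the table earlier in this lesson; the variable names `wide_df` and `long_df` are just for illustration.

```python
import pandas as pd

# Two documents (rows) by two words (columns), as in the wide tf-idf table.
wide_df = pd.DataFrame(
    {"people": [0.19, 0.07], "war": [0.01, 0.02]},
    index=["03_adams_john_1797", "56_obama_2009"],
)

# stack() turns each (row label, column label) pair into its own row.
long_df = wide_df.stack().reset_index()
long_df = long_df.rename(columns={"level_0": "document", "level_1": "term", 0: "tfidf"})
print(long_df)
```

Two documents with two words each become four rows, one per document and term pair, which is why the full DataFrame ends up with hundreds of thousands of rows.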
To find the 10 words with the highest tf–idf score in every address, we’re going to sort by document and by tfidf score (highest first), then group by document and take the first 10 rows of each group.
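Before running this on the full corpus, the sort-then-group pattern can be tried on a toy table. In this sketch the documents, terms, and scores are invented, and we keep the top 2 rows per document instead of the top 10.

```python
import pandas as pd

# Hypothetical long-format table, deliberately out of order.
toy_long = pd.DataFrame(
    {
        "document": ["doc_b", "doc_a", "doc_a", "doc_b", "doc_a", "doc_b"],
        "term": ["war", "union", "people", "peace", "law", "god"],
        "tfidf": [0.27, 0.20, 0.13, 0.05, 0.12, 0.15],
    }
)

# Sort documents A to Z and scores high to low, then keep 2 rows per document.
top_two = (
    toy_long.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)
)
print(top_two)
```

The sorting does the ranking; `groupby(...).head(2)` only trims each document’s block of rows down to its first two.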
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
| | document | term | tfidf |
---|---|---|---|
219683 | 01_washington_1789 | government | 0.113681 |
220084 | 01_washington_1789 | immutable | 0.103883 |
220151 | 01_washington_1789 | impressions | 0.103883 |
222313 | 01_washington_1789 | providential | 0.103883 |
221607 | 01_washington_1789 | ought | 0.103728 |
222327 | 01_washington_1789 | public | 0.103102 |
222093 | 01_washington_1789 | present | 0.097516 |
222365 | 01_washington_1789 | qualifications | 0.096372 |
221787 | 01_washington_1789 | peculiarly | 0.090546 |
216629 | 01_washington_1789 | article | 0.085786 |
323983 | 02_washington_1793 | 1793 | 0.229350 |
324608 | 02_washington_1793 | arrive | 0.229350 |
332541 | 02_washington_1793 | upbraidings | 0.229350 |
328215 | 02_washington_1793 | incurring | 0.208140 |
332665 | 02_washington_1793 | violated | 0.208140 |
332837 | 02_washington_1793 | willingly | 0.208140 |
328333 | 02_washington_1793 | injunctions | 0.193091 |
328670 | 02_washington_1793 | knowingly | 0.193091 |
330122 | 02_washington_1793 | previous | 0.193091 |
332875 | 02_washington_1793 | witnesses | 0.193091 |
77815 | 03_adams_john_1797 | people | 0.191180 |
75699 | 03_adams_john_1797 | government | 0.160937 |
77953 | 03_adams_john_1797 | pleasing | 0.147066 |
75456 | 03_adams_john_1797 | foreign | 0.116874 |
77350 | 03_adams_john_1797 | nations | 0.114480 |
80705 | 03_adams_john_1797 | virtuous | 0.110813 |
76010 | 03_adams_john_1797 | houses | 0.110300 |
76793 | 03_adams_john_1797 | legislatures | 0.110300 |
73753 | 03_adams_john_1797 | constitution | 0.104525 |
75977 | 03_adams_john_1797 | honor | 0.102265 |
237681 | 04_jefferson_1801 | government | 0.155691 |
240148 | 04_jefferson_1801 | principle | 0.130113 |
238791 | 04_jefferson_1801 | let | 0.117970 |
241040 | 04_jefferson_1801 | safety | 0.108427 |
238987 | 04_jefferson_1801 | man | 0.106841 |
242072 | 04_jefferson_1801 | thousandth | 0.104513 |
237956 | 04_jefferson_1801 | honest | 0.101696 |
237304 | 04_jefferson_1801 | fellow | 0.097240 |
240870 | 04_jefferson_1801 | retire | 0.094848 |
239551 | 04_jefferson_1801 | opinion | 0.092587 |
375310 | 05_jefferson_1805 | public | 0.180456 |
372220 | 05_jefferson_1805 | false | 0.135863 |
376581 | 05_jefferson_1805 | state | 0.121514 |
377799 | 05_jefferson_1805 | whatsoever | 0.116886 |
373835 | 05_jefferson_1805 | limits | 0.107085 |
370331 | 05_jefferson_1805 | citizens | 0.106592 |
375449 | 05_jefferson_1805 | reason | 0.104438 |
370444 | 05_jefferson_1805 | comforts | 0.101880 |
375094 | 05_jefferson_1805 | press | 0.101549 |
372101 | 05_jefferson_1805 | expenses | 0.096524 |
499129 | 06_madison_1809 | improvements | 0.152559 |
495873 | 06_madison_1809 | belligerent | 0.123161 |
501296 | 06_madison_1809 | public | 0.122235 |
500303 | 06_madison_1809 | nations | 0.104588 |
501678 | 06_madison_1809 | rendered | 0.101706 |
495732 | 06_madison_1809 | authorities | 0.089155 |
495743 | 06_madison_1809 | avail | 0.089155 |
497991 | 06_madison_1809 | examples | 0.089155 |
496848 | 06_madison_1809 | councils | 0.085894 |
500508 | 06_madison_1809 | ones | 0.085894 |
179746 | 07_madison_1813 | war | 0.254249 |
172088 | 07_madison_1813 | british | 0.222972 |
176049 | 07_madison_1813 | massacre | 0.119009 |
172191 | 07_madison_1813 | captives | 0.108003 |
172963 | 07_madison_1813 | cruel | 0.108003 |
177162 | 07_madison_1813 | prisoners | 0.108003 |
178083 | 07_madison_1813 | savage | 0.108003 |
173727 | 07_madison_1813 | element | 0.085005 |
173828 | 07_madison_1813 | enemy | 0.085005 |
174967 | 07_madison_1813 | honorable | 0.084762 |
493573 | 08_monroe_1817 | states | 0.184195 |
489653 | 08_monroe_1817 | government | 0.174125 |
489687 | 08_monroe_1817 | great | 0.160658 |
494415 | 08_monroe_1817 | union | 0.117193 |
491769 | 08_monroe_1817 | people | 0.112825 |
494418 | 08_monroe_1817 | united | 0.112076 |
487979 | 08_monroe_1817 | dangers | 0.108567 |
491313 | 08_monroe_1817 | naval | 0.104713 |
489410 | 08_monroe_1817 | foreign | 0.103460 |
492121 | 08_monroe_1817 | principles | 0.097766 |
183721 | 09_monroe_1821 | great | 0.173751 |
187607 | 09_monroe_1821 | states | 0.137384 |
186897 | 09_monroe_1821 | revenue | 0.115018 |
188745 | 09_monroe_1821 | war | 0.113785 |
185730 | 09_monroe_1821 | parties | 0.109318 |
188452 | 09_monroe_1821 | united | 0.108029 |
181484 | 09_monroe_1821 | commerce | 0.105001 |
183432 | 09_monroe_1821 | force | 0.102947 |
183478 | 09_monroe_1821 | fortifications | 0.098741 |
188014 | 09_monroe_1821 | term | 0.094808 |
431422 | 10_adams_john_quincy_1825 | union | 0.257335 |
426660 | 10_adams_john_quincy_1825 | government | 0.147726 |
426595 | 10_adams_john_quincy_1825 | general | 0.109221 |
429922 | 10_adams_john_quincy_1825 | rights | 0.096300 |
425471 | 10_adams_john_quincy_1825 | dissensions | 0.095289 |
429304 | 10_adams_john_quincy_1825 | public | 0.094573 |
424714 | 10_adams_john_quincy_1825 | constitution | 0.090300 |
428754 | 10_adams_john_quincy_1825 | peace | 0.088183 |
424871 | 10_adams_john_quincy_1825 | country | 0.086898 |
428791 | 10_adams_john_quincy_1825 | performance | 0.085565 |
96341 | 11_jackson_1829 | public | 0.160747 |
93633 | 11_jackson_1829 | generally | 0.122711 |
92371 | 11_jackson_1829 | diffidence | 0.112691 |
92130 | 11_jackson_1829 | defending | 0.105878 |
97272 | 11_jackson_1829 | shall | 0.104933 |
96907 | 11_jackson_1829 | revenue | 0.102776 |
98938 | 11_jackson_1829 | worth | 0.100312 |
93697 | 11_jackson_1829 | government | 0.099698 |
93306 | 11_jackson_1829 | federal | 0.093100 |
96034 | 11_jackson_1829 | power | 0.092071 |
89460 | 12_jackson_1833 | union | 0.212766 |
84698 | 12_jackson_1833 | government | 0.207559 |
88618 | 12_jackson_1833 | states | 0.141549 |
86814 | 12_jackson_1833 | people | 0.136557 |
87114 | 12_jackson_1833 | preservation | 0.128319 |
84633 | 12_jackson_1833 | general | 0.125422 |
84083 | 12_jackson_1833 | exercise | 0.119275 |
85236 | 12_jackson_1833 | inculcate | 0.116720 |
87285 | 12_jackson_1833 | proportion | 0.116720 |
87038 | 12_jackson_1833 | powers | 0.113757 |
4427 | 13_van_buren_1837 | institutions | 0.186889 |
5823 | 13_van_buren_1837 | people | 0.138465 |
3707 | 13_van_buren_1837 | government | 0.116561 |
7872 | 13_van_buren_1837 | supposed | 0.109949 |
1918 | 13_van_buren_1837 | country | 0.109276 |
243 | 13_van_buren_1837 | actual | 0.096382 |
3144 | 13_van_buren_1837 | experience | 0.093444 |
267 | 13_van_buren_1837 | adherence | 0.083833 |
1639 | 13_van_buren_1837 | conduct | 0.081635 |
5578 | 13_van_buren_1837 | opinions | 0.081597 |
51039 | 14_harrison_1841 | power | 0.204207 |
46756 | 14_harrison_1841 | constitution | 0.183336 |
48081 | 14_harrison_1841 | executive | 0.157153 |
50818 | 14_harrison_1841 | people | 0.141584 |
48702 | 14_harrison_1841 | government | 0.141142 |
52008 | 14_harrison_1841 | roman | 0.110538 |
52622 | 14_harrison_1841 | states | 0.108621 |
46367 | 14_harrison_1841 | citizens | 0.105857 |
46300 | 14_harrison_1841 | character | 0.102640 |
52617 | 14_harrison_1841 | state | 0.094976 |
314435 | 15_polk_1845 | union | 0.259054 |
309673 | 15_polk_1845 | government | 0.256967 |
313593 | 15_polk_1845 | states | 0.218122 |
314021 | 15_polk_1845 | texas | 0.199846 |
312883 | 15_polk_1845 | revenue | 0.146541 |
312013 | 15_polk_1845 | powers | 0.124655 |
312287 | 15_polk_1845 | protection | 0.107385 |
307727 | 15_polk_1845 | constitution | 0.106528 |
310443 | 15_polk_1845 | interests | 0.105054 |
309149 | 15_polk_1845 | extended | 0.090179 |
367242 | 16_taylor_1849 | shall | 0.266204 |
363667 | 16_taylor_1849 | government | 0.118031 |
362624 | 16_taylor_1849 | duties | 0.117893 |
365432 | 16_taylor_1849 | object | 0.104293 |
361650 | 16_taylor_1849 | congress | 0.103865 |
366328 | 16_taylor_1849 | purity | 0.101793 |
368626 | 16_taylor_1849 | vested | 0.101793 |
365066 | 16_taylor_1849 | measures | 0.101637 |
361878 | 16_taylor_1849 | country | 0.101169 |
360297 | 16_taylor_1849 | affections | 0.097017 |
39837 | 17_pierce_1853 | hardly | 0.114001 |
42040 | 17_pierce_1853 | power | 0.102456 |
42011 | 17_pierce_1853 | position | 0.086643 |
37758 | 17_pierce_1853 | constitutional | 0.086105 |
39123 | 17_pierce_1853 | expect | 0.084436 |
39703 | 17_pierce_1853 | government | 0.084048 |
36538 | 17_pierce_1853 | apparent | 0.080332 |
42621 | 17_pierce_1853 | regarded | 0.080332 |
43278 | 17_pierce_1853 | shall | 0.079615 |
40859 | 17_pierce_1853 | like | 0.079229 |
358588 | 18_buchanan_1857 | states | 0.208199 |
352722 | 18_buchanan_1857 | constitution | 0.188573 |
358243 | 18_buchanan_1857 | shall | 0.161784 |
357359 | 18_buchanan_1857 | question | 0.157007 |
359805 | 18_buchanan_1857 | whilst | 0.141119 |
359006 | 18_buchanan_1857 | territory | 0.140852 |
359430 | 18_buchanan_1857 | union | 0.126444 |
354668 | 18_buchanan_1857 | government | 0.119554 |
352651 | 18_buchanan_1857 | congress | 0.118357 |
356784 | 18_buchanan_1857 | people | 0.105501 |
208738 | 19_lincoln_1861 | constitution | 0.214478 |
215446 | 19_lincoln_1861 | union | 0.203738 |
208210 | 19_lincoln_1861 | case | 0.152422 |
214604 | 19_lincoln_1861 | states | 0.144861 |
212181 | 19_lincoln_1861 | minority | 0.131514 |
212800 | 19_lincoln_1861 | people | 0.130763 |
208372 | 19_lincoln_1861 | clause | 0.125738 |
210684 | 19_lincoln_1861 | government | 0.123837 |
214259 | 19_lincoln_1861 | shall | 0.123099 |
211735 | 19_lincoln_1861 | law | 0.122872 |
413720 | 20_lincoln_1865 | war | 0.267217 |
410490 | 20_lincoln_1865 | offenses | 0.234524 |
413868 | 20_lincoln_1865 | woe | 0.234524 |
408646 | 20_lincoln_1865 | god | 0.151269 |
410489 | 20_lincoln_1865 | offense | 0.141890 |
413830 | 20_lincoln_1865 | wills | 0.141890 |
405466 | 20_lincoln_1865 | answered | 0.131631 |
412370 | 20_lincoln_1865 | slaves | 0.123674 |
413424 | 20_lincoln_1865 | union | 0.114955 |
405400 | 20_lincoln_1865 | altogether | 0.111675 |
128578 | 21_grant_1869 | dollar | 0.270439 |
131782 | 21_grant_1869 | paying | 0.162263 |
128041 | 21_grant_1869 | deal | 0.152454 |
133513 | 21_grant_1869 | specie | 0.152454 |
128056 | 21_grant_1869 | debt | 0.135097 |
127904 | 21_grant_1869 | country | 0.127604 |
126308 | 21_grant_1869 | advisable | 0.116606 |
130751 | 21_grant_1869 | laws | 0.115834 |
131784 | 21_grant_1869 | payments | 0.108175 |
131780 | 21_grant_1869 | pay | 0.098658 |
303269 | 22_grant_1873 | proposition | 0.187222 |
299570 | 22_grant_1873 | domingo | 0.177516 |
304058 | 22_grant_1873 | santo | 0.177516 |
305193 | 22_grant_1873 | transit | 0.177516 |
305012 | 22_grant_1873 | territory | 0.121158 |
300160 | 22_grant_1873 | extermination | 0.118344 |
304617 | 22_grant_1873 | steam | 0.118344 |
304969 | 22_grant_1873 | telegraph | 0.118344 |
298885 | 22_grant_1873 | country | 0.117529 |
300153 | 22_grant_1873 | extension | 0.116618 |
397874 | 23_hayes_1877 | country | 0.186357 |
399663 | 23_hayes_1877 | government | 0.167722 |
396868 | 23_hayes_1877 | behalf | 0.128316 |
402307 | 23_hayes_1877 | public | 0.123944 |
401944 | 23_hayes_1877 | political | 0.121034 |
403583 | 23_hayes_1877 | states | 0.113587 |
401713 | 23_hayes_1877 | party | 0.112549 |
398461 | 23_hayes_1877 | dispute | 0.112503 |
401706 | 23_hayes_1877 | parties | 0.109554 |
402561 | 23_hayes_1877 | reform | 0.104365 |
291675 | 24_garfield_1881 | government | 0.186855 |
293791 | 24_garfield_1881 | people | 0.162132 |
289729 | 24_garfield_1881 | constitution | 0.158292 |
295595 | 24_garfield_1881 | states | 0.135047 |
296437 | 24_garfield_1881 | union | 0.132321 |
295778 | 24_garfield_1881 | suffrage | 0.119992 |
293368 | 24_garfield_1881 | negro | 0.118782 |
288756 | 24_garfield_1881 | authority | 0.117232 |
289658 | 24_garfield_1881 | congress | 0.112598 |
292726 | 24_garfield_1881 | law | 0.103639 |
68816 | 25_cleveland_1885 | people | 0.210468 |
66700 | 25_cleveland_1885 | government | 0.209164 |
68744 | 25_cleveland_1885 | partisan | 0.169436 |
69344 | 25_cleveland_1885 | public | 0.163662 |
70275 | 25_cleveland_1885 | shall | 0.129498 |
64754 | 25_cleveland_1885 | constitution | 0.127856 |
67470 | 25_cleveland_1885 | interests | 0.118207 |
66197 | 25_cleveland_1885 | extravagance | 0.111416 |
64363 | 25_cleveland_1885 | citizen | 0.102825 |
70708 | 25_cleveland_1885 | strife | 0.101661 |
383781 | 26_harrison_1889 | people | 0.172358 |
382723 | 26_harrison_1889 | laws | 0.154418 |
385585 | 26_harrison_1889 | states | 0.138614 |
378804 | 26_harrison_1889 | ballot | 0.137159 |
384309 | 26_harrison_1889 | public | 0.128566 |
383119 | 26_harrison_1889 | methods | 0.119162 |
385240 | 26_harrison_1889 | shall | 0.118483 |
381527 | 26_harrison_1889 | friendly | 0.104267 |
380968 | 26_harrison_1889 | european | 0.103349 |
379719 | 26_harrison_1889 | constitution | 0.089360 |
446774 | 27_cleveland_1893 | people | 0.221563 |
444658 | 27_cleveland_1893 | government | 0.148364 |
444533 | 27_cleveland_1893 | frugality | 0.128050 |
447302 | 27_cleveland_1893 | public | 0.102520 |
448203 | 27_cleveland_1893 | service | 0.101813 |
448817 | 27_cleveland_1893 | support | 0.099946 |
441413 | 27_cleveland_1893 | american | 0.097267 |
441192 | 27_cleveland_1893 | activity | 0.095964 |
444659 | 27_cleveland_1893 | governmental | 0.095964 |
442870 | 27_cleveland_1893 | countrymen | 0.088564 |
280659 | 28_mckinley_1897 | congress | 0.188773 |
285886 | 28_mckinley_1897 | revenue | 0.168489 |
284792 | 28_mckinley_1897 | people | 0.161797 |
282676 | 28_mckinley_1897 | government | 0.156633 |
283868 | 28_mckinley_1897 | loans | 0.149356 |
283766 | 28_mckinley_1897 | legislation | 0.126367 |
285320 | 28_mckinley_1897 | public | 0.107057 |
280123 | 28_mckinley_1897 | business | 0.106759 |
282710 | 28_mckinley_1897 | great | 0.105322 |
285900 | 28_mckinley_1897 | revision | 0.099571 |
229571 | 29_mckinley_1901 | islands | 0.216480 |
226964 | 29_mckinley_1901 | cuba | 0.206329 |
228682 | 29_mckinley_1901 | government | 0.153681 |
228061 | 29_mckinley_1901 | executive | 0.147843 |
229329 | 29_mckinley_1901 | inhabitants | 0.147374 |
226665 | 29_mckinley_1901 | congress | 0.141999 |
230798 | 29_mckinley_1901 | people | 0.116839 |
232602 | 29_mckinley_1901 | states | 0.102186 |
233447 | 29_mckinley_1901 | united | 0.100439 |
231074 | 29_mckinley_1901 | preparation | 0.097925 |
168610 | 30_roosevelt_theodore_1905 | regards | 0.199163 |
168177 | 30_roosevelt_theodore_1905 | problems | 0.182463 |
169958 | 30_roosevelt_theodore_1905 | tasks | 0.150068 |
162600 | 30_roosevelt_theodore_1905 | aright | 0.146306 |
168765 | 30_roosevelt_theodore_1905 | republic | 0.121428 |
166828 | 30_roosevelt_theodore_1905 | life | 0.118701 |
163230 | 30_roosevelt_theodore_1905 | cause | 0.116483 |
165199 | 30_roosevelt_theodore_1905 | faced | 0.115730 |
163619 | 30_roosevelt_theodore_1905 | conditions | 0.115373 |
170877 | 30_roosevelt_theodore_1905 | wish | 0.106771 |
517444 | 31_taft_1909 | interstate | 0.206957 |
514097 | 31_taft_1909 | business | 0.201378 |
520913 | 31_taft_1909 | tariff | 0.154802 |
518343 | 31_taft_1909 | negro | 0.153669 |
520442 | 31_taft_1909 | south | 0.129384 |
516650 | 31_taft_1909 | government | 0.121451 |
519229 | 31_taft_1909 | proper | 0.114684 |
519354 | 31_taft_1909 | race | 0.113413 |
516265 | 31_taft_1909 | feeling | 0.111255 |
514129 | 31_taft_1909 | canal | 0.110879 |
201719 | 32_wilson_1913 | great | 0.158659 |
203109 | 32_wilson_1913 | men | 0.142924 |
201245 | 32_wilson_1913 | familiar | 0.141669 |
205651 | 32_wilson_1913 | stirred | 0.141669 |
205718 | 32_wilson_1913 | studied | 0.141669 |
206055 | 32_wilson_1913 | things | 0.123915 |
202644 | 32_wilson_1913 | justice | 0.105759 |
201685 | 32_wilson_1913 | government | 0.105520 |
202824 | 32_wilson_1913 | life | 0.102168 |
202901 | 32_wilson_1913 | look | 0.100095 |
152880 | 33_wilson_1917 | wished | 0.228593 |
145888 | 33_wilson_1917 | counsel | 0.174639 |
150355 | 33_wilson_1917 | purpose | 0.152933 |
144219 | 33_wilson_1917 | action | 0.149960 |
151266 | 33_wilson_1917 | shall | 0.134404 |
152075 | 33_wilson_1917 | thought | 0.126568 |
151593 | 33_wilson_1917 | stand | 0.121313 |
151243 | 33_wilson_1917 | set | 0.111215 |
149976 | 33_wilson_1917 | politics | 0.108510 |
146621 | 33_wilson_1917 | drawn | 0.104783 |
251911 | 34_harding_1921 | world | 0.196268 |
244352 | 34_harding_1921 | civilization | 0.157095 |
243434 | 34_harding_1921 | america | 0.155684 |
251738 | 34_harding_1921 | war | 0.120631 |
249645 | 34_harding_1921 | relationship | 0.118846 |
249756 | 34_harding_1921 | republic | 0.117160 |
248578 | 34_harding_1921 | order | 0.110128 |
251362 | 34_harding_1921 | understanding | 0.109915 |
248386 | 34_harding_1921 | new | 0.097567 |
243442 | 34_harding_1921 | amid | 0.095077 |
262889 | 35_coolidge_1925 | country | 0.120814 |
266602 | 35_coolidge_1925 | ought | 0.116721 |
267750 | 35_coolidge_1925 | represents | 0.114495 |
268950 | 35_coolidge_1925 | tax | 0.112826 |
264712 | 35_coolidge_1925 | great | 0.109908 |
267259 | 35_coolidge_1925 | property | 0.108285 |
266728 | 35_coolidge_1925 | party | 0.107300 |
268585 | 35_coolidge_1925 | stands | 0.107257 |
266772 | 35_coolidge_1925 | peace | 0.104170 |
266794 | 35_coolidge_1925 | people | 0.101306 |
106831 | 36_hoover_1929 | sup | 0.296865 |
102696 | 36_hoover_1929 | government | 0.202690 |
101845 | 36_hoover_1929 | enforcement | 0.194371 |
99049 | 36_hoover_1929 | 18th | 0.134706 |
105231 | 36_hoover_1929 | progress | 0.132406 |
102305 | 36_hoover_1929 | federal | 0.126183 |
103050 | 36_hoover_1929 | ideals | 0.113418 |
100143 | 36_hoover_1929 | business | 0.108323 |
103754 | 36_hoover_1929 | laws | 0.107051 |
104790 | 36_hoover_1929 | peace | 0.103883 |
345889 | 37_roosevelt_franklin_1933 | helped | 0.215644 |
346734 | 37_roosevelt_franklin_1933 | leadership | 0.191084 |
349671 | 37_roosevelt_franklin_1933 | stricken | 0.129390 |
344742 | 37_roosevelt_franklin_1933 | emergency | 0.123225 |
344399 | 37_roosevelt_franklin_1933 | discipline | 0.117971 |
348804 | 37_roosevelt_franklin_1933 | respects | 0.117971 |
347233 | 37_roosevelt_franklin_1933 | money | 0.113371 |
347316 | 37_roosevelt_franklin_1933 | national | 0.110570 |
348520 | 37_roosevelt_franklin_1933 | recovery | 0.102349 |
342197 | 37_roosevelt_franklin_1933 | action | 0.097007 |
335172 | 38_roosevelt_franklin_1937 | democracy | 0.178041 |
336670 | 38_roosevelt_franklin_1937 | government | 0.177222 |
338146 | 38_roosevelt_franklin_1937 | millions | 0.140722 |
338671 | 38_roosevelt_franklin_1937 | paint | 0.121461 |
338786 | 38_roosevelt_franklin_1937 | people | 0.115789 |
335664 | 38_roosevelt_franklin_1937 | economic | 0.114184 |
339954 | 38_roosevelt_franklin_1937 | road | 0.112944 |
339205 | 38_roosevelt_franklin_1937 | progress | 0.104463 |
335255 | 38_roosevelt_franklin_1937 | despair | 0.100302 |
338316 | 38_roosevelt_franklin_1937 | nation | 0.099688 |
272179 | 39_roosevelt_franklin_1941 | democracy | 0.244486 |
274674 | 39_roosevelt_franklin_1941 | know | 0.189060 |
277494 | 39_roosevelt_franklin_1941 | speaks | 0.183385 |
271040 | 39_roosevelt_franklin_1941 | br | 0.163241 |
275323 | 39_roosevelt_franklin_1941 | nation | 0.162241 |
270431 | 39_roosevelt_franklin_1941 | america | 0.140133 |
274816 | 39_roosevelt_franklin_1941 | life | 0.117815 |
277525 | 39_roosevelt_franklin_1941 | spirit | 0.114445 |
273521 | 39_roosevelt_franklin_1941 | freedom | 0.109295 |
270043 | 39_roosevelt_franklin_1941 | 1941 | 0.108911 |
472725 | 40_roosevelt_franklin_1945 | learned | 0.300396 |
475997 | 40_roosevelt_franklin_1945 | test | 0.194731 |
468022 | 40_roosevelt_franklin_1945 | 1945 | 0.189849 |
475230 | 40_roosevelt_franklin_1945 | shall | 0.173637 |
476211 | 40_roosevelt_franklin_1945 | trend | 0.172292 |
473184 | 40_roosevelt_franklin_1945 | mistakes | 0.159835 |
473749 | 40_roosevelt_franklin_1945 | peace | 0.159442 |
476096 | 40_roosevelt_franklin_1945 | today | 0.154299 |
476538 | 40_roosevelt_franklin_1945 | upward | 0.150172 |
471569 | 40_roosevelt_franklin_1945 | gain | 0.142278 |
143923 | 41_truman_1949 | world | 0.196051 |
140343 | 41_truman_1949 | nations | 0.194029 |
141225 | 41_truman_1949 | program | 0.171656 |
140809 | 41_truman_1949 | peoples | 0.166989 |
137194 | 41_truman_1949 | democracy | 0.154140 |
138536 | 41_truman_1949 | freedom | 0.149297 |
136513 | 41_truman_1949 | communism | 0.147134 |
140786 | 41_truman_1949 | peace | 0.144167 |
136902 | 41_truman_1949 | countries | 0.137013 |
141543 | 41_truman_1949 | recovery | 0.135785 |
462497 | 42_eisenhower_1953 | free | 0.205803 |
462199 | 42_eisenhower_1953 | faith | 0.154561 |
467887 | 42_eisenhower_1953 | world | 0.146449 |
464773 | 42_eisenhower_1953 | peoples | 0.139466 |
465172 | 42_eisenhower_1953 | productivity | 0.133826 |
466648 | 42_eisenhower_1953 | strength | 0.130430 |
464750 | 42_eisenhower_1953 | peace | 0.123845 |
462500 | 42_eisenhower_1953 | freedom | 0.123320 |
466231 | 42_eisenhower_1953 | shall | 0.105970 |
462919 | 42_eisenhower_1953 | hold | 0.105028 |
485885 | 43_eisenhower_1957 | world | 0.193893 |
480498 | 43_eisenhower_1957 | freedom | 0.179599 |
484143 | 43_eisenhower_1957 | seek | 0.176008 |
482305 | 43_eisenhower_1957 | nations | 0.175538 |
482771 | 43_eisenhower_1957 | peoples | 0.158270 |
484670 | 43_eisenhower_1957 | strives | 0.146428 |
482748 | 43_eisenhower_1957 | peace | 0.136639 |
480873 | 43_eisenhower_1957 | help | 0.132688 |
479515 | 43_eisenhower_1957 | divided | 0.115450 |
482262 | 43_eisenhower_1957 | mr | 0.111886 |
391774 | 44_kennedy_1961 | let | 0.267869 |
394306 | 44_kennedy_1961 | sides | 0.262849 |
392921 | 44_kennedy_1961 | pledge | 0.160960 |
387632 | 44_kennedy_1961 | ask | 0.107713 |
387864 | 44_kennedy_1961 | begin | 0.106495 |
388991 | 44_kennedy_1961 | dare | 0.106495 |
395895 | 44_kennedy_1961 | world | 0.103110 |
390313 | 44_kennedy_1961 | final | 0.102311 |
392370 | 44_kennedy_1961 | new | 0.096600 |
390120 | 44_kennedy_1961 | explore | 0.094223 |
109283 | 45_johnson_1965 | change | 0.276090 |
109919 | 45_johnson_1965 | covenant | 0.242891 |
113001 | 45_johnson_1965 | man | 0.174391 |
113063 | 45_johnson_1965 | mastery | 0.153532 |
113341 | 45_johnson_1965 | nation | 0.152475 |
116457 | 45_johnson_1965 | union | 0.150512 |
113540 | 45_johnson_1965 | old | 0.129184 |
116299 | 45_johnson_1965 | trying | 0.109663 |
113811 | 45_johnson_1965 | people | 0.108677 |
111846 | 45_johnson_1965 | harvest | 0.102355 |
458677 | 46_nixon_1969 | voices | 0.208854 |
455751 | 46_nixon_1969 | peace | 0.144624 |
454767 | 46_nixon_1969 | let | 0.140977 |
452636 | 46_nixon_1969 | earth | 0.139513 |
454654 | 46_nixon_1969 | know | 0.137969 |
454963 | 46_nixon_1969 | man | 0.135416 |
455773 | 46_nixon_1969 | people | 0.131270 |
458888 | 46_nixon_1969 | world | 0.128264 |
456896 | 46_nixon_1969 | rhetoric | 0.119219 |
453462 | 46_nixon_1969 | forward | 0.113215 |
9460 | 47_nixon_1973 | america | 0.307074 |
13816 | 47_nixon_1973 | let | 0.282212 |
14800 | 47_nixon_1973 | peace | 0.211567 |
16008 | 47_nixon_1973 | role | 0.190395 |
17937 | 47_nixon_1973 | world | 0.177760 |
14983 | 47_nixon_1973 | policies | 0.176224 |
15848 | 47_nixon_1973 | responsibility | 0.164016 |
14412 | 47_nixon_1973 | new | 0.158606 |
9147 | 47_nixon_1973 | abroad | 0.154815 |
12977 | 47_nixon_1973 | home | 0.126653 |
190049 | 48_carter_1977 | br | 0.222574 |
194332 | 48_carter_1977 | nation | 0.191717 |
191619 | 48_carter_1977 | dream | 0.181515 |
196678 | 48_carter_1977 | strength | 0.147104 |
194392 | 48_carter_1977 | new | 0.142111 |
194143 | 48_carter_1977 | micah | 0.118797 |
197040 | 48_carter_1977 | thee | 0.107811 |
196534 | 48_carter_1977 | spirit | 0.107000 |
193001 | 48_carter_1977 | human | 0.101203 |
191853 | 48_carter_1977 | enhance | 0.100016 |
156690 | 49_reagan_1981 | government | 0.162397 |
153447 | 49_reagan_1981 | americans | 0.156895 |
156925 | 49_reagan_1981 | heroes | 0.137410 |
153904 | 49_reagan_1981 | believe | 0.136126 |
161636 | 49_reagan_1981 | ve | 0.115339 |
159206 | 49_reagan_1981 | productivity | 0.104753 |
161801 | 49_reagan_1981 | weapon | 0.104753 |
156534 | 49_reagan_1981 | freedom | 0.102964 |
155625 | 49_reagan_1981 | dreams | 0.101106 |
161131 | 49_reagan_1981 | today | 0.093813 |
21705 | 50_reagan_1985 | government | 0.161165 |
21549 | 50_reagan_1985 | freedom | 0.159998 |
23452 | 50_reagan_1985 | nuclear | 0.153623 |
26651 | 50_reagan_1985 | ve | 0.153623 |
26817 | 50_reagan_1985 | weapons | 0.140173 |
23821 | 50_reagan_1985 | people | 0.137038 |
26936 | 50_reagan_1985 | world | 0.127236 |
21963 | 50_reagan_1985 | history | 0.104777 |
22020 | 50_reagan_1985 | human | 0.104777 |
25219 | 50_reagan_1985 | senator | 0.102416 |
119593 | 51_bush_george_h_w_1989 | don | 0.186313 |
118075 | 51_bush_george_h_w_1989 | breeze | 0.184416 |
122400 | 51_bush_george_h_w_1989 | new | 0.137266 |
120557 | 51_bush_george_h_w_1989 | friends | 0.136820 |
119597 | 51_bush_george_h_w_1989 | door | 0.133889 |
125912 | 51_bush_george_h_w_1989 | word | 0.131722 |
122302 | 51_bush_george_h_w_1989 | mr | 0.126821 |
120800 | 51_bush_george_h_w_1989 | hand | 0.125086 |
118005 | 51_bush_george_h_w_1989 | blowing | 0.110649 |
125064 | 51_bush_george_h_w_1989 | things | 0.110609 |
252433 | 52_clinton_1993 | america | 0.318908 |
260910 | 52_clinton_1993 | world | 0.226715 |
252436 | 52_clinton_1993 | americans | 0.206865 |
260120 | 52_clinton_1993 | today | 0.185539 |
253267 | 52_clinton_1993 | change | 0.170522 |
258709 | 52_clinton_1993 | renewal | 0.136867 |
259135 | 52_clinton_1993 | season | 0.136867 |
256028 | 52_clinton_1993 | idea | 0.134993 |
256789 | 52_clinton_1993 | let | 0.132521 |
257795 | 52_clinton_1993 | people | 0.129272 |
28273 | 53_clinton_1997 | century | 0.321300 |
32410 | 53_clinton_1997 | new | 0.279600 |
27458 | 53_clinton_1997 | america | 0.199997 |
33257 | 53_clinton_1997 | promise | 0.164327 |
35935 | 53_clinton_1997 | world | 0.135071 |
31724 | 53_clinton_1997 | land | 0.131027 |
32350 | 53_clinton_1997 | nation | 0.117062 |
27461 | 53_clinton_1997 | americans | 0.115029 |
35131 | 53_clinton_1997 | time | 0.108057 |
31814 | 53_clinton_1997 | let | 0.105270 |
322651 | 54_bush_george_w_2001 | story | 0.341166 |
315426 | 54_bush_george_w_2001 | america | 0.193152 |
316343 | 54_bush_george_w_2001 | civility | 0.160853 |
320318 | 54_bush_george_w_2001 | nation | 0.130448 |
315304 | 54_bush_george_w_2001 | affirm | 0.120640 |
319026 | 54_bush_george_w_2001 | ideals | 0.109491 |
315429 | 54_bush_george_w_2001 | americans | 0.108207 |
321225 | 54_bush_george_w_2001 | promise | 0.108207 |
316509 | 54_bush_george_w_2001 | compassion | 0.107388 |
316337 | 54_bush_george_w_2001 | citizens | 0.106730 |
435503 | 55_bush_george_w_2005 | freedom | 0.349948 |
432413 | 55_bush_george_w_2005 | america | 0.284882 |
436792 | 55_bush_george_w_2005 | liberty | 0.174494 |
432416 | 55_bush_george_w_2005 | americans | 0.140443 |
440278 | 55_bush_george_w_2005 | tyranny | 0.127272 |
439154 | 55_bush_george_w_2005 | seen | 0.110386 |
437305 | 55_bush_george_w_2005 | nation | 0.096199 |
433200 | 55_bush_george_w_2005 | cause | 0.092545 |
435917 | 55_bush_george_w_2005 | history | 0.092422 |
433132 | 55_bush_george_w_2005 | came | 0.091988 |
54455 | 56_obama_2009 | america | 0.148351 |
59347 | 56_obama_2009 | nation | 0.120229 |
59407 | 56_obama_2009 | new | 0.118002 |
62142 | 56_obama_2009 | today | 0.114792 |
57639 | 56_obama_2009 | generation | 0.100654 |
58811 | 56_obama_2009 | let | 0.091100 |
58627 | 56_obama_2009 | jobs | 0.090727 |
55960 | 56_obama_2009 | crisis | 0.087235 |
57828 | 56_obama_2009 | hard | 0.084859 |
62910 | 56_obama_2009 | women | 0.084859 |
418595 | 57_obama_2013 | journey | 0.167591 |
415909 | 57_obama_2013 | creed | 0.139659 |
417599 | 57_obama_2013 | generation | 0.127260 |
414415 | 57_obama_2013 | america | 0.125044 |
415519 | 57_obama_2013 | complete | 0.114891 |
420751 | 57_obama_2013 | requires | 0.114891 |
419777 | 57_obama_2013 | people | 0.110351 |
422088 | 57_obama_2013 | time | 0.105563 |
422102 | 57_obama_2013 | today | 0.103668 |
416980 | 57_obama_2013 | evident | 0.100896 |
504405 | 58_trump_2017 | america | 0.350162 |
506586 | 58_trump_2017 | dreams | 0.156436 |
504406 | 58_trump_2017 | american | 0.149226 |
508577 | 58_trump_2017 | jobs | 0.142766 |
510263 | 58_trump_2017 | protected | 0.132439 |
509410 | 58_trump_2017 | obama | 0.120288 |
509767 | 58_trump_2017 | people | 0.112370 |
512002 | 58_trump_2017 | thank | 0.109171 |
504990 | 58_trump_2017 | borders | 0.107075 |
512597 | 58_trump_2017 | ve | 0.107075 |
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
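To see what this sort-then-group pattern does, here is a minimal sketch on a tiny made-up table. The names `toy_df` and `toy_top` and all of the scores are hypothetical stand-ins for `tfidf_df` and `top_tfidf`, and the example keeps the top 2 terms per document instead of the top 10.

```python
import pandas as pd

# Hypothetical toy data in the same long format as tfidf_df:
# one row per (document, term) pair with its tf-idf score
toy_df = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b", "b"],
    "term": ["war", "peace", "union", "jobs", "peace", "world"],
    "tfidf": [0.10, 0.30, 0.20, 0.05, 0.25, 0.15],
})

# Sort documents alphabetically and scores from high to low within each document,
# then keep the first 2 rows of every document group
toy_top = (
    toy_df.sort_values(by=["document", "tfidf"], ascending=[True, False])
    .groupby("document")
    .head(2)
)

print(toy_top)
```

Because the rows are already sorted by score inside each document, `head(2)` returns each document's two highest-scoring terms while keeping the original row index, which is why the index numbers in the tables here look scattered.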
We can zoom in on particular words and particular documents.
top_tfidf[top_tfidf['term'].str.contains('women')]
 | document | term | tfidf |
---|---|---|---|
62910 | 56_obama_2009 | women | 0.084859 |
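It turns out that “women” ranks among the ten highest-scoring tf-idf terms in only one address: Obama’s 2009 Inaugural Address, where it scores 0.084859. That makes the term distinctive for this address, though it sits at the bottom of his top ten, tied with “hard”.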
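One thing to keep in mind is that `str.contains` matches substrings, so a filter for 'women' would also return longer terms that contain it. The sketch below uses a small made-up table (the documents, terms, and scores are hypothetical) to compare a substring match with an exact match.

```python
import pandas as pd

# Hypothetical rows standing in for top_tfidf
toy_top = pd.DataFrame({
    "document": ["x_2009", "y_2013", "z_2017"],
    "term": ["women", "womens", "men"],
    "tfidf": [0.08, 0.07, 0.06],
})

# Substring match: keeps every term that contains "women" anywhere in it
substring_matches = toy_top[toy_top["term"].str.contains("women")]

# Exact match: keeps only rows whose term is exactly "women"
exact_matches = toy_top[toy_top["term"] == "women"]

print(substring_matches)
print(exact_matches)
```

For the inaugural addresses both filters happen to return the same single row, but the exact match is the safer choice when you only want one specific word.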
top_tfidf[top_tfidf['document'].str.contains('obama')]
 | document | term | tfidf |
---|---|---|---|
54455 | 56_obama_2009 | america | 0.148351 |
59347 | 56_obama_2009 | nation | 0.120229 |
59407 | 56_obama_2009 | new | 0.118002 |
62142 | 56_obama_2009 | today | 0.114792 |
57639 | 56_obama_2009 | generation | 0.100654 |
58811 | 56_obama_2009 | let | 0.091100 |
58627 | 56_obama_2009 | jobs | 0.090727 |
55960 | 56_obama_2009 | crisis | 0.087235 |
57828 | 56_obama_2009 | hard | 0.084859 |
62910 | 56_obama_2009 | women | 0.084859 |
418595 | 57_obama_2013 | journey | 0.167591 |
415909 | 57_obama_2013 | creed | 0.139659 |
417599 | 57_obama_2013 | generation | 0.127260 |
414415 | 57_obama_2013 | america | 0.125044 |
415519 | 57_obama_2013 | complete | 0.114891 |
420751 | 57_obama_2013 | requires | 0.114891 |
419777 | 57_obama_2013 | people | 0.110351 |
422088 | 57_obama_2013 | time | 0.105563 |
422102 | 57_obama_2013 | today | 0.103668 |
416980 | 57_obama_2013 | evident | 0.100896 |
top_tfidf[top_tfidf['document'].str.contains('trump')]
 | document | term | tfidf |
---|---|---|---|
504405 | 58_trump_2017 | america | 0.350162 |
506586 | 58_trump_2017 | dreams | 0.156436 |
504406 | 58_trump_2017 | american | 0.149226 |
508577 | 58_trump_2017 | jobs | 0.142766 |
510263 | 58_trump_2017 | protected | 0.132439 |
509410 | 58_trump_2017 | obama | 0.120288 |
509767 | 58_trump_2017 | people | 0.112370 |
512002 | 58_trump_2017 | thank | 0.109171 |
504990 | 58_trump_2017 | borders | 0.107075 |
512597 | 58_trump_2017 | ve | 0.107075 |
top_tfidf[top_tfidf['document'].str.contains('kennedy')]
 | document | term | tfidf |
---|---|---|---|
391774 | 44_kennedy_1961 | let | 0.267869 |
394306 | 44_kennedy_1961 | sides | 0.262849 |
392921 | 44_kennedy_1961 | pledge | 0.160960 |
387632 | 44_kennedy_1961 | ask | 0.107713 |
387864 | 44_kennedy_1961 | begin | 0.106495 |
388991 | 44_kennedy_1961 | dare | 0.106495 |
395895 | 44_kennedy_1961 | world | 0.103110 |
390313 | 44_kennedy_1961 | final | 0.102311 |
392370 | 44_kennedy_1961 | new | 0.096600 |
390120 | 44_kennedy_1961 | explore | 0.094223 |
Visualize TF-IDF#
We can also visualize our TF-IDF results with the data visualization library Altair.
!pip install altair
Let’s make a heatmap that shows the highest TF-IDF scoring words for each president, and let’s put a red dot next to two terms of interest: “war” and “peace”:
The code below was contributed by Eric Monson. Thanks, Eric!
import altair as alt
import numpy as np
# Terms in this list will get a red dot in the visualization
term_list = ['war', 'peace']
# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001
# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)
# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)
# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')
    )
)
# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)
# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
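The code above adds a tiny random number to each score so that tied terms (such as “hard” and “women” in Obama’s 2009 address) land in separate rank columns. Since no random seed is set, the order of tied terms can change from run to run. As an alternative sketch, not the approach used above, the rank could be computed ahead of time in pandas with `rank(method="first")`, which gives tied rows distinct ranks without any randomness. The dataframe name `toy_top` and its rows are a hypothetical excerpt used only for illustration.

```python
import pandas as pd

# Hypothetical excerpt of top_tfidf, including one tied pair of scores
toy_top = pd.DataFrame({
    "document": ["a", "a", "a", "b", "b"],
    "term": ["hard", "women", "crisis", "let", "sides"],
    "tfidf": [0.084859, 0.084859, 0.087235, 0.267869, 0.262849],
})

# Rank terms inside each document from highest to lowest score;
# method="first" gives tied scores different ranks instead of a shared one
toy_top["rank"] = (
    toy_top.groupby("document")["tfidf"]
    .rank(method="first", ascending=False)
    .astype(int)
)

print(toy_top)
```

A precomputed column like this could then be encoded directly as `x = 'rank:O'`, in place of the `transform_window` step, so the chart looks the same every time it is drawn.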
Your Turn!#
Take a few minutes to explore the dataframe below and then answer the following questions.
1. What is the difference between a tf-idf score and raw word frequency?
2. Based on the dataframe above, what is one potential problem or limitation that you notice with tf-idf scores?
3. What’s another collection of texts that you think might be interesting to analyze with tf-idf scores? Why?