TF-IDF

In the next lessons, we’re going to learn about a text analysis method called term frequency–inverse document frequency, often abbreviated tf-idf.

While calculating the most frequent words in a text can be useful, those words usually aren’t the most interesting ones, even if we get rid of stop words (“the,” “and,” “to,” etc.). Tf-idf is a method that builds on word frequency but tries more specifically to identify the most distinctive words: those that are frequent in one text but comparatively rare across the rest of the collection.
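To make the idea concrete, here is a minimal sketch of one common textbook formulation: term frequency multiplied by the log of (number of documents / number of documents containing the term). The three toy documents and the helper name `tf_idf` are hypothetical, made up for illustration, and libraries often use smoothed or normalized variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy collection: each document is a list of lowercase words.
documents = {
    "doc_a": "the whale swam past the ship".split(),
    "doc_b": "the ship sailed into the harbor".split(),
    "doc_c": "the garden was full of roses".split(),
}

def tf_idf(term, doc_name):
    """Unsmoothed tf-idf: raw count in the document times log(N / df)."""
    tf = Counter(documents[doc_name])[term]  # how often the term appears in this document
    df = sum(1 for words in documents.values() if term in words)  # documents containing the term
    if df == 0:
        return 0.0  # the term appears nowhere in the collection
    idf = math.log(len(documents) / df)
    return tf * idf

for term in ["the", "ship", "whale"]:
    print(term, round(tf_idf(term, "doc_a"), 3))
```

In this sketch, “the” is the most frequent word in `doc_a`, but because it appears in every document its score drops to zero, while “whale,” which appears in only one document, scores highest.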

If you already have a collection of plain text (.txt) files that you’d like to analyze, one of the easiest ways to calculate tf-idf scores is to use the Python library scikit-learn. It has a quick and nifty class called TfidfVectorizer, which does all the math for you behind the scenes.

We will also cover how to calculate tf-idf scores for in-copyright texts using extracted features from the HathiTrust Digital Library, which contains digitized books from Google Books as well as many university libraries.