Topic Modeling — Overview

In the next lessons, we’re going to learn about a text analysis method called topic modeling. This method will help us identify the main topics or discourses within a collection of texts or within a single text that has been separated into smaller text chunks.

When we calculated term frequency-inverse document frequency (tf-idf) scores, we identified individual words that were distinctive of particular documents (i.e., words that show up often in one document but rarely in the rest of the corpus). When we topic model, we’re instead going to identify clusters of words that tend to show up together throughout the corpus.

Why Are Topic Models Useful?

Topic models are useful for understanding collections of texts in their broadest outlines and themes. If you wanted to get a sense of the primary subjects discussed in thousands of journal articles published over multiple decades, then topic modeling might be a good choice. Topic models can also be helpful for looking at the fluctuation of topics and themes over time (this is a time series approach that we will discuss in the next lesson) or finding clusters of texts that contain the same or similar topics.

In these lessons, we will train a topic model on a collection of 378 obituaries published by The New York Times. As we’ll find out, the topics—or clusters of words—that emerge from this analysis are broadly related to art, literature, sports, the military, and politics, among other subjects. These results are pretty satisfying! They seem to capture what the obituaries are “about.” Frida Kahlo’s obituary contains a significant discussion of her art, while Nella Larsen’s obituary discusses her novels, Jackie Robinson’s obituary discusses his role in sports, and Ulysses S. Grant’s obituary discusses his military career. How does the topic model “know” or “discover” what these NYT obituaries are about?

How LDA Topic Models Work

There are numerous kinds of topic models, but the most popular and widely used kind is latent Dirichlet allocation (LDA). It’s so popular, in fact, that “LDA” and “topic model” are sometimes used interchangeably, even though LDA is only one type.

LDA math is pretty complicated. We’re not going to get very deep into the math here. But we are going to introduce two important concepts that will help us conceptually understand how LDA topic models work.

1) LDA is an Unsupervised Algorithm

Topic modeling is a kind of machine learning. Machine learning always sounds like a fancy, scary term, but it really just means that computer algorithms are performing tasks without being explicitly programmed to do so and that they are “learning” how to perform these tasks by being fed training data. In the field of machine learning, algorithms are typically split into two broad categories: supervised and unsupervised. These categories describe how the algorithms are “trained” or how they “learn.” LDA is an unsupervised algorithm.

If an algorithm is supervised, that means a researcher is helping to guide it with some kind of information, like labels. For example, if you wanted to create an algorithm that could identify pictures of cats vs. pictures of dogs, you could train it with a bunch of pictures of cats that were clearly labeled CAT and a bunch of pictures of dogs that were clearly labeled DOG. The algorithm would then be able to learn which features are specific to cats vs. dogs because you explicitly told it: this is a picture of a cat; this is a picture of a dog.

If an algorithm is unsupervised, that means a researcher does not train it with outside information. There are no labels. The algorithm just learns that pictures of cats are more similar to each other and pictures of dogs are more similar to each other. The algorithm doesn’t really know that one cluster is cats and one cluster is dogs; it just knows that there are two distinct clusters.
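To make the idea of learning without labels concrete, here is a minimal sketch in plain Python. It is not LDA, and the numbers are invented: we pretend each picture has been boiled down to a single made-up score, and a bare-bones clustering routine (the hypothetical function `two_means`) splits the scores into two groups without ever being told which pictures are cats and which are dogs.

```python
# Toy, invented measurements: one made-up score per picture.
# No labels are attached to any of them.
scores = [0.9, 1.1, 1.0, 0.8, 4.9, 5.2, 5.0, 5.1]

def two_means(values, iterations=10):
    """Split values into two clusters with a bare-bones 1-D k-means."""
    # Start with the smallest and largest values as the two cluster centers.
    centers = [min(values), max(values)]
    assignments = []
    for _ in range(iterations):
        # Assign each value to its nearest center (cluster 0 or cluster 1).
        assignments = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of the values assigned to it.
        for k in (0, 1):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assignments

clusters = two_means(scores)
print(clusters)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Notice that the output only says “cluster 0” and “cluster 1.” Deciding that one cluster means cats and the other means dogs is still up to a human.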

Because LDA is an unsupervised algorithm, we don’t tell our topic model which words or topics to look for. We only tell the topic model how many topics (or clusters of words) we want returned. The topic model doesn’t know anything about Frida Kahlo, Nella Larsen, and Jackie Robinson. It doesn’t know anything about art, literature, and sports.

2) LDA is a Probabilistic Model

LDA fundamentally relies on statistics and probabilities. Rather than calculating precise and unchanging metrics about a given corpus, a topic model makes a series of very sophisticated guesses about the corpus. Because these guesses begin from a random starting point, they will change slightly every time we run the topic model, unless we fix the random seed. This is important to remember as we analyze, interpret, and make arguments based on our results: all of our results in this lesson will be probabilities, not fixed measurements.
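Here is a small sketch of why probabilistic results vary from run to run. It does not run LDA; it only uses Python’s built-in `random` module to draw from an invented probability distribution over three hypothetical topic names. The function `sample_topics` and its `seed` parameter are assumptions made for this illustration.

```python
import random

# A made-up probability distribution over three hypothetical topic names.
topics = ["art", "sports", "politics"]
weights = [0.6, 0.3, 0.1]

def sample_topics(seed, n=10):
    """Draw n topics at random, using a seeded generator for repeatability."""
    rng = random.Random(seed)
    return rng.choices(topics, weights=weights, k=n)

run_a = sample_topics(seed=1)
run_b = sample_topics(seed=2)
run_a_again = sample_topics(seed=1)

print(run_a)
print(run_b)
print(run_a == run_a_again)  # True: the same seed gives the same draws
```

Different seeds will generally give different draws, while reusing a seed reproduces a run exactly. Many topic modeling tools expose a similar seed setting for the same reason.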

When we tell the topic model that we want to extract 15 topics from the NYT obituaries, here’s what the topic model does:

The topic model starts off with a slightly silly, backwards assumption. It assumes that every single one of the 378 obituaries in the corpus was written by someone who drew their words exclusively from 15 mystery topics, or 15 clusters of words. To put it another way, using a different medium: the topic model assumes that there was one master artist with 15 different paints on her palette, who created all the NYT obituaries by dipping her brush into these 15 paints alone, applying and blending them onto each canvas in different proportions. The topic model is trying to discover the 15 mystery topics that created all the NYT obituaries, as well as the mixture of these topics that makes up each individual NYT obituary.
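The “backwards assumption” above can be sketched as a tiny program that writes a document from topics. This is only an illustration under invented assumptions: it uses two topics instead of 15, the words and probabilities are made up, and `generate_document` is a hypothetical helper rather than part of any topic modeling library.

```python
import random

# Two hypothetical topics, each a probability distribution over words.
# The words and probabilities are invented for illustration.
topics = {
    "literature": {"book": 0.5, "novel": 0.3, "writer": 0.2},
    "film": {"film": 0.5, "actor": 0.3, "hollywood": 0.2},
}

def generate_document(topic_mixture, length, rng):
    """Write a 'document' by repeatedly picking a topic, then a word from it."""
    words = []
    for _ in range(length):
        # Step 1: pick a topic according to this document's mixture (the paints).
        topic = rng.choices(
            list(topic_mixture), weights=list(topic_mixture.values())
        )[0]
        # Step 2: pick a word according to that topic's word probabilities.
        word_probs = topics[topic]
        word = rng.choices(
            list(word_probs), weights=list(word_probs.values())
        )[0]
        words.append(word)
    return words

rng = random.Random(0)
# A document that is mostly "literature" with a little "film" blended in.
doc = generate_document({"literature": 0.8, "film": 0.2}, length=12, rng=rng)
print(doc)
```

A topic model works in the opposite direction: it sees only documents like `doc` and tries to recover both the topics and each document’s mixture.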

The topic model begins by taking a completely wild guess about the 15 topics, but then it iterates through all the words in all the documents and makes better and better guesses. If the word “book” keeps showing up with the words “published” and “writing,” and if all three words keep showing up in the same kinds of NYT obituaries, then the topic model starts to suspect that these three words should belong to the same topic. If the word “film” keeps showing up with “Hollywood” and “actor,” then the topic model suspects that they should belong to the same topic, too. The topic model finally arrives at its best guesses for the 15 topics that most likely created all the NYT obituaries.
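For readers who want to see the guess-and-refine loop spelled out, here is a minimal sketch of one common way to fit LDA, called collapsed Gibbs sampling. It is an assumption that this resembles what any particular tool does internally; real implementations are far more optimized. The function `fit_lda`, the toy documents, and the settings `alpha`, `beta`, and `iterations` are all chosen for illustration.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that summarize the current guesses.
    doc_topic = [[0] * num_topics for _ in docs]            # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                          # total words per topic

    # Wild first guess: give every word a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Refine: revisit each word and re-guess its topic given all other guesses.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Temporarily remove this word from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A topic is more likely if this document already uses it a lot
                # and if this word already appears in that topic a lot.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new guess and put the word back into the counts.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

docs = [
    ["book", "published", "writing", "book", "novel"],
    ["writing", "novel", "published", "book"],
    ["film", "hollywood", "actor", "film"],
    ["actor", "film", "hollywood", "hollywood"],
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"Topic {k}: {top}")
```

On a toy corpus like this one, words that keep appearing in the same documents tend to end up in the same topic, but the exact result depends on the random seed.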

Topics and Labels

Strictly speaking, a “topic” is a set of probabilities, one for every word in the corpus. What we usually look at, and what we call a “topic” in these lessons, is just a list of the most probable words for that topic, sorted in descending order of probability. The most probable word for the topic is the first word. Here are some sample topics from a previous run on the NYT obituaries:

✨Topic 3✨

['book', 'new', 'wrote', 'work', 'published', 'art', 'years', 'writing', 'world', 'writer', 'books', 'novel', 'paris', 'life', 'story']

✨Topic 6✨

['president', 'state', 'court', 'roosevelt', 'justice', 'house', 'years', 'law', 'party', 'political', 'republican', 'senator', 'governor', 'democratic', 'campaign']

✨Topic 10✨

['miss', 'film', 'years', 'theater', 'broadway', 'movie', 'films', 'hollywood', 'stage', 'movies', 'actor', 'new', 'director', 'york', 'show']
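Lists like the ones above can be produced by ranking a topic’s words by probability and keeping the first few. The sketch below shows that step in plain Python. The probabilities are invented for illustration (they are not the model’s actual values), and `top_words` is a hypothetical helper.

```python
# Made-up word probabilities for one hypothetical topic.
word_probs = {"novel": 0.08, "book": 0.15, "paris": 0.03, "wrote": 0.11, "art": 0.05}

def top_words(probabilities, n):
    """Return the n most probable words, most probable first."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:n]

print(top_words(word_probs, 3))  # ['book', 'wrote', 'novel']
```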

Topic models start to get more powerful when we, as human researchers, analyze the most probable words for every topic and summarize what these words have in common. This summary can then be used as a descriptive label for the topic. Remember, since an LDA topic model is an unsupervised algorithm, it doesn’t know what these words mean in relationship to one another. It’s up to us, as the human researchers, to make meaning out of the topics.

For example, we might label the topics as follows:

✨Topic 3: Literature

['book', 'new', 'wrote', 'work', 'published', 'art', 'years', 'writing', 'world', 'writer', 'books', 'novel', 'paris', 'life', 'story']

✨Topic 6: Politics

['president', 'state', 'court', 'roosevelt', 'justice', 'house', 'years', 'law', 'party', 'political', 'republican', 'senator', 'governor', 'democratic', 'campaign']

✨Topic 10: Hollywood

['miss', 'film', 'years', 'theater', 'broadway', 'movie', 'films', 'hollywood', 'stage', 'movies', 'actor', 'new', 'director', 'york', 'show']

However, even when the topics are relatively straightforward, as these topics seem to be, they’re still open to interpretation. For instance, we could just as easily label Topic 3 “Writing & Art,” Topic 6 “Government,” and Topic 10 “Film & Theater.” These subtle changes might shape the direction of our analysis and eventual argument in substantial ways. Topics can also be far more ambiguous than the examples above, which makes the work of interpretation even more significant. We’ll discuss the ambiguity of topics and topic labels in more depth in the next lessons.
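One simple way to keep track of labels in code is a dictionary that maps topic numbers to the names we chose. The sketch below is only a suggestion: the labels come from the examples above, while the variable names and the helper `describe` are hypothetical.

```python
# Labels are chosen by the researcher, not produced by the model.
# Two equally defensible labeling schemes for the same three topics.
labels_a = {3: "Literature", 6: "Politics", 10: "Hollywood"}
labels_b = {3: "Writing & Art", 6: "Government", 10: "Film & Theater"}

def describe(topic_number, labels):
    """Return a display name for a topic, falling back to its number."""
    return labels.get(topic_number, f"Topic {topic_number}")

print(describe(3, labels_a))  # Literature
print(describe(3, labels_b))  # Writing & Art
print(describe(7, labels_a))  # Topic 7
```

Keeping the labels separate from the model output makes it easy to revise them later without re-running the topic model.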