{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modeling — With Tomotopy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In these lessons, we're learning about a text analysis method called *topic modeling*. This method will help us identify the main topics or discourses within a collection of texts or single text that has been separated into smaller text chunks.\n", "\n", "In this particular lesson, we're going to use [Tomotopy](https://github.com/bab2min/tomotopy) to topic model 379 obituaries published by *The New York Times*.\n", "\n", "While Mallet is a fantastic tool that is widely embraced throughout the DH community, it can also pose challenges for scholars because it requires the installation and configuration of Mallet/the Java Development Kit. Tomotopy is a topic modeling tool that is written purely in Python, and it seems to be a good alternative to Mallet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *New York Times* Obituaries\n", "\n", "
\n", "\n", " Georgia O'Keeffe, the undisputed doyenne of American painting and a leader, with her husband, Alfred Stieglitz, of a crucial phase in the development and dissemination of American modernism, died yesterday at St. Vincent Hospital in Santa Fe, N.M.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset is based on data originally collected by Matt Lavin for his *Programming Historian* [TF-IDF tutorial](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset). I have re-scraped the obituaries so that the subject's name and death year is included in each text file name, and I have added 13 more [\"Overlooked\"](https://www.nytimes.com/interactive/2018/obituaries/overlooked.html) obituaries, including [Karen Spärck Jones](https://www.nytimes.com/2019/01/02/obituaries/karen-sparck-jones-overlooked.html), the computer scientist who introduced TF-IDF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting tomotopy\n", " Obtaining dependency information for tomotopy from https://files.pythonhosted.org/packages/08/fb/a2dbd672ff5858834c20dae32b6ed5deaa7c5da8f5e4733b1202eaa3dd6f/tomotopy-0.12.7-cp311-cp311-macosx_11_0_arm64.whl.metadata\n", " Downloading tomotopy-0.12.7-cp311-cp311-macosx_11_0_arm64.whl.metadata (29 kB)\n", "Requirement already satisfied: numpy>=1.11.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from tomotopy) (1.24.3)\n", "Downloading tomotopy-0.12.7-cp311-cp311-macosx_11_0_arm64.whl (3.4 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.4/3.4 MB\u001b[0m \u001b[31m12.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hInstalling collected packages: tomotopy\n", "Successfully installed tomotopy-0.12.7\n" ] } ], "source": [ "!pip install tomotopy" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting little_mallet_wrapper\n", " Obtaining dependency information for little_mallet_wrapper from https://files.pythonhosted.org/packages/e3/01/7e8561e33e79b408d9526b22b50e20bfdd8e551979237ad5c972759fe7d8/little_mallet_wrapper-0.5.0-py3-none-any.whl.metadata\n", " Downloading little_mallet_wrapper-0.5.0-py3-none-any.whl.metadata (13 kB)\n", "Downloading little_mallet_wrapper-0.5.0-py3-none-any.whl (19 kB)\n", "Installing collected packages: little_mallet_wrapper\n", "Successfully installed little_mallet_wrapper-0.5.0\n" ] } ], "source": [ "!pip install little_mallet_wrapper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's import `tomotopy`, `little_mallet_wrapper` and the data viz library `seaborn`.\n", "\n", "We're also going to import [`glob`](https://docs.python.org/3/library/glob.html) and [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) for working with files and the file system." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import tomotopy as tp\n", "import little_mallet_wrapper\n", "import seaborn\n", "import glob\n", "from pathlib import Path\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Training Data From Text Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we topic model the *NYT* obituaries, we need to process the text files and prepare them for analysis. The steps below demonstrate how to process texts if your corpus is a collection of separate text files. In the next lesson, we'll demonstrate how to process texts that come from a CSV file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", " \n", "We're calling these text files our *training data*, because we're *training* our topic model with these texts. The topic model will be learning and extracting topics based on these texts.\n", " \n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the necessary text files, we're going to make a variable and assign it the file path for the directory that contains the text files." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "directory = \"../texts/history/NYT-Obituaries/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we're going to use the `glob.gob()` function to make a list of all (`*`) the `.txt` files in that directory." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "files = glob.glob(f\"{directory}/*.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we process our texts with the function `little_mallet_wrapper.process_string()`.\n", "\n", "This function will take every individual text file, transform all the text to lowercase as well as remove stopwords, punctuation, and numbers, and then add the processed text to our master list `training_data`." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "training_data = []\n", "original_texts = []\n", "titles = []\n", "\n", "for file in files:\n", " text = open(file, encoding='utf-8').read()\n", " processed_text = little_mallet_wrapper.process_string(text, numbers='remove')\n", " training_data.append(processed_text)\n", " original_texts.append(text)\n", " titles.append(Path(file).stem)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(379, 379, 379)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(training_data), len(original_texts), len(titles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Topic Model" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic Model Training...\n", "\n", "\n", "Iteration: 0\tLog-likelihood: -10.14774802681711\n", "Iteration: 10\tLog-likelihood: -9.788526732014015\n", "Iteration: 20\tLog-likelihood: -9.648460711087514\n", "Iteration: 30\tLog-likelihood: -9.575582575358267\n", "Iteration: 40\tLog-likelihood: -9.525946364224987\n", "Iteration: 50\tLog-likelihood: -9.49530684718902\n", "Iteration: 60\tLog-likelihood: -9.463579226704619\n", "Iteration: 70\tLog-likelihood: -9.437242753407531\n", "Iteration: 80\tLog-likelihood: -9.4184855871778\n", "Iteration: 90\tLog-likelihood: -9.401415644469493\n", "\n", "Topic Model Results:\n", "\n", "\n", "✨Topic 0✨\n", "\n", "miss company university years work institute ford oil research new\n", "\n", "✨Topic 1✨\n", "\n", "king british war peace israel said minister would prime first\n", "\n", "✨Topic 2✨\n", "\n", "said first became mother children time worked called like died\n", "\n", "✨Topic 3✨\n", "\n", "time made one years great last came death mrs year\n", "\n", "✨Topic 4✨\n", "\n", "general grant gen friends made smith city service office days\n", "\n", "✨Topic 5✨\n", "\n", "world france would german french hitler man time germany also\n", "\n", "✨Topic 6✨\n", "\n", "won movie broadway films hollywood two movies actor one film\n", "\n", "✨Topic 7✨\n", "\n", "music band jazz musical piano sinatra composer new goodman played\n", "\n", "✨Topic 8✨\n", "\n", "said also white black years states would people national united\n", "\n", "✨Topic 9✨\n", "\n", "war soviet united general army party states military communist mao\n", "\n", "✨Topic 10✨\n", "\n", "miss stage new theater york years love director married film\n", "\n", "✨Topic 11✨\n", "\n", "president state roosevelt house court law justice congress political party\n", "\n", "✨Topic 12✨\n", "\n", "book wrote published work one life new years art books\n", "\n", "✨Topic 13✨\n", "\n", "later many one country new became war may years people\n", "\n", "✨Topic 14✨\n", "\n", "new times york could man one good would men business\n", "\n" ] } ], "source": [ "# Number of topics to return\n", "num_topics = 15\n", "# Numer of topic words to print out\n", "num_topic_words = 10\n", "\n", "# Intialize the model\n", "model = tp.LDAModel(k=num_topics)\n", "\n", "# Add each document to the model, after splitting it up into words\n", "for text in training_data:\n", " model.add_doc(text.strip().split())\n", " \n", "print(\"Topic Model Training...\\n\\n\")\n", "# Iterate over the data 10 times\n", "iterations = 10\n", "for i in range(0, 100, iterations):\n", " model.train(iterations)\n", " print(f'Iteration: {i}\\tLog-likelihood: 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Examine Top Documents and Titles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the topic distributions for every document" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "topic_distributions = [list(doc.get_topic_dist()) for doc in model.docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make functions for displaying top documents. The `get_top_docs()` function is taken from Maria Antoniak's [Little Mallet Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper/blob/c89bfbeddb11ddc2a6874476985275a7b2a6c1fd/little_mallet_wrapper/little_mallet_wrapper.py#L164)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Markdown, display\n", "import re\n", "\n", "def make_md(string):\n", "    display(Markdown(str(string)))\n", "\n", "def get_top_docs(docs, topic_distributions, topic_index, n=5):\n", "    \n", "    # Pair each document with its probability for this topic, highest first\n", "    sorted_data = sorted([(_distribution[topic_index], _document) \n", "                          for _distribution, _document \n", "                          in zip(topic_distributions, docs)], reverse=True)\n", "    \n", "    topic_words = topics[topic_index]\n", "    \n", "    make_md(f\"### ✨Topic {topic_index}✨\\n\\n{topic_words}\\n\\n\")\n", "    print(\"---\")\n", "    \n", "    for probability, doc in sorted_data[:n]:\n", "        # Make topic words bolded\n", "        for word in topic_words.split():\n", "            if word in doc.lower():\n", "                doc = re.sub(f\"\\\\b{word}\\\\b\", f\"**{word}**\", doc, flags=re.IGNORECASE)\n", "        \n", "        make_md(f'✨  \\n**Topic Probability**: {probability}  \\n**Document**: {doc}\\n\\n')\n", "        \n", "    return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the top titles for a given topic" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "### ✨Topic 0✨\n", "\n", "miss company university years work institute ford oil research new\n", "\n" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "get_top_docs(titles, topic_distributions, topic_index=0, n=5)" ] },
\n", " | document | \n", "Topic 0 miss company university years | \n", "Topic 1 king british war peace | \n", "Topic 2 said first became mother | \n", "Topic 3 time made one years | \n", "Topic 4 general grant gen friends | \n", "Topic 5 world france would german | \n", "Topic 6 won movie broadway films | \n", "Topic 7 music band jazz musical | \n", "Topic 8 said also white black | \n", "Topic 9 war soviet united general | \n", "Topic 10 miss stage new theater | \n", "Topic 11 president state roosevelt house | \n", "Topic 12 book wrote published work | \n", "Topic 13 later many one country | \n", "Topic 14 new times york could | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1945-Adolf-Hitler | \n", "0.003998 | \n", "0.052907 | \n", "0.038183 | \n", "0.090631 | \n", "0.041851 | \n", "0.272187 | \n", "0.005270 | \n", "0.003061 | \n", "0.032411 | \n", "0.209365 | \n", "0.002381 | \n", "0.038918 | \n", "0.003880 | \n", "0.159284 | \n", "0.045673 | \n", "
1 | \n", "1915-F-W-Taylor | \n", "0.382229 | \n", "0.006007 | \n", "0.040814 | \n", "0.011261 | \n", "0.108247 | \n", "0.010732 | \n", "0.017462 | \n", "0.004687 | \n", "0.064084 | \n", "0.002045 | \n", "0.025165 | \n", "0.045062 | \n", "0.080865 | \n", "0.056370 | \n", "0.144970 | \n", "
2 | \n", "1975-Chiang-Kai-shek | \n", "0.003562 | \n", "0.017498 | \n", "0.048794 | \n", "0.041924 | \n", "0.049930 | \n", "0.051868 | \n", "0.002500 | \n", "0.000403 | \n", "0.071439 | \n", "0.425720 | \n", "0.000832 | \n", "0.040835 | \n", "0.022936 | \n", "0.178661 | \n", "0.043098 | \n", "
3 | \n", "1984-Ethel-Merman | \n", "0.002241 | \n", "0.002951 | \n", "0.087714 | \n", "0.121265 | \n", "0.014088 | \n", "0.030561 | \n", "0.233964 | \n", "0.060015 | \n", "0.100188 | \n", "0.000439 | \n", "0.216502 | \n", "0.003853 | \n", "0.004887 | \n", "0.027888 | \n", "0.093444 | \n", "
4 | \n", "1953-Jim-Thorpe | \n", "0.015850 | \n", "0.021086 | \n", "0.068527 | \n", "0.131478 | \n", "0.035537 | \n", "0.011617 | \n", "0.449206 | \n", "0.006698 | \n", "0.081478 | \n", "0.021061 | \n", "0.002694 | \n", "0.029805 | \n", "0.005263 | \n", "0.032954 | \n", "0.086747 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
374 | \n", "1987-Andres-Segovie | \n", "0.006756 | \n", "0.015406 | \n", "0.095275 | \n", "0.110522 | \n", "0.014879 | \n", "0.039300 | \n", "0.000447 | \n", "0.281958 | \n", "0.051393 | \n", "0.010997 | \n", "0.171608 | \n", "0.013725 | \n", "0.049929 | \n", "0.061056 | \n", "0.076749 | \n", "
375 | \n", "1987-Rita-Hayworth | \n", "0.010537 | \n", "0.047327 | \n", "0.111429 | \n", "0.086836 | \n", "0.021681 | \n", "0.026919 | \n", "0.158125 | \n", "0.000228 | \n", "0.078838 | \n", "0.008181 | \n", "0.324379 | \n", "0.012644 | \n", "0.028124 | \n", "0.048419 | \n", "0.036332 | \n", "
376 | \n", "1993-William-Golding | \n", "0.047350 | \n", "0.042668 | \n", "0.101566 | \n", "0.163478 | \n", "0.036993 | \n", "0.054060 | \n", "0.046941 | \n", "0.000459 | \n", "0.084125 | \n", "0.012064 | \n", "0.031640 | \n", "0.003569 | \n", "0.305274 | \n", "0.055772 | \n", "0.014039 | \n", "
377 | \n", "1932-Florenz-Ziegfeld | \n", "0.029770 | \n", "0.018054 | \n", "0.061188 | \n", "0.204121 | \n", "0.038766 | \n", "0.044261 | \n", "0.213617 | \n", "0.036263 | \n", "0.062639 | \n", "0.000760 | \n", "0.078390 | \n", "0.003791 | \n", "0.025719 | \n", "0.074158 | \n", "0.108504 | \n", "
378 | \n", "1938-Constantin-Stanislavsky | \n", "0.021461 | \n", "0.007131 | \n", "0.080755 | \n", "0.123638 | \n", "0.012458 | \n", "0.076418 | \n", "0.082275 | \n", "0.001896 | \n", "0.044730 | \n", "0.071337 | \n", "0.179404 | \n", "0.000996 | \n", "0.132997 | \n", "0.108985 | \n", "0.055519 | \n", "
379 rows × 16 columns
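{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to keep these document-topic probabilities for later analysis, you could save the DataFrame to a CSV file (the filename below is just an example):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save the document-topic probabilities to a CSV file\n", "topic_results.to_csv('NYT-Obituaries-topic-distributions.csv', index=False)" ] },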
\n", "\n", " | document | \n", "Topic 0 miss company university years | \n", "Topic 1 king british war peace | \n", "Topic 2 said first became mother | \n", "Topic 3 time made one years | \n", "Topic 4 general grant gen friends | \n", "Topic 5 world france would german | \n", "Topic 6 won movie broadway films | \n", "Topic 7 music band jazz musical | \n", "Topic 8 said also white black | \n", "Topic 9 war soviet united general | \n", "Topic 10 miss stage new theater | \n", "Topic 11 president state roosevelt house | \n", "Topic 12 book wrote published work | \n", "Topic 13 later many one country | \n", "Topic 14 new times york could | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
348 | \n", "1983-Earl-Hines | \n", "0.016744 | \n", "0.001620 | \n", "0.098582 | \n", "0.111423 | \n", "0.002794 | \n", "0.005206 | \n", "0.016193 | \n", "0.535578 | \n", "0.083910 | \n", "0.016249 | \n", "0.016160 | \n", "0.001868 | \n", "0.031981 | \n", "0.036905 | \n", "0.024788 | \n", "
353 | \n", "1993-Dizzy-Gillespie | \n", "0.006825 | \n", "0.012989 | \n", "0.124421 | \n", "0.094888 | \n", "0.028999 | \n", "0.024149 | \n", "0.037285 | \n", "0.511854 | \n", "0.033391 | \n", "0.000414 | \n", "0.036492 | \n", "0.000499 | \n", "0.021880 | \n", "0.029474 | \n", "0.036441 | \n", "
235 | \n", "1991-Miles-Davis | \n", "0.000567 | \n", "0.019191 | \n", "0.149557 | \n", "0.057779 | \n", "0.021961 | \n", "0.027519 | \n", "0.029750 | \n", "0.463009 | \n", "0.079487 | \n", "0.000430 | \n", "0.061520 | \n", "0.000518 | \n", "0.000717 | \n", "0.037121 | \n", "0.050874 | \n", "
164 | \n", "1984-Count-Basie | \n", "0.008501 | \n", "0.002028 | \n", "0.137130 | \n", "0.105632 | \n", "0.035726 | \n", "0.007766 | \n", "0.015505 | \n", "0.427758 | \n", "0.099782 | \n", "0.005188 | \n", "0.040135 | \n", "0.014811 | \n", "0.010237 | \n", "0.019516 | \n", "0.070285 | \n", "
288 | \n", "1986-Benny-Goodman | \n", "0.005479 | \n", "0.006280 | \n", "0.092585 | \n", "0.063830 | \n", "0.011631 | \n", "0.032268 | \n", "0.052006 | \n", "0.423876 | \n", "0.110282 | \n", "0.007996 | \n", "0.071421 | \n", "0.017105 | \n", "0.014621 | \n", "0.030448 | \n", "0.060173 | \n", "
29 | \n", "1983-Muddy-Waters | \n", "0.005819 | \n", "0.016610 | \n", "0.162999 | \n", "0.058103 | \n", "0.037705 | \n", "0.007511 | \n", "0.013396 | \n", "0.412791 | \n", "0.133101 | \n", "0.002406 | \n", "0.044864 | \n", "0.019892 | \n", "0.001386 | \n", "0.021340 | \n", "0.062079 | \n", "