Topic Modeling — Set Up

In these lessons, we’re learning about a text analysis method called topic modeling. This method will help us identify the main topics or discourses within a collection of texts or within a single text that has been separated into smaller text chunks.

This page describes how to set up the packages and programs that you’ll need if you want to start topic modeling on your own computer. If you want to topic model without installing anything, however, you can skip ahead and explore these Jupyter notebook topic modeling lessons in the cloud. The notebooks already have the necessary requirements installed.

MALLET & Little MALLET Wrapper#

For our topic modeling analysis, we’re going to use a tool called MALLET. MALLET, short for MAchine Learning for LanguagE Toolkit, is a software package for topic modeling and other natural language processing techniques. It’s maintained by David Mimno, a Cornell professor in Information Science. Go Big Red!

MALLET is great, but it’s written in Java, a different programming language, which means that we have to install Java before we can use it. It also means that MALLET doesn’t plug directly into Python and Jupyter notebooks.

Luckily, another Cornellian, Maria Antoniak, a PhD student in Information Science, has written a convenient Python package that will allow us to use MALLET in this Jupyter notebook after we download and install Java. This package is called Little MALLET Wrapper.

Note: A “wrapper” is a Python package that makes complicated code easier to use and/or makes code from a different programming language accessible in Python.

Download and Install Java Development Kit#

But first, we have to install Java, specifically the Java Development Kit.

Go to the Java Development Kit download page, find your operating system, and click on the corresponding download link: https://www.oracle.com/java/technologies/javase-jdk14-downloads.html

Linux -> Linux Compressed Archive
Mac -> macOS Installer
Windows -> Windows x64 Installer

Then open or unzip the file and follow all the prompts. You can use all the suggested defaults.

Tell Your Computer Where to Find Java#

Now that we have the JDK installed, we have to tell our computers where to find it. For Mac, Chrome, and Linux users, we have to set up a special “environment” variable called JAVA_HOME and give it the file path where we just installed our Java Development Kit. For Windows users, we have to edit the special environment variable called PATH and add the file path of the JDK.

Note: “Environment” variables are kind of like Python variables, except they exist in your whole computer environment. The Launch School has a helpful chapter on environment variables and the PATH variable.
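
To make the comparison with Python variables concrete, here is a small sketch that reads environment variables from inside Python with the standard library’s `os.environ`. The helper name `describe_environment` is made up for this illustration, and whether JAVA_HOME is set at all depends on your machine.

```python
import os

def describe_environment(env=None):
    """Return the JAVA_HOME value (or None) and the list of folders on PATH."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")  # None if the variable is not set
    path_value = env.get("PATH", "")
    # PATH is one long string; os.pathsep is ":" on Mac/Linux and ";" on Windows
    path_folders = [folder for folder in path_value.split(os.pathsep) if folder]
    return java_home, path_folders

java_home, path_folders = describe_environment()
print("JAVA_HOME:", java_home)
print("Number of folders on PATH:", len(path_folders))
```

If JAVA_HOME prints as `None`, the variable has not been set for the process that started Python.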

Mac#

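To set up the JAVA_HOME environment variable on a Mac, you can run the following on the command line. The line of code adds your JAVA_HOME variable to a file called “.bash_profile”, which the Bash shell reads when it starts and which is where we store environment variables. If your Mac uses zsh as its default shell, as recent versions of macOS do, you may need to use “~/.zshrc” instead.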

!echo "export JAVA_HOME=$(/usr/libexec/java_home)" >> ~/.bash_profile

To apply the updated “.bash_profile” right away, run:

!source ~/.bash_profile
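
One caveat if you run these commands from a notebook cell rather than a terminal: each `!` command runs in its own temporary shell, so `source` there does not change the environment of the notebook that is already running. Restarting Jupyter from a fresh terminal picks up the new settings. As an alternative for the current session only, the sketch below sets the variables from Python. The function name `use_jdk_in_this_session` and the example path are hypothetical.

```python
import os

def use_jdk_in_this_session(java_home):
    """Set JAVA_HOME and put its bin folder first on PATH for this process only."""
    bin_dir = os.path.join(java_home, "bin")
    os.environ["JAVA_HOME"] = java_home
    # Programs started from this notebook (including `!` commands) inherit these values
    os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
    return bin_dir

# Example call (hypothetical path; replace it with your own JDK folder):
# use_jdk_in_this_session("/path/to/your/jdk")
```

These changes disappear when the notebook kernel restarts, so they complement the profile file rather than replace it.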

Then, to test whether Java is installed correctly, run javac on the command line. If you get a list of options, as below, then you’ve installed the JDK properly. If it says the command is not recognized, then you don’t have the JDK set up yet.

!javac

Usage: javac <options> <source files>
where possible options include:
  @<filename>                  Read options and filenames from file
  -Akey[=value]                Options to pass to annotation processors
  --add-modules <module>(,<module>)*
        Root modules to resolve in addition to the initial modules, or all modules
        on the module path if <module> is ALL-MODULE-PATH.
  --boot-class-path <path>, -bootclasspath <path>
        Override location of bootstrap class files
  --class-path <path>, -classpath <path>, -cp <path>
        Specify where to find user class files and annotation processors
  -d <directory>               Specify where to place generated class files
  -deprecation
        Output source locations where deprecated APIs are used
  --enable-preview
        Enable preview language features. To be used in conjunction with either -source or --release.
  -encoding <encoding>         Specify character encoding used by source files
  -endorseddirs <dirs>         Override location of endorsed standards path
  -extdirs <dirs>              Override location of installed extensions
  -g                           Generate all debugging info
  -g:{lines,vars,source}       Generate only some debugging info
  -g:none                      Generate no debugging info
  -h <directory>
        Specify where to place generated native header files
  --help, -help, -?            Print this help message
  --help-extra, -X             Print help on extra options
  -implicit:{none,class}
        Specify whether or not to generate class files for implicitly referenced files
  -J<flag>                     Pass <flag> directly to the runtime system
  --limit-modules <module>(,<module>)*
        Limit the universe of observable modules
  --module <module>(,<module>)*, -m <module>(,<module>)*
        Compile only the specified module(s), check timestamps
  --module-path <path>, -p <path>
        Specify where to find application modules
  --module-source-path <module-source-path>
        Specify where to find input source files for multiple modules
  --module-version <version>
        Specify version of modules that are being compiled
  -nowarn                      Generate no warnings
  -parameters
        Generate metadata for reflection on method parameters
  -proc:{none,only}
        Control whether annotation processing and/or compilation is done.
  -processor <class1>[,<class2>,<class3>...]
        Names of the annotation processors to run; bypasses default discovery process
  --processor-module-path <path>
        Specify a module path where to find annotation processors
  --processor-path <path>, -processorpath <path>
        Specify where to find annotation processors
  -profile <profile>
        Check that API used is available in the specified profile
  --release <release>
        Compile for the specified Java SE release. Supported releases: 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
  -s <directory>               Specify where to place generated source files
  --source <release>, -source <release>
        Provide source compatibility with the specified Java SE release. Supported releases: 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
  --source-path <path>, -sourcepath <path>
        Specify where to find input source files
  --system <jdk>|none          Override location of system modules
  --target <release>, -target <release>
        Generate class files suitable for the specified Java SE release. Supported releases: 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
  --upgrade-module-path <path>
        Override location of upgradeable modules
  -verbose                     Output messages about what the compiler is doing
  --version, -version          Version information
  -Werror                      Terminate compilation if warnings occur
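
The same check can be sketched in Python, which is handy if you would rather not scroll through the full options list. This is a minimal sketch using only the standard library; the helper name `check_javac` is made up for illustration, and the version text may arrive on standard output or standard error depending on the JDK.

```python
import shutil
import subprocess

def check_javac(command="javac"):
    """Return (path, version_text); both are None if the command is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None, None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    if result.returncode != 0:
        # The command exists but could not run (for example, no JDK behind it)
        return path, None
    version_text = (result.stdout or result.stderr).strip()
    return path, version_text

path, version_text = check_javac()
if path is None:
    print("javac was not found on your PATH")
elif version_text is None:
    print("javac was found at", path, "but did not run successfully")
else:
    print("javac found at", path, "->", version_text)
```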

Windows#

To edit the PATH variable on a Windows computer, follow the instructions below:

- Open Search and type "advanced system settings"
- In the shown options, select the "View advanced system settings" link
- Under the Advanced tab, click "Environment Variables"
- Under "System variables," click the variable "PATH" and then click "Edit"
- Click "New" and add the file path to the JDK's "bin" folder (e.g. `C:\Program Files\Java\jdk13.0.2\bin`; the exact folder name depends on the version you installed)

For more Windows installation help, see Prof. Paul Vierthaler’s video tutorial “Practical Python for DH: Topic Modeling Software Install”.

Now restart PowerShell. To test whether Java is installed, run javac in PowerShell. If you get a list of options, then you’ve installed the JDK properly. If it says the command is not recognized, then the JDK’s bin folder is not on your PATH yet.

!javac

Chrome / Linux#

To set up the JAVA_HOME environment variable on a Linux machine or a Chrome computer running Linux, you can run the following on the command line. The line of code adds your JAVA_HOME variable to a file called “.bashrc”, which is where environment variables are stored.

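Make sure to change /fill-in-the-path/to/your-java_installation below to the file path where your JDK actually exists. JAVA_HOME should point to the JDK’s top-level folder (the one that contains “bin”), not to the “bin” folder itself:

!echo "export JAVA_HOME=/fill-in-the-path/to/your-java_installation" >> ~/.bashrc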

To apply the updated “.bashrc” right away, run:

!source ~/.bashrc

To test whether Java is installed, run javac on the command line. If you get a list of options, like the one shown in the Mac section above, then you’ve installed the JDK properly. If it says the command is not recognized, then the JDK’s bin folder is not on your PATH yet.

!javac
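
As a quick sanity check on the path you fill in, a JDK folder normally contains a “bin” subfolder with the javac program inside it. The sketch below tests a candidate folder for that layout. The helper name `looks_like_jdk_home` is hypothetical, and the candidate path is just the placeholder from the instructions above.

```python
from pathlib import Path

def looks_like_jdk_home(candidate):
    """True if the folder contains bin/javac (or bin/javac.exe on Windows)."""
    bin_dir = Path(candidate).expanduser() / "bin"
    return (bin_dir / "javac").is_file() or (bin_dir / "javac.exe").is_file()

# Placeholder path; replace it with the folder where you unpacked the JDK
candidate = "/fill-in-the-path/to/your-java_installation"
print(candidate, "looks like a JDK folder:", looks_like_jdk_home(candidate))
```

If this prints `False` for the folder you expected, check whether you pointed at the “bin” folder itself or at a parent folder one level too high.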

Download and Unzip MALLET#

Now we need to download the MALLET package. To download MALLET, click the following link, http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip, or find the link on the MALLET home page. Once the zip file downloads, unzip it.

If you’re using a Mac, move the “mallet-2.0.8” directory into your home folder.

Note: To open your “home” folder, open “Finder” and type Cmd + Shift + H. To move one directory up, type Cmd + ↑. Now, if you want to bookmark your home folder so you can find it more easily in the future, simply drag and drop your home folder to the sidebar.

If you’re using a Windows computer, move the “mallet-2.0.8” directory into your C:\ drive.

Heads Up Windows Users!#

You need to complete one more step: telling your computer where MALLET is located, much as you did for Java.

- Open Search and type “advanced system settings”
- In the shown options, select the View advanced system settings link
- Under the Advanced tab, click “Environment Variables”
- In the User variables section, click “New”
- For the Variable name, type MALLET_HOME. For the Value, type the path to your MALLET: C:\mallet-2.0.8. Then click OK
- Click OK and click Apply to apply the changes

For more Windows help, see Prof. Paul Vierthaler’s topic modeling tutorial.

To test whether MALLET works on your computer, type the file path to the MALLET program on the command line, followed by the command import-file.

If it’s working, then you’ll get a message that says “A tool for creating instance lists of feature vectors from comma-separated-values” and a list of options.

!~/mallet-2.0.8/bin/mallet import-file

A tool for creating instance lists of feature vectors from comma-separated-values
--help TRUE|FALSE
  Print this command line option usage information.  Give argument of TRUE for longer documentation
  Default is false
--prefix-code 'JAVA CODE'
  Java code you want run before any other interpreted code.  Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
  Default is null
--config FILE
  Read command option values from a file
  Default is null
--input FILE
  The file containing data to be classified, one instance per line
  Default is null
--output FILE
  Write the instance list to this file; Using - indicates stdout.
  Default is text.vectors
--line-regex REGEX
  Regular expression containing regex-groups for label, name and data.
  Default is ^(\S*)[\s,]*(\S*)[\s,]*(.*)$
--label INTEGER
  The index of the group containing the label string.
   Use 0 to indicate that the label field is not used.
  Default is 2
--name INTEGER
  The index of the group containing the instance name.
   Use 0 to indicate that the name field is not used.
  Default is 1
--data INTEGER
  The index of the group containing the data.
  Default is 3
--use-pipe-from FILE
  Use the pipe and alphabets from a previously created vectors file.
   Allows the creation, for example, of a test set of vectors that are
   compatible with a previously created set of training vectors
  Default is text.vectors
--keep-sequence [TRUE|FALSE]
  If true, final data will be a FeatureSequence rather than a FeatureVector.
  Default is false
--keep-sequence-bigrams [TRUE|FALSE]
  If true, final data will be a FeatureSequenceWithBigrams rather than a FeatureVector.
  Default is false
--label-as-features [TRUE|FALSE]
  If true, parse the 'label' field as space-delimited features.
     Use feature=[number] to specify values for non-binary features.
  Default is false
--remove-stopwords [TRUE|FALSE]
  If true, remove a default list of common English "stop words" from the text.
  Default is false
--replacement-files FILE [FILE ...]
  files containing string replacements, one per line:
    'A B [tab] C' replaces A B with C,
    'A B' replaces A B with A_B
  Default is (null)
--deletion-files FILE [FILE ...]
  files containing strings to delete after replacements but before tokenization (ie multiword stop terms)
  Default is (null)
--stoplist-file FILE
  Instead of the default list, read stop words from a file, one per line. Implies --remove-stopwords
  Default is null
--extra-stopwords FILE
  Read whitespace-separated words from this file, and add them to either 
   the default English stoplist or the list specified by --stoplist-file.
  Default is null
--stop-pattern-file FILE
  Read regular expressions from a file, one per line. Tokens matching these regexps will be removed.
  Default is null
--preserve-case [TRUE|FALSE]
  If true, do not force all strings to lowercase.
  Default is false
--encoding STRING
  Character encoding for input file
  Default is UTF-8
--token-regex REGEX
  Regular expression used for tokenization.
   Example: "[\p{L}\p{N}_]+|[\p{P}]+" (unicode letters, numbers and underscore OR all punctuation) 
  Default is \p{L}[\p{L}\p{P}]+\p{L}
--print-output [TRUE|FALSE]
  If true, print a representation of the processed data
   to standard output. This option is intended for debugging.
  Default is false
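
Before running the command, it can help to confirm that the MALLET folder ended up where these instructions expect it. The sketch below builds the expected launcher path and checks that the file exists. The function name `default_mallet_command` is made up for illustration, and the Windows branch assumes the launcher is “bin\mallet.bat” inside C:\mallet-2.0.8; adjust the paths if you unzipped MALLET somewhere else.

```python
import os
from pathlib import Path

def default_mallet_command():
    """Build the expected MALLET launcher path for this operating system."""
    if os.name == "nt":  # Windows
        # Assumption: the Windows launcher is mallet.bat in the bin folder
        return Path("C:/mallet-2.0.8/bin/mallet.bat")
    # Mac/Linux: the folder was moved into the home folder
    return Path.home() / "mallet-2.0.8" / "bin" / "mallet"

mallet_command = default_mallet_command()
print("Expected MALLET launcher:", mallet_command)
print("Found:", mallet_command.is_file())
if os.name == "nt":
    print("MALLET_HOME:", os.environ.get("MALLET_HOME"))
```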

Install Little MALLET Wrapper#

Finally, we’re going to install the Python package little_mallet_wrapper. To install it, run pip install little_mallet_wrapper, as below.

!pip install little_mallet_wrapper

Since Little MALLET Wrapper also uses the data visualization library seaborn, we’re also going to pip install seaborn:

!pip install seaborn
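
To confirm that both installs succeeded, you could run a check like the one below. It only asks Python whether each package can be found and does not call anything inside the packages. The helper name `check_installed` is made up for this sketch.

```python
import importlib.util

def check_installed(package_names):
    """Map each package name to True if Python can find it, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }

status = check_installed(["little_mallet_wrapper", "seaborn"])
for name, found in status.items():
    print(name, "is installed" if found else "is NOT installed")
```

If a package shows as not installed, make sure pip installed it into the same Python environment that your notebook is using.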