Working in Languages Beyond English¶
This section, “Working in Languages Beyond English,” is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I’m grateful to Quinn for helping expand this textbook to serve languages beyond English.
Most of the tools and tutorials you’ll find for computational text analysis assume that you’re working with English language text. This section is dedicated to helping students and scholars accomplish text analysis tasks in languages beyond English. Select lessons from the “Text Analysis” chapter have been adapted for the following languages and tasks:
If you’re interested in adding support for another language, please reach out at email@example.com!
Two Kinds of Text Analysis¶
The steps you need to take to analyze a language beyond English will depend on the kind of text analysis method that you are interested in using. The methods introduced in this chapter can be broadly organized into two groups:
Methods based on word counts, such as TF-IDF and topic modeling
Methods that use language-specific NLP models, such as named entity recognition and part-of-speech tagging
There are more resources to support non-English text analysis in the first group of methods than in the second group:
To apply the first group of methods to non-English texts, you will need to pre-process your texts — in other words, to create a derivative version of your text that will work better with these tools.
To apply the second group of methods to non-English texts, you will need to find a language-specific version of the NLP models.
Unfortunately, for most of the roughly 6,500 languages spoken in the world, there are currently few, if any, language-specific tools or resources to support computational analysis. In fact, of the 100 languages with the greatest number of speakers, at least two-thirds are missing the tools you’ll need to complete all the activities in this section of the textbook.
1. Text Analysis Based on Word Counts: Pre-Processing Your Texts¶
The pre-processing steps needed to make texts in other languages usable with computational text analysis methods vary depending on the language. For example, some languages, such as Chinese, do not separate words with spaces, and texts in these languages will need to have artificial spaces inserted before text analysis. You can find an example of how to insert white spaces into Chinese language texts in “Text Pre-Processing for Chinese.”
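To make the idea of inserting artificial spaces concrete, here is a minimal sketch that uses forward maximum matching against a tiny, hand-made word list. The `TOY_LEXICON`, the `segment` function, and the example sentence are all assumptions made up for illustration. This is not the method used in the Chinese lesson, and a real project would rely on a dedicated segmentation tool with a far larger dictionary.

```python
# A toy word list (assumption: a real segmenter would know many thousands of words).
TOY_LEXICON = {"我", "喜欢", "数字", "人文"}


def segment(text, lexicon, max_len=4):
    """Split text into words by always taking the longest match in the lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop cannot stall.
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


sentence = "我喜欢数字人文"  # "I like digital humanities"
tokens = segment(sentence, TOY_LEXICON)
spaced = " ".join(tokens)
print(spaced)  # 我 喜欢 数字 人文
```

Once the spaces are in place, word-count tools that split on whitespace can treat each segment as a separate word. The result is only as good as the word list, which is one reason segmentation choices deserve attention before counting.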
Other languages with more inflection than English (i.e., where words appear in different forms depending on how they’re used) need to be lemmatized, which replaces every variant word form with its dictionary form, or stemmed, which cuts off the inflection at the end of the word. Lemmatizing or stemming usually (but not always) leaves you with something resembling the root. For example, in Spanish, hablar (“to speak”) and its inflected forms hablo (“I speak”) and hablas (“you speak”) all become habl when stemmed. You can find an example of how to lemmatize a text in “Text Pre-Processing for Spanish.”
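The effect of stemming on word counts can be sketched with a deliberately crude suffix stripper. The `TOY_SUFFIXES` list and the `toy_stem` function below are assumptions invented for this illustration and cover only a handful of verb endings. In practice you would more likely reach for an established tool, such as the Snowball stemmer for Spanish available in NLTK, or a lemmatizer.

```python
from collections import Counter

# A few Spanish verb endings, longest first (assumption: far from complete).
TOY_SUFFIXES = ("amos", "ar", "as", "an", "a", "o")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ["hablar", "hablo", "hablas", "hablamos"]

raw_counts = Counter(words)                         # four different "words"
stem_counts = Counter(toy_stem(w) for w in words)   # one shared stem

print(raw_counts)
print(stem_counts)  # Counter({'habl': 4})
```

Without stemming, each inflected form is counted separately. After stemming, all four occurrences are counted together, which is usually what word-count methods like TF-IDF and topic modeling need.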
The situation is even more complicated for languages known as agglutinative languages, in which words are formed by repeatedly gluing together morphemes, or small bits of meaning. In agglutinative languages, a single “word” can be translated as an entire English sentence. How would you reduce a word like Turkish Çekoslovakyalılaştıramadıklarımızdanmışsınız — which, in English, means, “you are reportedly one of those that we could not make Czechoslovakian” — down to a root that you could count?
When doing text analysis in English, you can do things like word frequency without thinking too much about questions like “what, actually, is a word?” However, the ways you have to modify text in many other languages to make it compatible with computational text analysis — even to the point of harming human readability — mean that you have to grapple with this question more directly when working with other languages.
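A short sketch shows why the question comes up so quickly. Splitting on white space is a reasonable first approximation of "words" in English, but the same operation applied to an unsegmented Chinese sentence returns the whole sentence as a single "word." Both example sentences are made up for illustration.

```python
from collections import Counter

english = "the cat saw the dog"
chinese = "我喜欢数字人文"  # "I like digital humanities"

# str.split() with no arguments splits on runs of white space.
english_tokens = english.split()
chinese_tokens = chinese.split()

english_counts = Counter(english_tokens)

print(english_counts.most_common(1))  # [('the', 2)]
print(len(chinese_tokens))            # 1
```

The English counts look sensible without any extra work, while the Chinese sentence needs a decision about where words begin and end before it can be counted at all.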
2. Text Analysis Based on NLP Models: Finding an NLP Model for Your Language¶
To perform named entity recognition and part-of-speech tagging on English-language texts in this book, we use a natural language processing model from the spaCy library. This model was trained on a large collection of carefully labeled English-language texts, specifically “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech.” The labels, which were often created and corrected by hand, identify where people and places, verbs and nouns, subjects and objects, etc. appear in the texts, which helps the model learn how to recognize these entities on its own. Like a lot of other major machine learning projects, this labeled corpus, called “OntoNotes,” was funded by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.
To use spaCy (or any similar tool) for named entity recognition or part-of-speech tagging in another language, you need an NLP model that has been specifically trained on that language. Because these models require a large number of training texts and a lot of resources, money, and time to create, it can be difficult to find a high-quality NLP model for a given language. This has led to big disparities in the global NLP and digital humanities communities.
However, spaCy has released NLP models for languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Danish, Greek, Norwegian, and Lithuanian. You can find examples of how to use spaCy models for Chinese, Danish, Portuguese, Russian, and Spanish in this book.
Though we are happy to make these tutorials available, as you will see, most of these models do not perform as well as the English-language spaCy model. Developing better tools for languages beyond English is one of the most important and urgent tasks for NLP and the digital humanities.
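As a rough sketch of what using a language-specific model looks like in practice, the code below maps a few language codes to spaCy's small pretrained pipelines and tries to load one. The `MODEL_NAMES` dictionary, the `load_model` helper, and the example sentence are assumptions made for illustration. The sketch also assumes that spaCy and the relevant model package have been installed separately, and it returns `None` rather than failing if they have not.

```python
# Small spaCy pipelines for the languages covered in this book's examples.
MODEL_NAMES = {
    "zh": "zh_core_web_sm",
    "da": "da_core_news_sm",
    "pt": "pt_core_news_sm",
    "ru": "ru_core_news_sm",
    "es": "es_core_news_sm",
}


def load_model(lang_code):
    """Return a loaded spaCy pipeline, or None if spaCy or the model is missing."""
    if lang_code not in MODEL_NAMES:
        raise ValueError(f"No model listed for language code: {lang_code}")
    try:
        import spacy
    except ImportError:
        return None  # spaCy itself is not installed
    try:
        return spacy.load(MODEL_NAMES[lang_code])
    except OSError:
        return None  # spaCy is installed, but this model package is not


if __name__ == "__main__":
    nlp = load_model("es")
    if nlp is None:
        print("Install spaCy and the Spanish model to run this example.")
    else:
        doc = nlp("Gabriel García Márquez nació en Aracataca, Colombia.")
        # Named entities found by the Spanish model.
        for ent in doc.ents:
            print(ent.text, ent.label_)
        # Part-of-speech tag for each token.
        for token in doc:
            print(token.text, token.pos_)
```

Swapping `"es"` for another code in the dictionary loads a different language's model while the rest of the code stays the same. What changes from language to language is the quality of the labels that come back.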