# Text Pre-Processing for Russian

:::{note}
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
:::

This lesson is for anyone who wants to try the [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF.html) or [topic modeling](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling.html) lessons on Russian texts. Before continuing with those lessons, you need to create a *lemmatized derivative* of your original Russian text: a copy in which every word is replaced with its dictionary form. This version works much better with word-count-based methods.

## Install spaCy
Russian models are only available starting in spaCy 3.0. 

If you run into errors because spaCy 2.x is installed, you can run `!pip uninstall spacy -y` first, then run the cell below.

In [None]:
!pip install -U "spacy>=3.0"
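
If you are unsure which spaCy version ended up installed, a quick check along these lines may help. This is only a sketch: `major_version` and `spacy_is_recent_enough` are hypothetical helper names introduced here, and the check relies on Python's standard `importlib.metadata` module rather than on anything spaCy-specific.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '3.7.2'."""
    return int(version_string.split(".")[0])


def spacy_is_recent_enough(minimum_major=3):
    """Report whether an installed spaCy meets the minimum major version."""
    try:
        installed = metadata.version("spacy")
    except metadata.PackageNotFoundError:
        # spaCy is not installed in this environment at all
        return False
    return major_version(installed) >= minimum_major


print(spacy_is_recent_enough())
```

If this prints `False`, the uninstall-and-reinstall step described above is the likely next move.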

## Download Language Model

In [None]:
!python -m spacy download ru_core_news_md

## Import Libraries

In [None]:
import spacy

## Load Language Model
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

In [11]:
import ru_core_news_md
nlp = ru_core_news_md.load()

2. We can load the model by name.

In [None]:
#nlp = spacy.load('ru_core_news_md')

If you just downloaded the model for the first time, it's advisable to use Option 1, which lets you use the model immediately. With Option 2, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).
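
Before picking an option, it can be useful to confirm that the model package is visible to the running kernel at all. The sketch below uses only the standard library; `model_is_importable` is a hypothetical helper name, and the check says nothing about whether the model version matches your spaCy version.

```python
import importlib.util


def model_is_importable(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


MODEL_NAME = "ru_core_news_md"
if model_is_importable(MODEL_NAME):
    print(f"{MODEL_NAME} is visible to this kernel.")
else:
    print(f"{MODEL_NAME} was not found; run the download cell first.")
```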

## Process Document
To create a derivative text file that we can use with TF-IDF, topic modeling, or other word-count based methods, we need to use spaCy to *lemmatize* the text, replacing each word with its dictionary form. The result will be an ungrammatical text that will produce better results than the original version when used with word-count methods.

The example text for Russian is *Яблони цветут* from *Новые люди* by Зинаида Николаевна Гиппиус [from Библиотека русской и советской классики](https://ruslit.traumlibrary.net/book/gippius-ss15-01/gippius-ss15-01.html). (Thanks to Katherine Bowers for the text referral.)

Here we open the text and process it with the Russian spaCy model.

In [12]:
filepath = '../texts/ru.txt'
# Open and read the text (the with-block closes the file afterward)
with open(filepath, encoding='utf-8') as f:
    text = f.read()
# Process text with spaCy
document = nlp(text)

Then we loop through each token in the processed document, write its lowercased lemma to our new derivative text file, and insert a space after each one.

In [13]:
outname = filepath.replace('.txt', '-lemmatized.txt')

# Create a lemmatized version of the original text file
with open(outname, 'w', encoding='utf8') as out:
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')
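
spaCy treats punctuation marks and runs of whitespace (such as line breaks) as tokens too, so the loop above writes them into the derivative file. If you would rather leave them out, one possible variant filters on the token attributes `is_space` and `is_punct`. This is an optional sketch, not a required step: `lemmas_without_noise` is a hypothetical helper name, and the stand-in tokens exist only so the example runs without a model.

```python
from types import SimpleNamespace


def lemmas_without_noise(tokens):
    """Return lowercased lemmas, skipping whitespace and punctuation tokens."""
    return [
        token.lemma_.lower()
        for token in tokens
        if not (token.is_space or token.is_punct)
    ]


# Stand-in tokens so the sketch runs on its own; with the model loaded,
# pass the spaCy `document` object instead.
sample = [
    SimpleNamespace(lemma_="она", is_space=False, is_punct=False),
    SimpleNamespace(lemma_="сделать", is_space=False, is_punct=False),
    SimpleNamespace(lemma_=",", is_space=False, is_punct=True),
    SimpleNamespace(lemma_="\n\n", is_space=True, is_punct=False),
    SimpleNamespace(lemma_="Уметь", is_space=False, is_punct=False),
]
print(' '.join(lemmas_without_noise(sample)))
```

With a real document, you could write `' '.join(lemmas_without_noise(document))` to the output file in place of the token-by-token loop.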

## Examine Differences
The code cell below prints the original word in the text, a dash, then the lemmatized form that was written to the derivative text document that you'll use for TF-IDF and topic modeling. It's a good idea to take a look at this so you can see if there are places where the model consistently makes mistakes, and also to understand what lemmatization is doing.

For instance, an earlier version of spaCy often associated the Spanish preposition `para` ("for") with the verb `parar` ("to stop"). If you just took that derivative file and used it for TF-IDF or topic modeling without realizing what was happening, you might reach the surprising conclusion that "stop" is a very frequent word in your text, when actually it's a lemmatization problem.
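
Scrolling through the full printout gets slow for a long text. One way to narrow the search, sketched here under the assumption that inconsistency is a useful warning sign, is to flag words that received more than one lemma. `inconsistent_lemmas` is a hypothetical helper name, and the sample pairs are copied from this lesson's output; with the model loaded you could build them as `[(token.text, token.lemma_) for token in document]`.

```python
from collections import defaultdict


def inconsistent_lemmas(pairs):
    """Return lowercased surface forms that received more than one lemma."""
    lemmas_by_form = defaultdict(set)
    for text, lemma in pairs:
        lemmas_by_form[text.lower()].add(lemma.lower())
    return {
        form: lemmas
        for form, lemmas in lemmas_by_form.items()
        if len(lemmas) > 1
    }


# (original word, lemma) pairs taken from the printed output of this lesson
pairs = [
    ("сделала", "сделать"),
    ("умрет", "умереть"),
    ("умру", "умереть"),
    ("Умру", "умру"),
]
print(inconsistent_lemmas(pairs))
```

A flagged word is not necessarily an error, since some spellings legitimately belong to more than one dictionary entry, but it is a reasonable place to start reading.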

Also, keep in mind that lemmatization removes both case and gender (from adjectives and verbs). So if you want to look at how often a particular adjective or past-tense verb is applied to masculine things or characters vs. feminine, you'll need to prepare your text another way. (Be careful using the unlemmatized version, though: if you do, be sure to account for, and count, all of the adjectival declensions for each gender.)
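
As a rough illustration of counting gendered forms in the unlemmatized text, the sketch below tallies the past-tense forms of a single verb, сделать ("to do"). The lookup table `PAST_FORMS` and the function `count_gendered_forms` are hypothetical and written by hand for this example; a real study would need a table for every word of interest, and adjectives would need all of their case forms as well.

```python
import re
from collections import Counter

# Hand-built, hypothetical lookup table: past-tense forms of сделать
PAST_FORMS = {
    "сделал": "masculine",
    "сделала": "feminine",
    "сделало": "neuter",
    "сделали": "plural",
}


def count_gendered_forms(text, forms=PAST_FORMS):
    """Count how often each listed form's gender label occurs in the text."""
    # \w+ matches runs of letters, including Cyrillic, in Python 3 strings
    words = re.findall(r"\w+", text.lower())
    return Counter(forms[word] for word in words if word in forms)


# Opening lines of the example text used in this lesson
sample = "Зачем она так сделала, что я не умею жить без нее? Это она сделала, я не виноват…"
print(count_gendered_forms(sample))
```

spaCy also exposes morphological features through `token.morph`, which may be a less manual route to the same information, though its accuracy would need the same kind of spot-checking as the lemmas.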

In [14]:
for token in document:
    print(token.text + ' - ' + token.lemma_)

Яблони - яблони
цветут - цвести
* - *


 - 


I - i

 - 

Зачем - зачем
она - она
так - так
сделала - сделать
, - ,
что - что
я - я
не - не
умею - уметь
жить - жить
без - без
нее - нее
? - ?
Это - это
она - она
сделала - сделать
, - ,
я - я
не - не
виноват - виноватый
… - …


 - 


Я - я
написал - написать
это - это
– - –
и - и
мне - мне
стало - стать
странно - странный
. - .
Говорю - говорить
точно - точно
о - о
возлюбленной - возлюбленной
. - .
Но - но
возлюбленной - возлюбленной
у - у
меня - меня
нет - нет
. - .
Это - это
моя - мой
мать - мать
сделала - сделать
так - так
, - ,
что - что
я - я
умираю - умирать
без - без
нее - нее
. - .
Если - если
человека - человек
держать - держать
в - в
тепле - тепло
всю - весь
жизнь - жизнь
, - ,
а - а
потом - потом
неодетого - неодетый
выгнать - выгнать
на - на
двадцатиградусный - двадцатиградусный
мороз - мороз
, - ,
он - он
непременно - непременно
умрет - умереть
. - .
И - и
я - я
умру - умереть
. - .
Умру - умру
из - из
- - -
за - за
нее - не