# Text Pre-Processing for Danish

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

This lesson is for anyone who wants to try the [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF.html) or [topic modeling](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling.html) lessons on Danish texts. Before continuing with those lessons, you need to create a *lemmatized derivative* of your original Danish text: a version in which every word is replaced with its dictionary form. This lemmatized version works much better with word-count-based methods.

## Install spaCy

In [None]:
!pip install -U spacy

## Download Language Model

In [None]:
!python -m spacy download da_core_news_md

## Import Libraries

In [1]:
import spacy

## Load Language Model
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

In [2]:
import da_core_news_md
nlp = da_core_news_md.load()

2. We can load the model by name.

In [None]:
# Uncomment the next line to load the model by name instead
# nlp = spacy.load('da_core_news_md')

If you have just downloaded the model for the first time, use Option 1, which lets you use the model immediately. With Option 2, you will likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).
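Before choosing between the two options, it can help to check whether the model package is visible to the running kernel. The following is a minimal sketch rather than part of the original lesson. The helper name `model_is_importable` is hypothetical, and it assumes (consistent with Option 1) that a downloaded spaCy model is installed as an importable Python package. It uses only the standard library.

```python
import importlib
import importlib.util


def model_is_importable(package_name):
    """Return True if `package_name` can be imported in this kernel."""
    # Refresh the import system's caches so that packages installed
    # after the kernel started have a chance of being found.
    importlib.invalidate_caches()
    return importlib.util.find_spec(package_name) is not None


if model_is_importable("da_core_news_md"):
    print("Model package found in this kernel.")
else:
    print("Model package not found: run the download cell, "
          "then restart the kernel if it is still missing.")
```

If the check still reports the package as missing right after a download, restarting the kernel is the likely fix.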

## Process Document
To create a derivative text file that we can use with TF-IDF, topic modeling, or other word-count based methods, we need to use spaCy to *lemmatize* the text, replacing each word with its dictionary form. The result will be an ungrammatical text that will produce better results than the original version when used with word-count methods.

The example text for Danish is *Evangelines Genvordigheder: Til Kvinder med rødt Haar* by Elinor Glyn [from Project Gutenberg](http://www.gutenberg.org/ebooks/33632).

Here we open the text and process it with the Danish spaCy model.

In [3]:
filepath = '../texts/da.txt'
# Open and read the text, closing the file when done
with open(filepath, encoding='utf-8') as f:
    text = f.read()
# Process text with spaCy
document = nlp(text)

Then we loop through each token in the processed document, write its lowercased lemma followed by a space, and in this way build our new derivative text file.

In [21]:
outname = filepath.replace('.txt', '-lemmatized.txt')

# Create a lemmatized version of the original text file
with open(outname, 'w', encoding='utf-8') as out:
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')
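The loop above writes every token, including punctuation marks and the whitespace tokens that spaCy produces for line breaks. If you would rather leave those out of the derivative file, spaCy tokens expose the boolean attributes `is_space` and `is_punct`. The sketch below is one possible variation, not the lesson's own method. The helper `lemmas_for_counting` and the stand-in tokens are hypothetical and exist only so the example runs without a language model. In the notebook you would pass `document` instead of `sample`.

```python
from types import SimpleNamespace


def lemmas_for_counting(tokens):
    """Collect lowercased lemmas, skipping whitespace and punctuation tokens."""
    lemmas = []
    for token in tokens:
        if token.is_space or token.is_punct:
            continue
        lemmas.append(token.lemma_.lower())
    return lemmas


# Stand-in tokens (an assumption for illustration): they mimic the three
# spaCy token attributes the helper reads, so no model is needed here.
def fake_token(lemma, is_space=False, is_punct=False):
    return SimpleNamespace(lemma_=lemma, is_space=is_space, is_punct=is_punct)


sample = [
    fake_token("KØBENHAVN"),
    fake_token(",", is_punct=True),
    fake_token("\n\n", is_space=True),
    fake_token("Dagbog"),
]
print(" ".join(lemmas_for_counting(sample)))
```

Whether to drop punctuation depends on the downstream lesson. Many word-count tools discard it anyway, so this step is optional.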

## Examine Differences
The code cell below prints the original word in the text, a dash, then the lemmatized form that was written to the derivative text document that you'll use for TF-IDF and topic modeling. It's a good idea to take a look at this so you can see if there are places where the model consistently makes mistakes.

For instance, an earlier version of spaCy often associated the Spanish preposition `para` ("for") with the verb `parar` ("to stop"). If you just took that derivative file and used it for TF-IDF or topic modeling without realizing what was happening, you might reach the surprising conclusion that "stop" is a very frequent word in your text, when it is actually a lemmatization problem.

In [22]:
for token in document:
    print(token.text + ' - ' + token.lemma_)

﻿EVANGELINES - ﻿EVANGELINES

  - 
 
GENVORDIGHEDER - GENVORDIGHEDER


  - 

 
TIL - TIL

  - 
 
KVINDER - KVINDER
MED - MED
RØDT - RØDT
HAAR - HAAR





  - 




 
ELINOR - ELINOR
GLYN - GLYN


  - 

 
EVANGELINES - EVANGELINES

  - 
 
GENVORDIGHEDER - GENVORDIGHEDER


  - 

 
AUTORISERET - AUTORISERET
OVERSÆTTELSE - OVERSÆTTELSE

  - 
 
FOR - FOR
NORGE - NORGE
OG - OG
DANMARK - DANMARK
AF - AF

  - 
 
HEDVIG - HEDVIG
MAGNUSSEN - MAGNUSSEN


  - 

 
NY - NY
UDGAVE - UDGAVE


  - 

 
MARTINS - MARTINS
FORLAG - FORLAG

  - 
 
KØBENHAVN - KØBENHAVN
& - &
KRISTIANIA - KRISTIANIA

  - 
 
MCMXXI - MCMXXI





  - 




 
MARTIN'S - MARTIN'S
FORLAGSTRYKKERI - FORLAGSTRYKKERI
, - ,
KØBENHAVN - KØBENHAVN





 - 





BEGYNDELSEN - BEGYNDELSEN
PAA - PAA
EVANGELINES - EVANGELINES
DAGBOG - DAGBOG



                                                - 


                                               
_ - _
Branches - Branches
Park - Park
_ - _
. - .

                                               - 
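Reading a token-by-token printout of a whole novel is slow. One way to spot consistent mistakes faster is to tally the (word, lemma) pairs where lemmatization changed the word, then look at the most frequent ones. This is a minimal sketch under stated assumptions: the helper `changed_pairs` is hypothetical, and the stand-in tokens reuse the Spanish `para` / `parar` case mentioned earlier in this section so the example runs without a model. In the notebook you would pass `document`.

```python
from collections import Counter
from types import SimpleNamespace


def changed_pairs(tokens):
    """Count (word, lemma) pairs where lemmatization changed the word."""
    counts = Counter()
    for token in tokens:
        word = token.text.lower()
        lemma = token.lemma_.lower()
        if word != lemma:
            counts[(word, lemma)] += 1
    return counts


# Stand-in tokens (an assumption for illustration only).
sample = [
    SimpleNamespace(text="para", lemma_="parar"),
    SimpleNamespace(text="Para", lemma_="parar"),
    SimpleNamespace(text="casa", lemma_="casa"),
]
# Show the ten most frequent changes, most common first
for (word, lemma), n in changed_pairs(sample).most_common(10):
    print(f"{word} -> {lemma}: {n}")
```

A pair that appears very often and looks wrong to a reader of the language is a good candidate for manual correction before running TF-IDF or topic modeling.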