# Named Entity Recognition for Spanish

:::{note}
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
:::

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Spanish. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

The example text for Spanish is *Oasis en la vida* by Juana Manuela Gorriti [from Project Gutenberg](http://www.gutenberg.org/ebooks/62564).

**Here's a preview of spaCy's NER tagging of *Oasis en la vida*.**

If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Spanish NER model is considerably worse at recognizing entities, and is especially poor at distinguishing entity types such as ORG, LOC, and PER. A model needs a lot of labeled examples to learn to distinguish entity types; currently, the English model is the only one that does a decent job of it.

You can read more about the [data sources used to train the Spanish models](https://spacy.io/models/es) on the spaCy model page.

In [8]:
displacy.render(document, style="ent")

---

## NER with spaCy
If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also importing the `Counter` class (from the `collections` module) for counting people, places, and things, and the `pandas` library for organizing and displaying data. The last two lines raise the pandas display defaults for the maximum number of rows and the maximum column width.

### Download Language Model

Next we need to download the Spanish-language model (`es_core_news_md`), which will process our texts and make predictions about them. You can read more about the [data sources used to train the Spanish models](https://spacy.io/models/es) on the spaCy model page.

In [2]:
!python -m spacy download es_core_news_md

Collecting es-core-news-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.7.0/es_core_news_md-3.7.0-py3-none-any.whl (42.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 MB 41.5 MB/s eta 0:00:00
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_md')


### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [3]:
import es_core_news_md
nlp = es_core_news_md.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('es_core_news_md')

If you just downloaded the model for the first time, it's advisable to use Option 1, which lets you use the model immediately. With Option 2, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel... in the JupyterLab menu).
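
Since the download only needs to happen once, it can help to check whether the model package is already installed before running the download cell. Below is a minimal sketch, assuming the model is installed as a regular Python package (which is what lets Option 1 import it). The helper name `model_is_installed` is hypothetical and is not part of spaCy.

```python
import importlib.util

def model_is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

# Hypothetical check before deciding whether to run the download cell
if model_is_installed("es_core_news_md"):
    print("Model found; you can skip the download step.")
else:
    print("Model not found; run the download cell above first.")
```

This only checks that the package can be found. It does not verify the model version or that it is compatible with your installed spaCy version.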

## Process Document

We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run `nlp()` on the text to create our `document`.

In [6]:
filepath = '../texts/es.txt'
# Use a context manager so the file is closed as soon as it has been read
with open(filepath, encoding='utf-8') as file:
    text = file.read()
document = nlp(text)

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

In [7]:
document.ents

(﻿INTRODUCCION,
 ECONOMÍA POLÍTICA,
 El sombrío Prudhon,
 imbuído,
 Santos
 Padres de la Iglesia,
 teorías,
 La industria,
 VACA-GUZMAN,
 OASIS EN,
 AUTORA,
 Mauricio Ridel,
 Fin,
 Mauricio,
 copas de los árboles,
 Enrique,
 María,
 Catorce horas,
 Un poco de sueño,
 Mauricio,
 Diez horas! ¡Ah!,
 Redaccion,
 Emilio,
 Sábelo,
 Mauricio,
 Regente,
 Suma: ¡catorce horas!...
 ¡Adios,
 fariseo.--,
 Mauricio,
 Emilio,
 ánsia,
 Mauricio,
 Uncido,
 En los teatros,
 Enigma,
 Emilio,
 Mauricio,
 Cárlos Ridel,
 ¡Madrastra!,
 Siempre espiados por la
 mirada,
 Mauricio,
 Víctima de una semejanza,
 Europa,
 Francia,
 Paris,
 sábio Blain,
 Colombe,
 Los exámenes,
 Sagrado Libro,
 Mauricio,
 Mr,
 Blain,
 Colombe,
 Notre Dame du bon Secour,
 Envíaselo dentro de
 tu primera,
 La hora,
 Colombe,
 La luz de un lejano,
 Sonrióle,
 Colombe,
 Besó la santa imágen,
 guardóla,
 Mauricio,
 A la
 inquieta turbulencia del niño,
 Mauricio,
 Allá,
 Mauricio,
 Mauricio,
 Suscribíala,
 Buenos Aires,
 otro--decíale,
 

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [9]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

INTRODUCCION MISC
ECONOMÍA POLÍTICA MISC
El sombrío Prudhon MISC
imbuído LOC
Santos
Padres de la Iglesia ORG
teorías PER
La industria MISC
VACA-GUZMAN ORG
OASIS EN ORG
AUTORA ORG
Mauricio Ridel PER
Fin MISC
Mauricio LOC
copas de los árboles MISC
Enrique PER
María PER
Catorce horas MISC
Un poco de sueño MISC
Mauricio LOC
Diez horas! ¡Ah! MISC
Redaccion MISC
Emilio PER
Sábelo LOC
Mauricio LOC
Regente PER
Suma: ¡catorce horas!...
¡Adios MISC
fariseo.-- MISC
Mauricio LOC
Emilio PER
ánsia LOC
Mauricio PER
Uncido PER
En los teatros MISC
Enigma MISC
Emilio PER
Mauricio LOC
Cárlos Ridel PER
¡Madrastra! MISC
Siempre espiados por la
mirada MISC
Mauricio PER
Víctima de una semejanza MISC
Europa LOC
Francia LOC
Paris PER
sábio Blain PER
Colombe LOC
Los exámenes MISC
Sagrado Libro MISC
Mauricio PER
Mr PER
Blain PER
Colombe PER
Notre Dame du bon Secour LOC
Envíaselo dentro de
tu primera MISC
La hora MISC
Colombe LOC
La luz de un lejano MISC
Sonrióle PER
Colombe PER
Besó la santa imágen MISC
guardóla

To extract just the named entities that have been identified as `PER` (person), we can add a simple `if` statement into the mix:

In [10]:
for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)

teorías
Mauricio Ridel
Enrique
María
Emilio
Regente
Emilio
Mauricio
Uncido
Emilio
Cárlos Ridel
Mauricio
Paris
sábio Blain
Mauricio
Mr
Blain
Colombe
Sonrióle
Colombe
guardóla
Mauricio
Mauricio
Cárlos
Ridel
Mauricio Ridel
Ridel
réstame
Mauricio
Paris
consagróse
Arrojóse
Paris
Valerio--su
Vd.
sócio bribon
sócio
Ridel
Cárlos Ridel
subióse
Mauricio
Cárlos Ridel
Había
Mauricio
apresuróse
Mauricio
sábio Blain
Colombe
Paris
Mauricio
Diciembre
Lloraba
Acojido
abrióse
Sr. Santa Coloma
Vice-Cónsul Argentino
cariñosa
conmiseracion
Vd
Julia
Rendidos
Además
Julia Lopez
Julia Lopez
Mauricio
Mauricio
exhalóse
Mauricio
Julia
Lopez
Vice-Cónsul
señor?--preguntó el automedon
Mauricio
amargura.--Sí
Mauricio
habríanle
Mauricio
Mauricio
madame Bazan
Madame Bazan
Mauricio
Capricho
madame Bazan
Mauricio
damasco azul
madame Bazan
Mauricio
Mauricio
Mauricio
Mauricio
Mauricio
Paris
Renata
Jesucristo
Renata
Barrieres
Le Courrier de la Plata
madame Arnaud
Paris
Mauricio
Renata
astrakan
Renata déme Vd.
Julia Lopez
R

## NER with Long Texts or Many Texts

In [11]:
import math
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [12]:
chunked_documents = list(nlp.pipe(text_chunks))
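
The cells above split the text into 80 chunks of roughly equal character length and process them with `nlp.pipe()`. Because the split is based purely on character count, a chunk boundary can fall in the middle of a word or a name. The following is a minimal sketch of one alternative, which breaks at the last whitespace before the size limit. The function name `chunk_on_whitespace` and the example sentence are made up for illustration.

```python
def chunk_on_whitespace(text, max_chars):
    """Split text into chunks of at most max_chars characters,
    breaking at the last space or newline before the limit when one exists."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look backwards for whitespace so a word is not cut in half
            split_at = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split_at > start:
                end = split_at + 1
        chunks.append(text[start:end])
        start = end
    return chunks

# Made-up example sentence, not taken from the novel
sample = "Mauricio Ridel vivía en Buenos Aires"
for chunk in chunk_on_whitespace(sample, 10):
    print(repr(chunk))
```

Note that this still separates multi-word names such as "Buenos Aires" when the limit falls between the two words. Splitting on paragraph or sentence boundaries would be safer for NER, at the cost of less uniform chunk sizes.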

## Get People

To extract and count the people, we will use an `if` statement that will pull out words only if their "ent" label matches "PER."

:::{admonition} Pandas Review
:class: pandasreview
 Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html"> Pandas Basics (1-3) </a> in this textbook!
:::

In [13]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | Mauricio | 52 |
| 1 | Julia | 17 |
| 2 | Cárlos Ridel | 9 |
| 3 | Renata | 9 |
| 4 | Paris | 6 |
| 5 | Ridel | 6 |
| 6 | Jesús | 6 |
| 7 | Emilio | 4 |
| 8 | Julia Lopez | 4 |
| 9 | madame Bazan | 4 |
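
The raw `PER` output earlier in this lesson contains both `madame Bazan` and `Madame Bazan`, and `Counter` treats these as two different keys. Some entities also span a line break. Here is a minimal sketch of one way to merge such variants before counting. The helper `tally_case_insensitive` is a hypothetical name, and the merging rule (collapse whitespace, ignore case) is an assumption that may be too aggressive for some texts.

```python
from collections import Counter

def tally_case_insensitive(names):
    """Count names, treating variants that differ only in case or whitespace as one."""
    counts = Counter()
    display_form = {}
    for name in names:
        # Collapse internal whitespace (including line breaks) and ignore case
        key = " ".join(name.split()).casefold()
        counts[key] += 1
        # Remember the first spelling we saw so the output stays readable
        display_form.setdefault(key, name)
    return [(display_form[key], count) for key, count in counts.most_common()]

# A short hand-made list using names that appear in the output above
print(tally_case_insensitive(["madame Bazan", "Madame Bazan", "Renata", "madame Bazan"]))
```

With the real data, you could pass the `people` list in place of the hand-made list and build the DataFrame from the returned pairs.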


## Get Places

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "LOC."

In [14]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | Mauricio | 60 |
| 1 | Buenos Aires | 12 |
| 2 | Francia | 4 |
| 3 | Burdeos | 4 |
| 4 | Senegal | 4 |
| 5 | Rio Janeiro | 4 |
| 6 | Sanabria | 3 |
| 7 | ánsia | 2 |
| 8 | Europa | 2 |
| 9 | Colombe | 2 |
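
Notice that "Mauricio" tops both the people and the places tallies: the model labels the same name `PER` in some sentences and `LOC` in others. One way to spot entities like this is to collect every label each entity text received. The sketch below uses a hypothetical helper, `label_breakdown`, and a few `(text, label)` pairs copied by hand from the output earlier in this lesson.

```python
from collections import Counter, defaultdict

def label_breakdown(entity_label_pairs):
    """Map each entity text to a Counter of the labels it received."""
    breakdown = defaultdict(Counter)
    for text, label in entity_label_pairs:
        breakdown[text][label] += 1
    return breakdown

# A few (text, label) pairs copied by hand from the output earlier in this lesson
sample_pairs = [
    ("Mauricio", "LOC"), ("Mauricio", "PER"), ("Mauricio", "LOC"),
    ("Colombe", "LOC"), ("Colombe", "PER"),
    ("Francia", "LOC"),
]

# Print only the entities that received more than one kind of label
for text, labels in label_breakdown(sample_pairs).items():
    if len(labels) > 1:
        print(text, dict(labels))
```

To run this on the whole novel, you could build the pairs from the processed chunks, for example `((ent.text, ent.label_) for doc in chunked_documents for ent in doc.ents)`.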


## Get NER in Context

In [15]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    
    # If no labels were requested, default to every label the NER component can predict
    if not desired_ner_labels:
        desired_ner_labels = list(nlp.get_pipe('ner').labels)
        
    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity text (ignoring capitalization) and the label is one we want
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces, then bold the entity, again ignoring capitalization.
                # re.escape() keeps punctuation in the entity text (such as "?" or ".") from being read as regex syntax.
                sentence_text = re.sub('\n', ' ', sentence.text)
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

In [16]:
for document in chunked_documents:
    get_ner_in_context('Francia', document)

---


**LOC**

Aunque del hogar de sus padres, el pobre niño, solo guardara crueles recuerdos, la lengua materna, el suelo de la patria, su aire, su luz, éranle necesarios, y languideció, echándolos de menos.  Por dicha suya fué el «bello país de **Francia**,» la hospitalaria Paris, el lugar de su destierro.  

---


**LOC**

Sin embargo, Mauricio amaba tambien la **Francia**.  
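
Entity text from this model often contains punctuation (the output earlier in this lesson includes entities such as `Suma: ¡catorce horas!...` and `señor?--preguntó el automedon`), and characters like `?` and `.` have special meaning in regular expressions. When bolding an entity inside a sentence, it is therefore safer to escape the entity text first. A minimal sketch of that step on its own, with `bold_entity` as a hypothetical helper name:

```python
import re

def bold_entity(sentence_text, entity_text):
    """Wrap every case-insensitive occurrence of entity_text in Markdown bold."""
    # re.escape() makes punctuation in the entity text match literally
    pattern = re.compile(re.escape(entity_text), flags=re.IGNORECASE)
    # Using a function as the replacement keeps the matched text exactly as it appears
    return pattern.sub(lambda match: f"**{match.group(0)}**", sentence_text)

print(bold_entity("Sin embargo, Mauricio amaba tambien la Francia.", "francia"))
print(bold_entity("Suma: ¡catorce horas!...", "horas!..."))
```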