# Part-of-Speech Tagging 

In this lesson, we're going to learn about the textual analysis methods *part-of-speech tagging* and *keyword extraction*. These methods will help us computationally parse sentences and better understand words in context.

We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for other languages:

- [Chinese POS Tagging](Multilingual/Chinese/03-POS-Keywords-Chinese)
- [Danish POS Tagging](Multilingual/Danish/03-POS-Keywords-Danish)
- [Portuguese POS Tagging](Multilingual/Portuguese/03-POS-Keywords-Portuguese)
- [Russian POS Tagging](Multilingual/Russian/03-POS-Keywords-Russian)
- [Spanish POS Tagging](Multilingual/Spanish/03-POS-Keywords-Spanish)


---

<blockquote class="epigraph" style=" padding: 10px">

[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

-Claire Cain Miller, ["Ada Lovelace,"](https://www.nytimes.com/interactive/2018/obituaries/overlooked-ada-lovelace.html) *NYT Overlooked Obituaries*
</blockquote>

---

## Why is Part-of-Speech Tagging Useful?

I don't mean to go all [Language Nerd](https://xkcd.com/1443/) on you, but parts of speech are important. Even if they seem kind of boring. *Parts of speech* are the grammatical units of language, such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence.

![](https://imgs.xkcd.com/comics/language_nerd.png)


By computationally identifying parts of speech, we can start computationally exploring *syntax*, the relationship between words, rather than only focusing on words in isolation, as we did with tf-idf. Though parts of speech may seem pedantic, they help computers (and us) take a crack at that ever-elusive abstract noun: *meaning*.

## spaCy and Natural Language Processing (NLP)

To computationally identify parts of speech, we're going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. The English-language spaCy model that we're going to use in this lesson was trained on an annotated corpus called ["OntoNotes"](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf): 2 million+ words drawn from "news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech," which were meticulously tagged by a group of researchers and professionals for people's names and places, for nouns and verbs, for subjects and objects, and much more.

## Install spaCy

To use spaCy, we first need to install the library.

In [None]:
!pip install -U spacy

## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [5]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 400
pd.options.display.max_colwidth =  400

We're also going to import the `Counter` class from Python's `collections` module for counting nouns, verbs, adjectives, etc., and the `pandas` library for organizing and displaying data. (We're also raising the pandas default display settings for maximum rows and maximum column width.)
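If you haven't used `Counter` before, here is a minimal sketch of what it does. The list of labels is made up for illustration (it is not model output), and the variable names are our own:

```python
from collections import Counter

# A tiny, made-up list of part-of-speech labels (not real model output)
labels = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]

# Counter tallies how many times each item appears in the list
label_tally = Counter(labels)

# most_common(n) returns the n most frequent items as (item, count) pairs
top_two = label_tally.most_common(2)
print(top_two)
```

Calling `.most_common()` with no number returns every item, ordered from most to least frequent, which is how we will use it later in this lesson.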

## Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated "OntoNotes" corpus. You can download the `en_core_web_sm` model by running the cell below:

In [6]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian*.  

*spaCy offers language and tokenization support for other languages via external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean.*

## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [6]:
nlp = spacy.load('en_core_web_sm')

## Create a Processed spaCy Document

Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data (named entities, sentence boundaries, parts of speech), and the rest of our work will be devoted to accessing this information.

To test out spaCy's part-of-speech tagging, we'll begin by processing a sample sentence from Ada Lovelace's obituary:

> [Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that
she “has thrown her magical **spell** around the most **abstract** of Sciences and has grasped
it with a **force** which few masculine intellects (in our own country at least) could have exerted over it.”

This sentence makes for an interesting example because it is syntactically complex and because it contains ambiguous words such as "spell," "abstract," and "force," each of which can serve as more than one part of speech.

In [7]:
sample = """She “has thrown her magical spell around the most abstract of Sciences."""

In [8]:
document = nlp(sample)

## spaCy Part-of-Speech Tagging

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝              |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |


Above is a POS chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy's POS tagging in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) on our sample `document` with the `style=` parameter set to "dep" (short for dependency parsing):

In [9]:
#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(document, style="dep", options=options)

As you can see, spaCy has correctly identified "spell" as a noun in our sample sentence. (Our shortened `sample` stops before the word "force.") We can confirm this by printing every token labeled `NOUN`:

In [10]:
for token in document:
    if token.pos_ == "NOUN":
        print(token, token.pos_)

spell NOUN


But if we look at the same words in a different context (in a sentence that I made up), spaCy can identify when these words have changed grammatical roles and meanings.

> You shouldn't **force** someone to learn how to **spell** Babbage. They just need practice. You can't **abstract** it.

In [11]:
document = nlp("You shouldn't force someone to learn how to spell Babbage. They just need practice. You can't abstract it.")

In [12]:
for token in document:
    if token.pos_ == "VERB":
        print(token, token.pos_)

force VERB
learn VERB
spell VERB
need VERB
abstract VERB


Where previously spaCy had identified "spell" as a noun, here spaCy correctly identifies the words "force," "spell," and "abstract" as verbs.

## Get Part-Of-Speech Tags

To get part-of-speech tags for every word in a document, we have to iterate through all the tokens in the document and pull out the `.pos_` attribute for each token. We can also get each token's dependency label, which describes its grammatical relationship to other words in the sentence, with the attribute `.dep_`.


In [13]:
for token in document:
    print(token.text, token.pos_, token.dep_)

You PRON nsubj
should AUX aux
n't PART neg
force VERB ROOT
someone PRON dobj
to PART aux
learn VERB xcomp
how SCONJ advmod
to PART aux
spell VERB xcomp
Babbage NOUN dobj
. PUNCT punct
They PRON nsubj
just ADV advmod
need VERB ROOT
practice NOUN dobj
. PUNCT punct
You PRON nsubj
ca AUX aux
n't PART neg
abstract VERB ROOT
it PRON dobj
. PUNCT punct
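To get a quick sense of how these labels are distributed, we can tally them. The sketch below is only an illustration: instead of calling the model, it copies the first sentence of the output above into a plain list of tuples by hand (the variable names are our own), so it runs without spaCy.

```python
from collections import Counter

# (text, POS label, dependency label) triples copied by hand
# from the first sentence of the output above
tagged_tokens = [
    ("You", "PRON", "nsubj"),
    ("should", "AUX", "aux"),
    ("n't", "PART", "neg"),
    ("force", "VERB", "ROOT"),
    ("someone", "PRON", "dobj"),
    ("to", "PART", "aux"),
    ("learn", "VERB", "xcomp"),
    ("how", "SCONJ", "advmod"),
    ("to", "PART", "aux"),
    ("spell", "VERB", "xcomp"),
    ("Babbage", "NOUN", "dobj"),
    (".", "PUNCT", "punct"),
]

# Count how often each part-of-speech label appears
pos_tally = Counter(pos for text, pos, dep in tagged_tokens)
print(pos_tally)

# Pull out the word that the parser labeled as the ROOT of the sentence
roots = [text for text, pos, dep in tagged_tokens if dep == "ROOT"]
print(roots)
```

With a real spaCy `document`, the same tally could be built from `token.pos_` and `token.dep_` directly.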


## Practicing with *Dracula*

In [14]:
filepath = "../texts/literature/Dracula_Bram-Stoker.txt"
document = nlp(open(filepath, encoding="utf-8").read())

## Get Adjectives

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |

To extract and count the adjectives in *Dracula*, we will follow the same model as above, except we'll add an `if` statement that will pull out words only if their POS label matches "ADJ."

:::{admonition} Python Review
:class: pythonreview
    
While we demonstrate how to extract parts of speech in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
    
:::

Here we make a list of the adjectives identified in *Dracula*:

In [15]:
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)

In [16]:
adjs

['*',
 'available',
 'next',
 'wonderful',
 'little',
 'correct',
 'possible',
 'western',
 'splendid',
 'noble',
 'Turkish',
 'good',
 'red',
 'good',
 'thirsty',
 'national',
 'able',
 'German',
 'useful',
 'able',
 'extreme',
 'Carpathian',
 'wildest',
 'least',
 'able',
 'exact',
 'own',
 'distinct',
 'latter',
 'eleventh',
 'known',
 'imaginative',
 'interesting',
 'comfortable',
 'thirsty',
 'continuous',
 'more',
 'mamaliga',
 'excellent',
 'impletata',
 'little',
 'more',
 'further',
 'unpunctual',
 'full',
 'little',
 'steep',
 'such',
 'old',
 'wide',
 'subject',
 'great',
 'strong',
 'outside',
 'clear',
 'short',
 'round',
 'picturesque',
 'pretty',
 'clumsy',
 'full',
 'white',
 'other',
 'most',
 'big',
 'strangest',
 'barbarian',
 'big',
 'great',
 'baggy',
 'dirty',
 'white',
 'white',
 'enormous',
 'heavy',
 'high',
 'long',
 'black',
 'heavy',
 'black',
 'picturesque',
 'old',
 'Oriental',
 'harmless',
 'natural',
 'dark',
 'interesting',
 'old',
 'stormy',
 'great',


Then we count how many times each adjective appears in this list with `Counter()`:

In [17]:
adjs_tally = Counter(adjs)

In [18]:
adjs_tally.most_common()

[('good', 198),
 ('old', 188),
 ('other', 185),
 ('own', 184),
 ('more', 178),
 ('great', 173),
 ('poor', 171),
 ('little', 164),
 ('dear', 151),
 ('much', 148),
 ('such', 129),
 ('last', 115),
 ('same', 110),
 ('white', 103),
 ('many', 100),
 ('terrible', 99),
 ('full', 97),
 ('long', 90),
 ('few', 86),
 ('strange', 85),
 ('first', 78),
 ('new', 73),
 ('ready', 71),
 ('dead', 69),
 ('red', 67),
 ('whole', 66),
 ('open', 66),
 ('sweet', 65),
 ('dark', 60),
 ('strong', 59),
 ('very', 57),
 ('true', 54),
 ('heavy', 53),
 ('young', 53),
 ('quick', 48),
 ('able', 47),
 ('happy', 47),
 ('right', 47),
 ('asleep', 47),
 ('big', 44),
 ('small', 43),
 ('sure', 43),
 ('better', 43),
 ('best', 41),
 ('cold', 41),
 ('wild', 41),
 ('close', 41),
 ('free', 41),
 ('late', 40),
 ('certain', 40),
 ('present', 40),
 ('afraid', 39),
 ('high', 38),
 ('quiet', 37),
 ('pale', 36),
 ('silent', 35),
 ('glad', 35),
 ('usual', 33),
 ('sad', 33),
 ('possible', 32),
 ('bad', 32),
 ('least', 31),
 ('beautiful', 31

Then we make a dataframe from this list:

:::{admonition} Pandas Review
:class: pandasreview
    
 Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html"> Pandas Basics (1-3) </a> in this textbook!
    
:::

In [19]:
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]

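|     | adj    | count |
|:---:|:------:|:-----:|
| 0   | good   | 198   |
| 1   | old    | 188   |
| 2   | other  | 185   |
| 3   | own    | 184   |
| 4   | more   | 178   |
| 5   | great  | 173   |
| 6   | poor   | 171   |
| 7   | little | 164   |
| 8   | dear   | 151   |
| 9   | much   | 148   |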


## Get Nouns

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |

To extract and count nouns, we can follow the same model as above, except we will change our `if` statement to check for POS labels that match "NOUN".

In [20]:
nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]

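|     | noun  | count |
|:---:|:-----:|:-----:|
| 0   | time  | 385   |
| 1   | night | 314   |
| 2   | man   | 251   |
| 3   | room  | 231   |
| 4   | way   | 223   |
| 5   | day   | 218   |
| 6   | hand  | 202   |
| 7   | door  | 199   |
| 8   | face  | 197   |
| 9   | eyes  | 188   |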


## Get Verbs

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| VERB  | verb                      | run, runs, running, eat, ate, eating          |

To extract and count verbs, we can follow a similar-ish model to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the POS label "VERB").

:::{admonition} Python Review
:class: pythonreview
    
We can use a <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python">list comprehension</a> to get our list of verbs in a single line of code! Closely examine the first line of code below:
    
:::

In [21]:
verbs = [token.text for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

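|     | verb | count |
|:---:|:----:|:-----:|
| 0   | said | 461   |
| 1   | know | 396   |
| 2   | see  | 377   |
| 3   | have | 348   |
| 4   | came | 303   |
| 5   | went | 298   |
| 6   | come | 295   |
| 7   | do   | 278   |
| 8   | had  | 277   |
| 9   | go   | 269   |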
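If list comprehensions are new to you, it may help to see one side by side with the `for` loop it replaces. The sketch below uses a few made-up (word, label) pairs in place of spaCy tokens, so the names and data here are assumptions for illustration only:

```python
# A few (word, POS label) pairs, made up here to stand in for spaCy tokens
tagged_words = [("She", "PRON"), ("has", "AUX"), ("thrown", "VERB"),
                ("her", "PRON"), ("spell", "NOUN")]

# The for-loop version, in the style of the adjective and noun examples
verbs_loop = []
for word, label in tagged_words:
    if label == "VERB":
        verbs_loop.append(word)

# The list comprehension version: same loop and same condition, in one line
verbs_comprehension = [word for word, label in tagged_words if label == "VERB"]

# Both approaches build the same list
print(verbs_loop == verbs_comprehension)
```

The two versions produce identical lists; the comprehension simply folds the loop, the `if` check, and the `.append()` into a single expression.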


# Keyword Extraction

## Get Sentences with Keyword

spaCy can also identify sentences in a document. To access sentences, we can iterate through `document.sents` and pull out the `.text` of each sentence.

We can use spaCy's sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below.

With the function `find_sentences_with_keyword()`, we will iterate through `document.sents` and pull out any sentence that contains a particular "keyword." Then we will display these sentences with the keywords bolded.

In [22]:
import re
from IPython.display import Markdown, display

In [23]:
def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(f"{keyword}", f"**{keyword}**", sentence, flags=re.IGNORECASE)
            
            display(Markdown(sentence))

In [24]:
find_sentences_with_keyword(keyword="telegram", document=document)

"   _**telegram** from Arthur Holmwood to Quincey P. Morris.

"   _**telegram**, Arthur Holmwood to Seward.

You must send to me the **telegram** every day; and if there be cause I shall come again.

**telegram**, Seward, London, to Van Helsing, Amsterdam._  

"   _**telegram**, Seward, London, to Van Helsing, Amsterdam._  

"   _**telegram**, Seward, London, to Van Helsing, Amsterdam._  "_6 September._--Terrible change for the worse.

I hold over **telegram** to Holmwood till have seen you.

"I waited till I had seen you, as I said in my **telegram**.

A **telegram** came from Van Helsing at Amsterdam whilst I was at dinner, suggesting that I should be at Hillingham to-night, as it might be well to be at hand, and stating that he was leaving by the night mail and would join me early in the morning.         

**telegram**, Van Helsing, Antwerp, to Seward, Carfax._  (Sent to Carfax, Sussex, as no county given; delivered late by twenty-two hours.)  

The arrival of Van Helsing's **telegram** filled me with dismay.

Did you not get my **telegram**?"  I answered as quickly and coherently as I could that I had only got his **telegram** early in the morning, and had not lost a minute in coming here, and that I could not make any one in the house hear me.

"  He handed me a **telegram**:--  "Have not heard from Seward for three days, and am terribly anxious. Cannot leave.

"  In the hall I met Quincey Morris, with a **telegram** for Arthur telling him that Mrs. Westenra was dead; that Lucy also had been ill, but was now going on better; and that Van Helsing and I were with her.

*       *  _Later._--A sad home-coming in every way--the house empty of the dear soul who was so good to us; Jonathan still pale and dizzy under a slight relapse of his malady; and now a **telegram** from Van Helsing, whoever he may be:--  "You will be grieved to hear that Mrs. Westenra died five days ago, and that Lucy died the day before yesterday.

"   _**telegram**, Mrs. Harker to Van Helsing.

When we arrived at the Berkeley Hotel, Van Helsing found a **telegram** waiting for him:--       "Am coming up by train.

I have sent a **telegram** to Jonathan to come on here when he arrives in London from Whitby.

"  About half an hour after we had received Mrs. Harker's **telegram**, there came a quiet, resolute knock at the hall door.

"_Nota bene_, in Madam's **telegram** he went south from Carfax, that means he went to cross the river, and he could only do so at slack of tide, which should be something before one o'clock.

Lord Godalming went to the Consulate to see if any **telegram** had arrived for him, whilst the rest of us came on to this hotel--"the Odessus."

He had four **telegram**s, one each day since we started, and all to the same effect: that the _Czarina Catherine_ had not been reported to Lloyd's from anywhere.

He had arranged before leaving London that his agent should send him every day a **telegram** saying if the ship had been reported.

Daily **telegram**s to Godalming, but only the same story: "Not yet reported."

**telegram**, October 24th.

We were all wild with excitement yesterday when Godalming got his **telegram** from Lloyd's.

The **telegram**s from London have been the same: "no further report."

*       *       _28 October._--**telegram**.

"   _Dr. Seward's Diary._  _28 October._--When the **telegram** came announcing the arrival in Galatz I do not think it was such a shock to any of us as might have been expected.
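You may have noticed that every bolded keyword in the output is lowercase, even at the start of a sentence. That's because the function replaces each match with the keyword exactly as we typed it. If you would rather keep the original capitalization, one possible approach is sketched below. The helper name `bold_keyword()` is hypothetical (it is not part of the lesson's code), and the example sentences are made up to resemble the output above:

```python
import re

def bold_keyword(keyword, sentence):
    """Hypothetical helper: wrap each match of keyword in ** **, keeping its original capitalization."""
    # re.escape treats the keyword as literal text, even if it contains characters like "." or "?"
    # The parentheses create a group so that we can refer back to the matched text
    pattern = re.compile(f"({re.escape(keyword)})", flags=re.IGNORECASE)
    # \1 re-inserts whatever text was actually matched, capitalization included
    return pattern.sub(r"**\1**", sentence)

print(bold_keyword("telegram", "Telegram, October 24th."))
print(bold_keyword("telegram", "He had four telegrams."))
```

To try it inside `find_sentences_with_keyword()`, you could swap the second `re.sub()` line for a call to this helper.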

## Get Keyword in Context

We can also find out about a keyword's more immediate context (its neighboring words to the left and right), and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what are called *ngrams*. An ngram is any sequence of *n* consecutive tokens in a text. Ngrams are an important concept in computational linguistics and NLP. (Have you ever played with [Google's *Ngram* Viewer](https://books.google.com/ngrams)?)

Below we're going to make a list of *bigrams*, that is, all the two-word combinations from *Dracula*. We're going to use these bigrams to find the neighboring words that appear alongside particular keywords.

In [25]:
#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [26]:
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

In [27]:
bigrams = get_bigrams(tokens_and_labels)

Let's take a peek at the bigrams:

In [28]:
bigrams[5:20]

[[('by', 'ADP'), ('Bram', 'PROPN')],
 [('Bram', 'PROPN'), ('Stoker', 'PROPN')],
 [('Stoker', 'PROPN'), ('This', 'DET')],
 [('This', 'DET'), ('eBook', 'PROPN')],
 [('eBook', 'PROPN'), ('is', 'AUX')],
 [('is', 'AUX'), ('for', 'ADP')],
 [('for', 'ADP'), ('the', 'DET')],
 [('the', 'DET'), ('use', 'NOUN')],
 [('use', 'NOUN'), ('of', 'ADP')],
 [('of', 'ADP'), ('anyone', 'PRON')],
 [('anyone', 'PRON'), ('anywhere', 'ADV')],
 [('anywhere', 'ADV'), ('at', 'ADP')],
 [('at', 'ADP'), ('no', 'DET')],
 [('no', 'DET'), ('cost', 'NOUN')],
 [('cost', 'NOUN'), ('and', 'CCONJ')]]
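Although we named our function `get_bigrams()`, its `number_consecutive_words` parameter means the same logic works for trigrams and beyond. As an aside, here is a more compact way to build ngrams with Python's built-in `zip()`. This is an alternative sketch, not the approach used in the rest of this lesson; the function name `ngrams_with_zip()` is hypothetical, and the word list borrows a few words from the bigrams above:

```python
def ngrams_with_zip(items, n=2):
    """Hypothetical alternative to get_bigrams(): return every run of n consecutive items."""
    # Make n copies of the list, each starting one position later than the last
    shifted_copies = [items[i:] for i in range(n)]
    # zip() pairs up the copies position by position and stops at the shortest copy,
    # so no incomplete ngrams are produced at the end
    return [list(ngram) for ngram in zip(*shifted_copies)]

words = ["by", "Bram", "Stoker", "This", "eBook"]

bigrams_example = ngrams_with_zip(words, n=2)
trigrams_example = ngrams_with_zip(words, n=3)

print(bigrams_example)
print(trigrams_example)
```

Five words yield four bigrams and three trigrams, which matches the "slightly adjusted" length that `get_bigrams()` computes.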

Now that we have our list of bigrams, we're going to make a function `get_neighbor_words()`. This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the `pos_label` parameter.

In [29]:
def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()

In [30]:
get_neighbor_words("telegram", bigrams)

[('a', 6),
 ('from', 3),
 ('seward', 3),
 ('arthur', 2),
 ('the', 2),
 ('to', 2),
 ('my', 2),
 ('i', 2),
 ('came', 2),
 ('helsing', 2),
 ('his', 2),
 ('harker', 2),
 ('morris', 1),
 ('every', 1),
 ('see', 1),
 ('day', 1),
 ('back', 1),
 ('over', 1),
 ('it', 1),
 ('van', 1),
 ('filled', 1),
 ('early', 1),
 ('for', 1),
 ('waiting', 1),
 ('there', 1),
 ('madam', 1),
 ('he', 1),
 ('any', 1),
 ('had', 1),
 ('saying', 1),
 ('masts', 1),
 ('october', 1)]

In [31]:
get_neighbor_words("telegram", bigrams, pos_label='VERB')

[('came', 2), ('see', 1), ('filled', 1), ('waiting', 1), ('saying', 1)]
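Notice that `get_neighbor_words()` lumps together the words that come before and after the keyword. If the direction matters for your question, one possible variation keeps the two sides apart. The function name and the handful of sample bigrams below are our own assumptions, written by hand in the same format as the `bigrams` list:

```python
from collections import Counter

def get_left_and_right_neighbors(keyword, bigrams):
    """Hypothetical variant of get_neighbor_words() that keeps left and right neighbors apart."""
    keyword = keyword.lower()
    left_words, right_words = [], []

    for (first_word, first_label), (second_word, second_label) in bigrams:
        # The keyword is the second word, so the first word sits to its left
        if second_word.lower() == keyword:
            left_words.append(first_word.lower())
        # The keyword is the first word, so the second word sits to its right
        if first_word.lower() == keyword:
            right_words.append(second_word.lower())

    return Counter(left_words).most_common(), Counter(right_words).most_common()

# A handful of hand-written bigrams in the same format as the `bigrams` list
sample_bigrams = [
    [("a", "DET"), ("telegram", "NOUN")],
    [("telegram", "NOUN"), ("came", "VERB")],
    [("my", "PRON"), ("telegram", "NOUN")],
    [("a", "DET"), ("telegram", "NOUN")],
    [("telegram", "NOUN"), ("from", "ADP")],
]

left, right = get_left_and_right_neighbors("telegram", sample_bigrams)
print(left)
print(right)
```

You could pass the full `bigrams` list from *Dracula* to this function in place of `sample_bigrams`.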

## Your Turn!

Try out `find_sentences_with_keyword()` and `get_neighbor_words()` with your own keywords of interest.

In [None]:
find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)

In [None]:
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)