Part-of-Speech Tagging for Chinese#

Note

This section, “Working in Languages Beyond English,” is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I’m grateful to Quinn for helping expand this textbook to serve languages beyond English.

In this lesson, we’re going to learn about two text analysis methods, part-of-speech tagging and keyword extraction, and apply them to Chinese-language texts. These methods will help us computationally parse sentences and better understand words in context.


spaCy and Natural Language Processing (NLP)#

To computationally identify parts of speech, we’re going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. If you’ve used the preprocessing or named entity recognition notebooks for this language, you can skip the steps for installing spaCy and downloading the language model.

Install spaCy#

To use spaCy, we first need to install the library.

!pip install -U spacy

Import Libraries#

Then we’re going to import spacy and displacy, a special spaCy module for visualization.

import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
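# Use fully qualified option names; short names like "max_rows" can be ambiguous in newer pandas versions
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)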

We’re also going to import the Counter class for counting nouns, verbs, adjectives, etc., and the pandas library for organizing and displaying data. We’re also raising the pandas display limits for the maximum number of rows and the maximum column width.
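As a quick illustration of what Counter does, here is a minimal sketch that tallies a small list of POS labels. The list sample_labels is made up for this example and is not taken from the example text.

```python
from collections import Counter

# Hypothetical POS labels, invented for illustration only
sample_labels = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]

# Counter tallies how many times each item appears in the list
label_tally = Counter(sample_labels)

# most_common() returns (item, count) pairs sorted from most to least frequent
print(label_tally.most_common())
# [('NOUN', 3), ('VERB', 2), ('ADJ', 1)]
```

We will use this same pattern later in the lesson to count the words that spaCy tags with a given part of speech.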

Download Language Model#

Next we need to download the Chinese-language model (zh_core_web_md), which will be processing and making predictions about our texts. The Chinese model was trained on the OntoNotes annotated corpus. You can download the zh_core_web_md model by running the cell below:

!python -m spacy download zh_core_web_md
Collecting zh-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.1.0/zh_core_web_md-3.1.0-py3-none-any.whl (78.8 MB)
     |████████████████████████████████| 78.8 MB 22.3 MB/s eta 0:00:01
Collecting spacy-pkuseg<0.1.0,>=0.0.27
  Downloading spacy_pkuseg-0.0.28-cp38-cp38-macosx_10_9_x86_64.whl (2.4 MB)
     |████████████████████████████████| 2.4 MB 7.4 MB/s eta 0:00:01
Requirement already satisfied: spacy<3.2.0,>=3.1.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from zh-core-web-md==3.1.0) (3.1.2)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (1.8.2)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (4.59.0)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (0.7.4)
Requirement already satisfied: typer<0.4.0,>=0.3.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (0.3.2)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (3.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (0.8.2)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (1.0.5)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.25.1)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.7 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (3.0.8)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.4.1)
Requirement already satisfied: jinja2 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.11.3)
Requirement already satisfied: thinc<8.1.0,>=8.0.8 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (8.0.8)
Requirement already satisfied: setuptools in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (52.0.0.post20210125)
Requirement already satisfied: packaging>=20.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (20.9)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.0.5)
Requirement already satisfied: pathy>=0.3.5 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (0.6.0)
Requirement already satisfied: numpy>=1.15.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (1.20.1)
Requirement already satisfied: catalogue<2.1.0,>=2.0.4 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.0.6)
Requirement already satisfied: pyparsing>=2.0.2 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from packaging>=20.0->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.4.7)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from pathy>=0.3.5->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (5.2.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (3.7.4.3)
Requirement already satisfied: idna<3,>=2.5 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (1.26.4)
Requirement already satisfied: chardet<5,>=3.0.2 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (2020.12.5)
Requirement already satisfied: cython>=0.25 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from spacy-pkuseg<0.1.0,>=0.0.27->zh-core-web-md==3.1.0) (0.29.23)
Requirement already satisfied: click<7.2.0,>=7.1.1 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from typer<0.4.0,>=0.3.0->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (7.1.2)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages (from jinja2->spacy<3.2.0,>=3.1.0->zh-core-web-md==3.1.0) (1.1.1)
Installing collected packages: spacy-pkuseg, zh-core-web-md
Successfully installed spacy-pkuseg-0.0.28 zh-core-web-md-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_md')

Note: spaCy offers models for many languages, including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.

spaCy offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.

Load Language Model#

Once the model is downloaded, we need to load it with spacy.load() and assign it to the variable nlp.

nlp = spacy.load('zh_core_web_md')

Create a Processed spaCy Document#

Whenever we use spaCy, our first step will be to create a processed spaCy document with the loaded NLP model nlp(). Most of the heavy NLP lifting is done in this line of code. After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

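filepath = '../texts/zh.txt'
# Open the file in a with-block so that it is closed once the text has been read
with open(filepath, encoding='utf-8') as file:
    text = file.read()
document = nlp(text)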

spaCy Part-of-Speech Tagging#

The tags that spaCy uses for part-of-speech are based on work done by Universal Dependencies, an effort to create a set of part-of-speech tags that work across many different languages. Texts from various languages are annotated using this common set of tags and contributed to a common repository that can be used to train models like those in spaCy.

The Universal Dependencies page has information about the annotated corpora available for each language; it’s worth looking into the corpora that were annotated for your language.

| POS | Description | Examples |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |
| ADP | adposition | in, to, during |
| ADV | adverb | very, tomorrow, down, where, there |
| AUX | auxiliary | is, has (done), will (do), should (do) |
| CONJ | conjunction | and, or, but |
| CCONJ | coordinating conjunction | and, or, but |
| DET | determiner | a, an, the |
| INTJ | interjection | psst, ouch, bravo, hello |
| NOUN | noun | girl, cat, tree, air, beauty |
| NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| PART | particle | ’s, not |
| PRON | pronoun | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun | Mary, John, London, NATO, HBO |
| PUNCT | punctuation | ., (, ), ? |
| SCONJ | subordinating conjunction | if, while, that |
| SYM | symbol | $, %, §, ©, +, −, ×, ÷, =, :), 😝 |
| VERB | verb | run, runs, running, eat, ate, eating |
| X | other | sfpksdpsxmsa |
| SPACE | space | |

Above is a POS chart taken from spaCy’s website, which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy’s POS tagging in action, we can use the spaCy module displacy on our sample document with the style= parameter set to “dep” (short for dependency parsing).

Get Part-Of-Speech Tags#

To get part-of-speech tags for every word in a document, we iterate through all the tokens in the document and pull out the .pos_ attribute for each token. We can get finer-grained dependency information with the attribute .dep_. The code below also prints the .lemma_ attribute, which normally holds the uninflected form of a word. In the output below that first column is blank, which indicates that this Chinese model did not assign lemmas; printing token.text instead would show the word itself.

for token in document:
    print(token.lemma_, token.pos_, token.dep_)
 PROPN nsubj
 PUNCT punct
 VERB ROOT
 PROPN nmod
 PROPN nmod
 ADJ amod
 NOUN dobj
 PUNCT punct
 SPACE nsubj
 PROPN dep
 NOUN advmod:loc
 PART case
 ADV advmod
 VERB amod
 PART mark
 NOUN dep
 PUNCT punct
 ADV advmod
 PRON dep
 PROPN dep
 ADJ amod
 NOUN dep
 PART discourse
 PUNCT ROOT
 ADP case
 NOUN nmod:prep
 VERB acl
 PUNCT punct
 VERB conj
 PART aux:asp
 ADJ amod
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB dep
 PART dep
 VERB conj
 VERB dep
 PART aux:asp
 NOUN compound:nn
 NOUN dobj
 PUNCT punct
 ADV neg
 VERB conj
 NOUN dobj
 PART mark
 PUNCT punct
 NOUN nsubj
 VERB ROOT
 PUNCT punct
 PUNCT punct
 VERB dep
 PUNCT punct
 ADV advmod
 VERB cop
 ADV dep
 VERB ccomp
 PART discourse
 PUNCT punct
 PUNCT punct
 PROPN nsubj
 VERB ROOT
 VERB advmod:rcomp
 VERB conj
 PUNCT punct
 ADV advmod
 VERB ROOT
 PART aux:asp
 PUNCT punct
 NOUN nmod:tmod
 VERB cop
 NOUN nmod:assmod
 PART case
 NOUN dobj
 PUNCT punct
 DET det
 NUM mark:clf
 NOUN dobj
 PUNCT punct
 NOUN conj
 PUNCT punct
 NOUN advmod:dvp
 PART mark
 VERB conj
 PUNCT punct
 ADV neg
 VERB conj
 NUM nmod:range
 NUM mark:clf
 PUNCT punct
 ADV advmod
 ADV neg
 VERB conj
 ADV advmod
 PUNCT punct
 ADV advmod
 X aux:ba
 NUM nummod
 NUM mark:clf
 VERB amod
 PROPN compound:vc
 PART mark
 PROPN dep
 PUNCT punct
 ADP case
 NOUN nmod:prep
 VERB conj
 PART aux:asp
 PUNCT punct
 ADV advmod
 VERB acl
 PART mark
 NOUN dep
 PUNCT punct
 ADV advmod
 ADV xcomp
 VERB conj
 ADV advmod
 PUNCT punct
 VERB conj
 PART aux:asp
 ADV advmod
 NOUN nsubj
 ADV advmod
 VERB conj
 PART discourse
 PUNCT punct
 NOUN nsubj
 ADV advmod
 VERB conj
 PART aux:asp
 PUNCT punct
 ADV advmod
 NOUN conj
 PUNCT punct
 NOUN conj
 PUNCT punct
 NOUN nmod:topic
 NUM dep
 NUM mark:clf
 PUNCT punct
 DET det
 NOUN compound:nn
 NOUN compound:nn
 NOUN compound:nn
 VERB conj
 PUNCT punct
 ADV dep
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB ROOT
 PUNCT punct
 VERB conj
 PART aux:asp
 NOUN nmod:assmod
 PART case
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB conj
 NUM nummod
 NUM mark:clf
 ADV advmod
 NOUN dobj
 PART discourse
 PUNCT punct
 SCONJ advmod
 NOUN nsubj
 VERB dep
 PUNCT punct
 ADV neg
 VERB conj
 NOUN compound:nn
 VERB dobj
 PUNCT punct
 NOUN nmod:assmod
 PART case
 NOUN compound:nn
 NOUN conj
 PUNCT punct
 NOUN compound:nn
 NOUN conj
 PUNCT punct
 ADV advmod
 ADV advmod
 VERB conj
 PART aux:asp
 PUNCT punct
 VERB dep
 PART aux:asp
 VERB acl
 PART mark
 NOUN dobj
 PUNCT punct
 ADP case
 NUM nummod
 NUM mark:clf
 VERB amod
 PART mark
 NOUN nmod:prep
 PUNCT punct
 VERB ROOT
 ADP case
 NOUN nmod:prep
 PUNCT punct
 NOUN nsubj
 ADV advmod
 ADV neg
 VERB aux:modal
 VERB conj
 PUNCT punct
 VERB dep
 PART aux:asp
 PRON dobj
 PUNCT punct
 SCONJ advmod
 VERB dep
 PART aux:asp
 NOUN dobj
 SCONJ advmod
 ADV neg
 VERB dep
 PUNCT punct
 ADV advmod
 ADV advmod
 VERB ROOT
 PUNCT punct
 ADV advmod
 VERB conj
 VERB ccomp
 VERB ccomp
 NOUN nsubj
 VERB ccomp
 PART aux:asp
 PUNCT punct
 VERB dep
 PART aux:asp
 ADV neg
 VERB dep
 PUNCT punct
 ADV advmod
 ADV advmod
 VERB ROOT
 PUNCT punct
 ADV advmod
 VERB ccomp
 PART aux:asp
 NOUN dobj
 PUNCT punct
 PUNCT punct
 ADV advmod
 PUNCT punct
 NOUN nsubj
 ADV neg
 VERB conj
 PUNCT punct
 PUNCT punct
 SCONJ advmod
 VERB dep
 NUM nummod
 NUM mark:clf
 NOUN ccomp
 PART discourse
 PUNCT punct
 CCONJ advmod
 VERB ROOT
 PART aux:asp
 NOUN dobj
 NUM nmod:range
 NUM mark:clf
 PUNCT punct
 VERB conj
 PART aux:asp
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB conj
 NOUN dobj
 ADP case
 NOUN nsubj
 VERB dep
 VERB advmod
 VERB conj
 PUNCT punct
 PUNCT punct
 ADV neg
 VERB ccomp
 PUNCT punct
 ADV neg
 VERB conj
 NOUN dobj
 PART discourse
 PUNCT punct
 PUNCT punct
 ADV advmod
 VERB dep
 PUNCT punct
 PRON nsubj
 VERB cop
 VERB dep
 ADV dobj
 ADV neg
 VERB dep
 PART discourse
 PUNCT ROOT
 ADV ROOT
 NUM nummod
 NUM mark:clf
 VERB amod
 PART mark
 NOUN dobj
 PUNCT punct
 NOUN nsubj
 VERB dep
 PART aux:asp
 PUNCT punct
 NOUN nsubj
 VERB xcomp
 VERB dep
 NUM nummod
 NOUN dobj
 PUNCT punct
 VERB conj
 NUM dobj
 PUNCT punct
 NOUN nsubj
 VERB dep
 PART aux:asp
 PUNCT punct
 NOUN nsubj
 ADV advmod
 VERB ROOT
 NUM nummod
 NUM mark:clf
 NOUN dobj
 PUNCT punct
 VERB conj
 NOUN nsubj
 VERB ccomp
 PART discourse
 PUNCT punct
 ADV advmod
 ADV advmod
 ADV neg
 VERB conj
 ADV punct
 NOUN nsubj
 VERB dep
 ADV advmod
 VERB conj
 NUM dep
 NUM mark:clf
 PUNCT punct
 ADV advmod
 VERB dep
 NOUN dobj
 PART nsubj
 ADV advmod
 ADV advmod
 VERB conj
 PUNCT punct
 ADJ amod
 NOUN nsubj
 ADV advmod
 VERB conj
 PART aux:asp
 PUNCT punct
 VERB dep
 NOUN dobj
 PUNCT punct
 NOUN nsubj
 ADV advmod
 VERB ROOT
 NOUN dobj
 PUNCT punct
 VERB ROOT
 NOUN dep
 VERB ccomp
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB conj
 DET det
 NOUN dobj
 VERB ccomp
 PART discourse
 PUNCT punct
 ADV advmod
 ADV advmod
 ADV dep
 NOUN dobj
 PART discourse
 PUNCT ROOT
 DET det
 NOUN nsubj
 PUNCT punct
 ADV advmod
 VERB ROOT
 PUNCT punct
 NOUN nsubj
 VERB cop
 NOUN ccomp
 PUNCT punct
 VERB conj
 VERB cop
 VERB ccomp
 NOUN dobj
 PART discourse
 PUNCT punct
 PUNCT punct
 ADV advmod
 ADV advmod
 X aux:ba
 NOUN dep
 VERB conj
 PROPN nmod:assmod
 PART case
 NOUN dobj
 ADV advmod
 VERB conj
 PUNCT punct
 ADV dep
 PUNCT punct
 ADV dep
 DET det
 NUM mark:clf
 NOUN dobj
 PART discourse
 PUNCT ROOT
 SPACE dep
 VERB ROOT
 PUNCT punct
 PRON nsubj
 VERB xcomp
 VERB dep
 NOUN compound:nn
 NOUN nsubj
 VERB cop
 VERB ccomp
 PART discourse
 PUNCT punct
 ADJ amod
 NOUN dep
 NOUN dobj
 VERB dep
 VERB dep
 PUNCT punct
 NOUN advmod
 DET det
 NOUN nsubj
 VERB ROOT
 DET det
 PUNCT punct
 NOUN amod
 PUNCT punct
 PUNCT punct
 PUNCT punct
 NOUN conj
 PUNCT punct
 PUNCT punct
 PUNCT punct
 NOUN compound:nn
 NOUN dep
 PUNCT punct
 DET det
 NOUN dobj
 PUNCT punct
 PRON nmod:poss
 NOUN nsubj
 SCONJ advmod
 VERB dep
 NOUN dobj
 PART mark
 PUNCT punct
 ADV advmod
 VERB aux:modal
 VERB conj
 NOUN dobj
 ADP case
 PRON nmod:prep
 VERB ccomp
 PUNCT punct
 PROPN nsubj
 VERB conj
 PART aux:asp
 DET det
 VERB acl
 PART mark
 NOUN dobj
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB dep
 NOUN dobj
 PART discourse
 PUNCT punct
 ADV advmod
 VERB aux:modal
 VERB conj
 VERB ccomp
 ADV advmod
 PUNCT punct
 ADV advmod
 PRON nsubj
 VERB conj
 PART aux:asp
 PRON nmod:assmod
 PART case
 NOUN dobj
 PUNCT punct
 PRON nsubj
 ADV neg
 VERB aux:modal
 ADV neg
 VERB dep
 PART discourse
 PART discourse
 PUNCT ROOT
 NOUN nsubj
 VERB ROOT
 PRON dep
 VERB ccomp
 NOUN dobj
 PUNCT punct
 VERB conj
 NOUN dobj
 PUNCT punct
 VERB conj
 PRON nmod:assmod
 PART case
 NOUN dobj
 PUNCT punct
 VERB conj
 PRON dobj
 VERB ccomp
 PUNCT punct
 PRON nsubj
 ADV advmod
 ADV neg
 VERB aux:modal
 ADP case
 PRON nmod:prep
 VERB dep
 PUNCT punct
 ADV advmod
 VERB conj
 PART aux:asp
 PART discourse
 PUNCT punct
 PRON nsubj
 ADV advmod
 PRON nmod:poss
 NOUN compound:nn
 PRON nsubj
 VERB ROOT
 NOUN dobj
 PUNCT punct
 ADV advmod
 NOUN nsubj
 ADV advmod
 VERB conj
 NOUN dobj
 VERB conj
 PART aux:asp
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB advmod
 VERB conj
 PUNCT punct
 NOUN dobj
 PUNCT punct
 NOUN nsubj
 VERB ROOT
 PRON nsubj
 VERB ccomp
 PUNCT punct
 PRON nsubj
 ADV advmod
 ADV neg
 VERB compound:vc
 VERB conj
 PRON nsubj
 VERB ccomp
 PUNCT punct
 SCONJ advmod
 NOUN advmod:loc
 PART case
 VERB dep
 PUNCT punct
 ADV advmod
 VERB conj
 NOUN dobj
 ADV advmod
 ADV neg
 VERB ccomp
 PART discourse
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB ROOT
 VERB ccomp
 PUNCT punct
 ADV advmod
 VERB nsubj
 ADV neg
 VERB conj
 PUNCT punct
 ADV advmod
 VERB conj
 NOUN nsubj
 VERB ccomp
 NOUN nsubj
 VERB ccomp
 PUNCT punct
 ADV advmod
 ADV advmod
 X aux:ba
 PRON dep
 VERB conj
 PART aux:asp
 PUNCT punct
 VERB conj
 NOUN dobj
 VERB ccomp
 VERB ccomp
 PUNCT punct
 NOUN nsubj
 VERB aux:modal
 ADV advmod
 VERB conj
 NOUN dobj
 PUNCT punct
 ADP case
 ADV neg
 VERB nmod:prep
 PRON nsubj
 VERB ccomp
 PUNCT punct
 VERB conj
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB cop
 VERB ROOT
 PART discourse
 PUNCT punct
 VERB conj
 NOUN dobj
 ADV neg
 VERB conj
 PART discourse
 PUNCT punct
 ADV advmod
 VERB ROOT
 PUNCT punct
 NOUN dep
 VERB dep
 NOUN nsubj
 VERB dobj
 PART discourse
 PART discourse
 PUNCT punct
 ADV advmod
 VERB cop
 VERB acl
 NOUN dobj
 PUNCT punct
 VERB conj
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB conj
 PART mark
 NOUN nsubj
 VERB dep
 PART aux:asp
 NOUN dobj
 PUNCT punct
 PRON nsubj
 VERB conj
 PRON nmod:assmod
 PART case
 NOUN dobj
 PART discourse
 PUNCT ROOT
 ADV advmod
 VERB ROOT
 PART aux:asp
 PRON nmod:assmod
 PART case
 NOUN dobj
 PUNCT punct
 ADV advmod
 ADV neg
 VERB dep
 NOUN dobj
 PART discourse
 PUNCT punct
 VERB ROOT
 PUNCT punct
 ADV advmod
 ADV advmod
 VERB conj
 PART dep
 NOUN dobj
 PART discourse
 PUNCT punct
 DET det
 NOUN nsubj
 PUNCT punct
 VERB dep
 VERB advmod:rcomp
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB ROOT
 VERB ccomp
 PUNCT punct
 ADV advmod
 VERB dep
 ADV advmod
 NUM dep
 NOUN dobj
 PUNCT punct
 ADV advmod
 ADV advmod
 ADP case
 PRON nmod:prep
 VERB conj
 PUNCT punct
 SPACE dep
 ADV advmod
 ADV advmod
 PUNCT punct
 PRON nsubj
 VERB ROOT
 PRON compound:nn
 NOUN dobj
 PUNCT punct
 X aux:ba
 NOUN det
 NOUN dep
 PUNCT punct
 ADV advmod
 PROPN dep
 PUNCT punct
 X aux:ba
 NOUN amod
 NOUN dep
 PUNCT punct
 ADV advmod
 VERB conj
 PUNCT punct
 ADV advmod
 ADP case
 VERB nmod:prep
 PART aux:asp
 PUNCT punct
 NOUN nmod:tmod
 ADV advmod
 VERB advmod
 VERB conj
 PART discourse
 PUNCT punct
 VERB dep
 PART mark
 PART dep
 PUNCT punct
 ADV advmod
 VERB ROOT
 PUNCT punct
 ADJ dep
 ADV advmod
 VERB ccomp
 PUNCT punct
 PUNCT punct
 VERB conj
 NOUN nsubj
 ADV advmod
 VERB xcomp
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB conj
 PRON dobj
 ADP punct
 NOUN nsubj
 VERB conj
 PUNCT punct
 VERB xcomp
 VERB compound:vc
 VERB conj
 PUNCT punct
 ADV advmod
 VERB conj
 PRON dobj
 PUNCT punct
 NOUN nsubj
 VERB dep
 NOUN dobj
 PART mark
 PUNCT punct
 ADV advmod
 ADV advmod
 VERB ROOT
 PART aux:asp
 NOUN nmod:assmod
 PART case
 NOUN dobj
 PUNCT punct
 VERB conj
 PRON dobj
 VERB ccomp
 VERB compound:vc
 PUNCT punct
 NOUN nsubj
 VERB conj
 PUNCT punct
 NOUN nsubj
 ADV neg
 ADV advmod
 VERB conj
 PART aux:asp
 NOUN dobj
 PUNCT punct
 VERB xcomp
 VERB conj
 PRON dobj
 NOUN dobj
 PUNCT punct
 NOUN nsubj
 ADV advmod
 VERB cop
 VERB conj
 PUNCT punct
 ADV advmod
 ADV advmod
 ADP case
 PRON nmod:prep
 VERB conj
 PUNCT punct
 NOUN compound:nn
 NOUN dep
 PART case
 PART dep
 PUNCT punct
 SCONJ advmod
 VERB aux:modal
 NOUN dep
 ADV advmod
 VERB dep
 ADV advmod
 VERB ROOT
 PUNCT punct
 ADP case
 NOUN nmod:prep
 ADV advmod
 VERB xcomp
 ADV advmod
 VERB conj
 NOUN dobj
 PUNCT punct
 VERB conj
 PUNCT punct
 ADJ advmod
 VERB dep
 PART mark
 PART dep
 PUNCT punct
 VERB xcomp
 VERB ROOT
 NOUN compound:nn
 NOUN conj
 PUNCT punct
 NOUN dobj
 PUNCT punct
 VERB conj
 DET det
 ADP case
 NOUN nmod:prep
 VERB amod
 PART mark
 NOUN dobj
 PUNCT punct
 VERB dep
 PART mark
 PART dep
 PUNCT punct
 VERB xcomp
 VERB ROOT
 PART aux:asp
 NOUN compound:nn
 NOUN dobj
 PUNCT punct
 ADV advmod
 VERB conj
 VERB dobj
 PUNCT punct
 ADV nsubj
 VERB cop
 PRON nmod:assmod
 PART case
 NOUN ROOT
 PART discourse
 PUNCT punct
 ADV dep
 VERB dep
 NOUN nsubj
 NOUN ccomp
 PART mark
 PART discourse
 PUNCT punct
 NOUN dep
 PRON nsubj
 ADV advmod
 ADV neg
 VERB ROOT
 PUNCT punct
 PRON nsubj
 ADV advmod
 VERB conj
 VERB ccomp
 PRON dobj
 PART discourse
 PUNCT punct
 PRON dep
 PRON nsubj
 ADV advmod
 VERB ROOT
 PUNCT punct
 VERB acl
 NOUN dobj
 PART mark
 NOUN nmod:prep
 PUNCT punct
 ADV advmod
 ADV advmod
 VERB dep
 PART discourse
 PUNCT punct
 DET det
 NUM nsubj
 NUM dep
 ADV neg
 VERB aux:modal
 VERB ROOT
 PRON nmod:assmod
 PART case
 NOUN nsubj
 ADV advmod
 VERB ccomp
 PART discourse
 PUNCT punct

Practicing with the example text#

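When working with languages that have inflection, we typically use token.lemma_ instead of token.text, which you’ll find in the English examples. This matters when we’re counting, so that differently inflected forms of a word (e.g. masculine vs. feminine or singular vs. plural) aren’t counted as if they were different words. Chinese words are not inflected, however, and the output above shows that this model leaves token.lemma_ empty. As a result, the cells below group every matching token under a single blank entry; to count the words themselves, swap token.lemma_ for token.text.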

filepath = "../texts/zh.txt"
document = nlp(open(filepath, encoding="utf-8").read())

Get Adjectives#

| POS | Description | Examples |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |

To extract and count the adjectives in the example text, we will follow the same model as above, except we’ll add an if statement that will pull out words only if their POS label matches “ADJ.”

Python Review

While we demonstrate how to extract parts of speech in the sections below, we’re also going to reinforce some integral Python skills. Notice how we use for loops and if statements to .append() specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.

Here we make a list of the adjectives identified in the example text:

adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.lemma_)
adjs
['', '', '', '', '', '', '']

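Then we count the unique adjectives in this list with the Counter class. Because every lemma here is an empty string, the seven adjectives are tallied as a single blank entry: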

adjs_tally = Counter(adjs)
adjs_tally.most_common()
[('', 7)]

Then we make a dataframe from this list:

df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]
adj count
0 7

Get Nouns#

| POS | Description | Examples |
|---|---|---|
| NOUN | noun | girl, cat, tree, air, beauty |

To extract and count nouns, we can follow the same model as above, except we will change our if statement to check for POS labels that match “NOUN”.

nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.lemma_)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]
noun count
0 159
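The single blank row above suggests that all 159 nouns were grouped under an empty lemma. One way to see the words themselves is to collect the token text rather than the lemma. The sketch below shows the same loop-and-count pattern on a few (text, POS) pairs copied by hand from the bigram output shown later in this lesson. The list sample_tokens is a stand-in for a real spaCy document, so these counts are not results for the full text.

```python
from collections import Counter

# A handful of (text, POS) pairs copied from the lesson's bigram output;
# this list stands in for a processed spaCy document
sample_tokens = [
    ("同胞", "NOUN"), ("世界", "NOUN"), ("上", "PART"), ("最", "ADV"),
    ("不平", "VERB"), ("的", "PART"), ("事", "NOUN"), ("就是", "ADV"),
    ("我们", "PRON"), ("二万万", "PROPN"), ("女", "ADJ"), ("同胞", "NOUN"),
    ("了", "PART"),
]

# Keep the surface text of every token labeled as a noun
nouns = []
for text, pos in sample_tokens:
    if pos == "NOUN":
        nouns.append(text)

# Tally the collected nouns
nouns_tally = Counter(nouns)
print(nouns_tally.most_common())
# [('同胞', 2), ('世界', 1), ('事', 1)]
```

With a real spaCy document, the equivalent change would be to append token.text instead of token.lemma_ inside the loop.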

Get Verbs#

| POS | Description | Examples |
|---|---|---|
| VERB | verb | run, runs, running, eat, ate, eating |

To extract and count verbs, we can follow a similar model to the examples above. This time, however, we’re going to make our code more concise (while still changing our if statement to match the POS label “VERB”).

Python Review

We can use a list comprehension to get our list of verbs in a single line of code! Closely examine the first line of code below:

verbs = [token.lemma_ for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]
verb count
0 232

Keyword Extraction#

Get Sentences with Keyword#

spaCy can also identify sentences in a document. To access sentences, we can iterate through document.sents and pull out the .text of each sentence.

We can use spaCy’s sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below. Note that the function assumes that the keyword provided will be exactly the same as it appears in the text.

With the function find_sentences_with_keyword(), we will iterate through document.sents and pull out any sentence that contains a particular “keyword.” Then we will display these sentences with the keywords bolded.

import re
from IPython.display import Markdown, display
def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(re.escape(keyword), lambda match: f"**{match.group(0)}**", sentence, flags=re.IGNORECASE)
            
            display(Markdown(sentence))
find_sentences_with_keyword(keyword="男人", document=document)

这还不说,到了择亲的时光,只凭着两个不要脸媒人的话,只要男家有钱有势,不问身家清白,男人的性情好坏、学问高低,就不知不觉应了。

到了那边,要是遇着男人虽不怎么样,却还安分,这就算前生有福今生受了。

要是说一二句抱怨的话,或是劝了男人几句,反了腔,就打骂俱下;别人听见还要说:“不贤惠,不晓得妇道呢!”

女子死了,男人只带几根蓝辫线,有嫌难看的,连带也不带;人死还没三天,就出去偷鸡摸狗;七还未尽,新娘子早已进门了。

自己又看看无功受禄,恐怕行不长久,一听见男子喜欢脚小,就急急忙忙把它缠了,使男人看见喜欢,庶可以藉此吃白饭。

自然是有学问、有见识、出力作事的男人得了权利,我们作他的奴隶了。

诸位晓得国是要亡的了,男人自己也不保,我们还想靠他么?
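The bolding step can also be tried on its own, without spaCy. Below is a minimal sketch that assumes a hypothetical helper named bold_keyword(), which is not part of the lesson’s code. It escapes the keyword with re.escape() so that any regular expression special characters in the keyword are matched literally, and it wraps whatever text was actually matched in Markdown bold markers.

```python
import re

def bold_keyword(keyword, sentence):
    """Wrap every occurrence of keyword in Markdown bold markers (hypothetical helper)."""
    # Escape the keyword so characters such as "+" or "?" are treated literally
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    # Re-insert the matched text itself so its original capitalization is kept
    return pattern.sub(lambda match: f"**{match.group(0)}**", sentence)

# One of the sentences found above
sentence = "诸位晓得国是要亡的了,男人自己也不保,我们还想靠他么?"
bolded = bold_keyword("男人", sentence)
print(bolded)
```

Escaping makes no difference for a keyword like 男人, but it helps avoid surprises if a keyword contains punctuation.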

Get Keyword in Context#

We can also find out about a keyword’s more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what’s called ngrams. “Ngrams” are any sequence of n tokens in a text. They’re an important concept in computational linguistics and NLP. (Have you ever played with Google’s Ngram Viewer?)
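To make the definition concrete, here is a minimal sketch that builds ngrams of any size from a short list of tokens. The helper name make_ngrams() is an assumption made for this illustration, and the tokens are a few words copied from the sample text.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # Line up the token list against copies of itself shifted by 1, 2, ... n-1 positions
    shifted_lists = (tokens[offset:] for offset in range(n))
    # zip stops at the shortest list, so only complete ngrams are produced
    return list(zip(*shifted_lists))

sample_tokens = ["世界", "上", "最", "不平", "的", "事"]

bigrams_example = make_ngrams(sample_tokens, 2)
trigrams_example = make_ngrams(sample_tokens, 3)

print(bigrams_example)   # 5 two-token sequences
print(trigrams_example)  # 4 three-token sequences
```

A list of six tokens yields five bigrams and four trigrams, since each ngram needs n consecutive tokens.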

Below we’re going to make a list of bigrams, that is, all the two-word combinations from the sample text. We’re going to use these bigrams to find the neighboring words that appear alongside particular keywords.

#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams
bigrams = get_bigrams(tokens_and_labels)

Let’s take a peek at the bigrams:

bigrams[5:20]
[[('同胞', 'NOUN'), ('世界', 'NOUN')],
 [('世界', 'NOUN'), ('上', 'PART')],
 [('上', 'PART'), ('最', 'ADV')],
 [('最', 'ADV'), ('不平', 'VERB')],
 [('不平', 'VERB'), ('的', 'PART')],
 [('的', 'PART'), ('事', 'NOUN')],
 [('事', 'NOUN'), ('就是', 'ADV')],
 [('就是', 'ADV'), ('我们', 'PRON')],
 [('我们', 'PRON'), ('二万万', 'PROPN')],
 [('二万万', 'PROPN'), ('女', 'ADJ')],
 [('女', 'ADJ'), ('同胞', 'NOUN')],
 [('同胞', 'NOUN'), ('了', 'PART')],
 [('了', 'PART'), ('从', 'ADP')],
 [('从', 'ADP'), ('小生', 'NOUN')],
 [('小生', 'NOUN'), ('下来', 'VERB')]]

Now that we have our list of bigrams, we’re going to make a function get_neighbor_words(). This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the pos_label parameter.

def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()
get_neighbor_words("男人", bigrams)
[('了', 3),
 ('的', 2),
 ('清白', 1),
 ('着', 1),
 ('虽', 1),
 ('几', 1),
 ('只', 1),
 ('使', 1),
 ('看见', 1),
 ('得', 1),
 ('自己', 1)]
get_neighbor_words("男人", bigrams, pos_label='VERB')
[('清白', 1), ('使', 1), ('看见', 1), ('得', 1)]
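get_neighbor_words() pools the words that come before and after the keyword. If the direction matters for your question, one possible variant tallies left and right neighbors separately. The sketch below is an assumption and not part of the lesson’s code: the function name get_left_right_neighbors() is hypothetical, and the sample list is a short run of (text, POS) pairs copied from the bigram output above.

```python
from collections import Counter

def get_left_right_neighbors(keyword, tokens_and_labels):
    """Tally the words immediately before and after keyword (hypothetical helper)."""
    left_tally, right_tally = Counter(), Counter()
    for index, (word, label) in enumerate(tokens_and_labels):
        if word != keyword:
            continue
        # The word just before the keyword, if the keyword is not the first token
        if index > 0:
            left_tally[tokens_and_labels[index - 1][0]] += 1
        # The word just after the keyword, if the keyword is not the last token
        if index < len(tokens_and_labels) - 1:
            right_tally[tokens_and_labels[index + 1][0]] += 1
    return left_tally.most_common(), right_tally.most_common()

# A short run of (text, POS) pairs copied from the bigram output above
sample = [
    ("同胞", "NOUN"), ("世界", "NOUN"), ("上", "PART"), ("最", "ADV"),
    ("不平", "VERB"), ("的", "PART"), ("事", "NOUN"), ("就是", "ADV"),
    ("我们", "PRON"), ("二万万", "PROPN"), ("女", "ADJ"), ("同胞", "NOUN"),
    ("了", "PART"),
]

left, right = get_left_right_neighbors("同胞", sample)
print(left)   # [('女', 1)]
print(right)  # [('世界', 1), ('了', 1)]
```

To run it on the full text, you could pass the tokens_and_labels list created earlier instead of the short sample.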

Your Turn!#

Try out find_sentences_with_keyword() and get_neighbor_words() with your own keywords of interest.

find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)