# Part-of-Speech Tagging for Chinese
Note
This section, “Working in Languages Beyond English,” is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I’m grateful to Quinn for helping expand this textbook to serve languages beyond English.
In this lesson, we’re going to learn about the textual analysis methods part-of-speech tagging and keyword extraction for Chinese-language texts. These methods will help us computationally parse sentences and better understand words in context.
## spaCy and Natural Language Processing (NLP)
To computationally identify parts of speech, we’re going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.
To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. If you’ve used the preprocessing or named entity recognition notebooks for this language, you can skip the steps for installing spaCy and downloading the language model.
## Install spaCy
To use spaCy, we first need to install the library.
!pip install -U spacy
## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.set_option("max_rows", 400)
pd.set_option("max_colwidth", 400)
We’re also going to import the Counter
module for counting nouns, verbs, adjectives, etc., and the pandas
library for organizing and displaying data (we’re also changing the pandas default max row and column width display setting).
## Download Language Model

Next we need to download the Chinese-language model (`zh_core_web_md`), which will process and make predictions about our texts. The Chinese model was trained on the OntoNotes annotated corpus. You can download the `zh_core_web_md` model by running the cell below:
!python -m spacy download zh_core_web_md
Collecting zh-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.1.0/zh_core_web_md-3.1.0-py3-none-any.whl (78.8 MB)
Collecting spacy-pkuseg<0.1.0,>=0.0.27
  Downloading spacy_pkuseg-0.0.28-cp38-cp38-macosx_10_9_x86_64.whl (2.4 MB)
Installing collected packages: spacy-pkuseg, zh-core-web-md
Successfully installed spacy-pkuseg-0.0.28 zh-core-web-md-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_md')
Note: spaCy offers trained models for a number of other languages, including German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian. Tokenization for some languages relies on external dependencies: the Chinese pipeline, for example, installs spacy-pkuseg for word segmentation, and spaCy supports languages such as Korean and Vietnamese through similar external tokenizer packages.
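For example, if you wanted to work with German instead, the workflow would be the same with the corresponding model name (here spaCy's medium German pipeline, de_core_news_md, as a sketch):

# Download and load spaCy's medium German pipeline
!python -m spacy download de_core_news_md

nlp_de = spacy.load('de_core_news_md')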
## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.
nlp = spacy.load('zh_core_web_md')
## Create a Processed spaCy Document

Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data: named entities, sentence boundaries, parts of speech. The rest of our work will be devoted to accessing this information.
filepath = '../texts/zh.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
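As a quick sanity check that the text was processed, we can count how many tokens and sentences spaCy identified in the document (this step is just a check, not required for the rest of the lesson):

# How many tokens and sentences did spaCy identify?
print(len(document), "tokens")
print(len(list(document.sents)), "sentences")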
## spaCy Part-of-Speech Tagging
The tags that spaCy uses for part-of-speech are based on work done by Universal Dependencies, an effort to create a set of part-of-speech tags that work across many different languages. Texts from various languages are annotated using this common set of tags and contributed to a common repository that can be used to train models like the ones spaCy provides.
The Universal Dependencies page has information about the annotated corpora available for each language; it’s worth looking into the corpora that were annotated for your language.
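spaCy actually stores two tags for every token: the coarse Universal Dependencies part of speech in `token.pos_` and a finer-grained, treebank-specific tag in `token.tag_`. A quick way to compare the two (a small sketch, not part of the analysis below) is to print them side by side for the first few tokens:

# Compare the coarse Universal Dependencies tag (.pos_)
# with the fine-grained, treebank-specific tag (.tag_)
for token in document[:10]:
    print(token.text, token.pos_, token.tag_)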
| POS | Description | Examples |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |
| ADP | adposition | in, to, during |
| ADV | adverb | very, tomorrow, down, where, there |
| AUX | auxiliary | is, has (done), will (do), should (do) |
| CONJ | conjunction | and, or, but |
| CCONJ | coordinating conjunction | and, or, but |
| DET | determiner | a, an, the |
| INTJ | interjection | psst, ouch, bravo, hello |
| NOUN | noun | girl, cat, tree, air, beauty |
| NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| PART | particle | ’s, not |
| PRON | pronoun | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun | Mary, John, London, NATO, HBO |
| PUNCT | punctuation | ., (, ), ? |
| SCONJ | subordinating conjunction | if, while, that |
| SYM | symbol | $, %, §, ©, +, −, ×, ÷, =, :), 😝 |
| VERB | verb | run, runs, running, eat, ate, eating |
| X | other | sfpksdpsxmsa |
| SPACE | space | |
Above is a POS chart taken from spaCy's website, which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy's POS tagging in action, we can use the spaCy module `displacy` on our sample `document` with the `style=` parameter set to “dep” (short for dependency parsing):
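For example, we can render the dependency parse of a single sentence; rendering the whole document at once would produce a very unwieldy diagram, so here we visualize just the first sentence (a minimal sketch):

# Visualize the dependency parse of the first sentence in the document
first_sentence = list(document.sents)[0]
displacy.render(first_sentence, style="dep")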
## Practicing with the example text

When working with languages that have inflection, we typically use `token.lemma_` instead of `token.text`, as you'll find in the English examples. This is important when we're counting, so that differently inflected forms of a word (e.g., masculine vs. feminine or singular vs. plural) aren't counted as if they were different words. Chinese words don't inflect, and the Chinese pipeline may not include a lemmatizer at all; if `token.lemma_` comes back as an empty string, you can substitute `token.text` in the examples below.
filepath = "../texts/zh.txt"
document = nlp(open(filepath, encoding="utf-8").read())
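Because the Chinese pipeline may not ship with a lemmatizer, it's worth checking whether lemmas are actually available before counting them (a quick check, assuming the `document` loaded above):

# Print text, lemma, and part of speech for the first ten tokens
# (if the lemma column is blank, the pipeline supplies no lemmas)
for token in document[:10]:
    print(token.text, token.lemma_, token.pos_)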
### Get Adjectives

| POS | Description | Examples |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |
To extract and count the adjectives in the example text, we will follow the same model as above, except we'll add an `if` statement that will pull out words only if their POS label matches “ADJ.”
Python Review

While we demonstrate how to extract parts of speech in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
Here we make a list of the adjectives identified in the example text:
adjs = []
for token in document:
if token.pos_ == 'ADJ':
adjs.append(token.lemma_)
adjs
['', '', '', '', '', '', '']
Then we count the unique adjectives in this list with `Counter()`:
adjs_tally = Counter(adjs)
adjs_tally.most_common()
[('', 7)]
Then we make a dataframe from this list:
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]
|   | adj | count |
|---|---|---|
| 0 |  | 7 |
### Get Nouns

| POS | Description | Examples |
|---|---|---|
| NOUN | noun | girl, cat, tree, air, beauty |
To extract and count nouns, we can follow the same model as above, except we will change our `if` statement to check for POS labels that match “NOUN”.
nouns = []
for token in document:
if token.pos_ == 'NOUN':
nouns.append(token.lemma_)
nouns_tally = Counter(nouns)
df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]
|   | noun | count |
|---|---|---|
| 0 |  | 159 |
### Get Verbs

| POS | Description | Examples |
|---|---|---|
| VERB | verb | run, runs, running, eat, ate, eating |
To extract and count verbs, we can follow a similar-ish model to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the POS label “VERB”).
Python Review
We can use a list comprehension to get our list of verbs in a single line of code! Closely examine the first line of code below:
verbs = [token.lemma_ for token in document if token.pos_ == 'VERB']
verbs_tally = Counter(verbs)
df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]
|   | verb | count |
|---|---|---|
| 0 |  | 232 |
## Keyword Extraction

### Get Sentences with Keyword
spaCy can also identify sentences in a document. To access sentences, we can iterate through `document.sents` and pull out the `.text` of each sentence.

We can use spaCy's sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below. Note that the function assumes that the keyword provided will be exactly the same as it appears in the text.

With the function `find_sentences_with_keyword()`, we will iterate through `document.sents` and pull out any sentence that contains a particular “keyword.” Then we will display these sentences with the keywords bolded.
import re
from IPython.display import Markdown, display
def find_sentences_with_keyword(keyword, document):
    # Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        # Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            # Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(f"{keyword}", f"**{keyword}**", sentence, flags=re.IGNORECASE)
            display(Markdown(sentence))
find_sentences_with_keyword(keyword="男人", document=document)
这还不说,到了择亲的时光,只凭着两个不要脸媒人的话,只要男家有钱有势,不问身家清白,男人的性情好坏、学问高低,就不知不觉应了。
到了那边,要是遇着男人虽不怎么样,却还安分,这就算前生有福今生受了。
要是说一二句抱怨的话,或是劝了男人几句,反了腔,就打骂俱下;别人听见还要说:“不贤惠,不晓得妇道呢!”
女子死了,男人只带几根蓝辫线,有嫌难看的,连带也不带;人死还没三天,就出去偷鸡摸狗;七还未尽,新娘子早已进门了。
自己又看看无功受禄,恐怕行不长久,一听见男子喜欢脚小,就急急忙忙把它缠了,使男人看见喜欢,庶可以藉此吃白饭。
自然是有学问、有见识、出力作事的男人得了权利,我们作他的奴隶了。
诸位晓得国是要亡的了,男人自己也不保,我们还想靠他么?
### Get Keyword in Context
We can also find out about a keyword’s more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.
To do so, we will first create a list of what’s called ngrams. “Ngrams” are any sequence of n tokens in a text. They’re an important concept in computational linguistics and NLP. (Have you ever played with Google’s Ngram Viewer?)
Below we’re going to make a list of bigrams, that is, all the two-word combinations from the sample text. We’re going to use these bigrams to find the neighboring words that appear alongside particular keywords.
#Make a list of tokens and POS labels from document if the token is a word
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    # Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        # Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        # Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
    return ngrams
bigrams = get_bigrams(tokens_and_labels)
Let’s take a peek at the bigrams:
bigrams[5:20]
[[('同胞', 'NOUN'), ('世界', 'NOUN')],
[('世界', 'NOUN'), ('上', 'PART')],
[('上', 'PART'), ('最', 'ADV')],
[('最', 'ADV'), ('不平', 'VERB')],
[('不平', 'VERB'), ('的', 'PART')],
[('的', 'PART'), ('事', 'NOUN')],
[('事', 'NOUN'), ('就是', 'ADV')],
[('就是', 'ADV'), ('我们', 'PRON')],
[('我们', 'PRON'), ('二万万', 'PROPN')],
[('二万万', 'PROPN'), ('女', 'ADJ')],
[('女', 'ADJ'), ('同胞', 'NOUN')],
[('同胞', 'NOUN'), ('了', 'PART')],
[('了', 'PART'), ('从', 'ADP')],
[('从', 'ADP'), ('小生', 'NOUN')],
[('小生', 'NOUN'), ('下来', 'VERB')]]
Now that we have our list of bigrams, we're going to make a function `get_neighbor_words()`. This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the `pos_label` parameter.
def get_neighbor_words(keyword, bigrams, pos_label=None):
    neighbor_words = []
    keyword = keyword.lower()
    for bigram in bigrams:
        # Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]
        # Check to see if keyword is in the bigram
        if keyword in words:
            for word, label in bigram:
                # Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    # If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    return Counter(neighbor_words).most_common()
get_neighbor_words("男人", bigrams)
[('了', 3),
('的', 2),
('清白', 1),
('着', 1),
('虽', 1),
('几', 1),
('只', 1),
('使', 1),
('看见', 1),
('得', 1),
('自己', 1)]
get_neighbor_words("男人", bigrams, pos_label='VERB')
[('清白', 1), ('使', 1), ('看见', 1), ('得', 1)]
## Your Turn!

Try out `find_sentences_with_keyword()` and `get_neighbor_words()` with your own keywords of interest.
find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)
get_neighbor_words("YOUR KEY WORD", bigrams, pos_label=None)