Named Entity Recognition for Chinese#
Note
This section, “Working in Languages Beyond English,” is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I’m grateful to Quinn for helping expand this textbook to serve languages beyond English.
In this lesson, we’re going to learn about a text analysis method called Named Entity Recognition (NER) as applied to Chinese. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.
Dataset#
The example text for Chinese is 敬告中国二万万女同胞 by 秋瑾. (Thanks to Paul Vierthaler for selecting and finding the text.)
Here’s a preview of spaCy’s NER tagging of 敬告中国二万万女同胞.
If you compare the results to the English example, you’ll notice that the Chinese NER is considerably worse at recognizing entities, and is especially bad at distinguishing different kinds of entities, like ORG vs. LOC. A model needs a lot of examples to learn to distinguish different entity types; currently, English is the only model that does a decent job of it.
You can read more about the data sources used to train the Chinese model on the spaCy model page.
唉!世界上最不平的事,就是我们 二万万 CARDINAL 女同胞了。从小生下来,遇着好 老子 PERSON ,还说得过;遇着脾气杂冒、不讲情理的,满嘴连说:“晦气,又是一个没用的。”恨不得拿起来摔死。总抱着“将来是别人家的人”这句话,冷一眼、白一眼地看待;没到几岁,也不问好歹,就把 一 CARDINAL 双雪白粉嫩的 天足脚 ORG ,用白布缠着,连睡觉的时候,也不许放松一点,到了后来肉也烂尽了,骨也折断了,不过讨亲戚、朋友、邻居们一声“某人家姑娘脚小”罢了。这还不说,到了择亲的时光,只凭着 两 CARDINAL 个不要脸媒人的话,只要男家有钱有势,不问身家清白,男人的性情好坏、学问高低,就不知不觉应了。到了过门的时候,用一顶红红绿绿的花轿,坐在里面,连气也不能出。到了那边,要是遇着男人虽不怎么样,却还安分,这就算前生有福今生受了。遇着不好的,总不是说“前生作了孽”,就是说“运气不好”。要是说 一二 CARDINAL 句抱怨的话,或是劝了男人几句,反了腔,就打骂俱下;别人听见还要说:“不贤惠,不晓得妇道呢!”诸位听听,这不是有冤没处诉么?还有一桩不公的事:男子死了,女子就要带 三年 DATE 孝,不许二嫁。女子死了,男人只带几根蓝辫线,有嫌难看的,连带也不带;人死还没 三天 DATE ,就出去偷鸡摸狗;七还未尽,新娘子早已进门了。上天生人,男女原没有分别。试问天下没有女人,就生出这些人来么?为什么这样不公道呢?那些男子,天天说“心是公的,待人是要和平的”,又为什么把女子当作 非洲 LOC 的黑奴一样看待。不公不平,直到这步田地呢?
诸位,你要知道天下事靠人是不行的,总要求己为是。当初那些腐儒说什么“男尊女卑”、“女子无才便是德”、“夫为妻纲”这些胡说,我们女子要是有志气的,就应当号召同志与他反对,陈后主兴了这缠足的例子,我们要是有羞耻的,就应当兴师问罪;即不然,难道他捆着我的腿?我不会不缠的么?男子怕我们有知识、有学问、爬上他们的头,不准我们求学,我们难道不会和他分辨,就应了么?这总是我们女子自己放弃责任,样样事体一见男子做了,自己就乐得偷懒, 图安乐 PERSON 。男子说我没用,我就没用;说我不行,只要保着眼前舒服,就作奴隶也不问了。自己又看看无功受禄,恐怕行不长久,一听见男子喜欢脚小,就急急忙忙把它缠了,使男人看见喜欢,庶可以藉此吃白饭。至于不叫我们读书、习字,这更是求之不得的,有甚么不赞成呢?诸位想想,天下有享现成福的么?自然是有学问、有见识、出力作事的男人得了权利,我们作他的奴隶了。既作了他的奴隶,怎么不受压制呢?自作自受,又怎么怨得人呢?这些事情,提起来,我也觉得难过,诸位想想总是个中人,亦不必用我细说。
但是从此以后,我还望我们姐妹们,把从前事情,一概搁开,把以后事情,尽力作去,譬如从前死了,现在又转世为人了。老的呢,不要说“老而无用”,遇见丈夫好的要开学堂,不要阻他;儿子好的,要出洋留学,不要阻他。中年作媳妇的,总不要拖着丈夫的腿,使他气短志颓,功不成、名不就;生了儿子,就要送他进学堂,女儿也是如此,千万不要替他缠足。幼年姑娘的呢,若能够进学堂更好;就不进学堂,在家里也要常看书、习字。有钱作官的呢,就要劝丈夫 开学堂 PERSON 、 兴工厂 NORP ,作那些与百姓有益的事情。无钱的呢,就要帮着丈夫苦作,不要偷懒吃闲饭。这就是我的望头了。诸位晓得国是要亡的了,男人自己也不保,我们还想靠他么?我们自己要不振作,到国亡的时候,那就迟了。诸位!诸位!须不可以打断我的念头才好呢!
NER with spaCy#
If you’ve already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.
Install spaCy#
!pip install -U spacy
Import Libraries#
We’re going to import spacy and displacy, a special spaCy module for visualization.
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400
We’re also going to import the Counter class (from Python’s collections module) for counting people, places, and things, and the pandas library for organizing and displaying data. We’re also changing the pandas default display settings for maximum rows and maximum column width.
Download Language Model#
Next we need to download the Chinese-language model (zh_core_web_md), which will be processing and making predictions about our texts. You can read more about the data sources used to train the Chinese model on the spaCy model page.
!python -m spacy download zh_core_web_md
Collecting zh-core-web-md==3.7.0
Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.7.0/zh_core_web_md-3.7.0-py3-none-any.whl (78.0 MB)
Requirement already satisfied: spacy<3.8.0,>=3.7.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from zh-core-web-md==3.7.0) (3.7.4)
Collecting spacy-pkuseg<0.1.0,>=0.0.27 (from zh-core-web-md==3.7.0)
Obtaining dependency information for spacy-pkuseg<0.1.0,>=0.0.27 from https://files.pythonhosted.org/packages/2d/52/64b4692503d8e920437c1defa0ba4b94fd8a54f252a070119c6c87cbca52/spacy_pkuseg-0.0.33-cp311-cp311-macosx_11_0_arm64.whl.metadata
Downloading spacy_pkuseg-0.0.33-cp311-cp311-macosx_11_0_arm64.whl.metadata (13 kB)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (1.0.10)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2.0.8)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (3.0.9)
Requirement already satisfied: thinc<8.3.0,>=8.2.2 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (8.2.3)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (1.1.2)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2.4.8)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2.0.10)
Requirement already satisfied: weasel<0.4.0,>=0.1.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (0.3.4)
Requirement already satisfied: typer<0.10.0,>=0.3.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (0.9.0)
Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (5.2.1)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (4.65.0)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2.31.0)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (1.10.8)
Requirement already satisfied: jinja2 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (3.1.2)
Requirement already satisfied: setuptools in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (68.0.0)
Requirement already satisfied: packaging>=20.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (23.1)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (3.3.0)
Requirement already satisfied: numpy>=1.19.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (1.24.3)
Requirement already satisfied: typing-extensions>=4.2.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (4.7.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2023.11.17)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (0.7.11)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (0.1.4)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from typer<0.10.0,>=0.3.0->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (8.0.4)
Requirement already satisfied: cloudpathlib<0.17.0,>=0.7.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from weasel<0.4.0,>=0.1.0->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (0.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/melwalsh/anaconda3/lib/python3.11/site-packages (from jinja2->spacy<3.8.0,>=3.7.0->zh-core-web-md==3.7.0) (2.1.1)
Downloading spacy_pkuseg-0.0.33-cp311-cp311-macosx_11_0_arm64.whl (2.4 MB)
Installing collected packages: spacy-pkuseg, zh-core-web-md
Successfully installed spacy-pkuseg-0.0.33 zh-core-web-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_md')
Load Language Model#
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.
1. We can import the model as a module and then load it from the module.
import zh_core_web_md
nlp = zh_core_web_md.load()
2. We can load the model by name.
#nlp = spacy.load('zh_core_web_md')
If you just downloaded the model for the first time, it’s advisable to use Option 1. Then you can use the model immediately. Otherwise, you’ll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).
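Before choosing between the two options, it can help to check whether the model package is visible to the running kernel at all. The sketch below uses only the standard library. `model_is_installed` is a hypothetical helper name, not part of spaCy, and the check is only a rough one: it reports whether a top-level package with that name can be found, not whether the model will load correctly.

```python
import importlib.util


def model_is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# 'zh_core_web_md' is the model package used in this lesson.
if model_is_installed("zh_core_web_md"):
    print("Model package is importable.")
else:
    print("Model package not found; run the download step first.")
```

If the check reports that the package is missing right after a download, restarting the kernel as described above is a reasonable next step.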
Process Document#
We first need to process our document with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.
After processing, the document object will contain tons of juicy language data (named entities, sentence boundaries, parts of speech), and the rest of our work will be devoted to accessing this information.
In the cell below, we open and read the example document. Then we run nlp() on the text and create our document.
filepath = '../texts/zh.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
Get Named Entities#
All the named entities in our document can be found in the document.ents property. If we check out document.ents, we can see all the entities from the example document.
document.ents
(秋瑾, 中国, 二万万, 老子, 一, 天足脚, 两, 一二, 三年, 三天, 非洲, 图安乐, 开学堂, 兴工厂)
Each of the named entities in document.ents contains more information about itself, which we can access by iterating through document.ents with a simple for loop.
For each named_entity in document.ents, we will extract the named_entity and its corresponding named_entity.label_.
for named_entity in document.ents:
    print(named_entity, named_entity.label_)
秋瑾 PERSON
中国 GPE
二万万 CARDINAL
老子 PERSON
一 CARDINAL
天足脚 ORG
两 CARDINAL
一二 CARDINAL
三年 DATE
三天 DATE
非洲 LOC
图安乐 PERSON
开学堂 PERSON
兴工厂 NORP
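To see at a glance which labels the model used, the (text, label) pairs can be grouped by label. This sketch copies the pairs printed above into a plain list so that it runs without spaCy; with a live document, the same list could be built as `[(ent.text, ent.label_) for ent in document.ents]`. The variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entity_pairs = [
    ("秋瑾", "PERSON"), ("中国", "GPE"), ("二万万", "CARDINAL"),
    ("老子", "PERSON"), ("一", "CARDINAL"), ("天足脚", "ORG"),
    ("两", "CARDINAL"), ("一二", "CARDINAL"), ("三年", "DATE"),
    ("三天", "DATE"), ("非洲", "LOC"), ("图安乐", "PERSON"),
    ("开学堂", "PERSON"), ("兴工厂", "NORP"),
]

# Collect the entity texts under each label, keeping their original order.
entities_by_label = defaultdict(list)
for entity_text, entity_label in entity_pairs:
    entities_by_label[entity_label].append(entity_text)

for entity_label, entity_texts in entities_by_label.items():
    print(entity_label, entity_texts)
```

Grouping this way makes the mislabeled items easier to spot, since each label's entities are listed side by side.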
To extract just the named entities that have been identified as PERSON (person), we can add a simple if statement into the mix:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)
秋瑾
老子
图安乐
开学堂
NER with Long Texts or Many Texts#
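Long texts can be slow or memory-intensive to process in one pass, so here we split the text into smaller chunks and process them together with nlp.pipe().
import math

# Split the text into roughly 80 chunks of equal size
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []
for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

# Process all of the chunks as a batch
chunked_documents = list(nlp.pipe(text_chunks))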
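The chunking arithmetic can be checked on a short string without loading a model. This is a sketch of the same slicing logic wrapped in a hypothetical helper, `split_into_chunks`; the sample sentence is taken from the example text. Note that fixed-width slices can cut a word or entity in half at a chunk boundary, which may change what the model recognizes.

```python
import math


def split_into_chunks(full_text, number_of_chunks):
    """Split full_text into at most number_of_chunks slices of equal width."""
    # max(1, ...) avoids a zero step size when full_text is empty.
    chunk_size = max(1, math.ceil(len(full_text) / number_of_chunks))
    return [
        full_text[start:start + chunk_size]
        for start in range(0, len(full_text), chunk_size)
    ]


sample = "上天生人,男女原没有分别。"
chunks = split_into_chunks(sample, 4)
print(chunks)

# Joining the chunks back together should reproduce the original text.
assert "".join(chunks) == sample
```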
Get People#
To extract and count the people, we will use an if statement that will pull out words only if their “ent” label matches “PERSON.”
people = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)

people_tally = Counter(people)
df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
| | character | count |
|---|---|---|
| 0 | 秋瑾 | 1 |
| 1 | 老子 | 1 |
| 2 | 陈 | 1 |
| 3 | 开学堂 | 1 |
Get Places#
To extract and count places, we can follow the same model as above, except we will change our if statement to check for “ent” labels that match “LOC.”
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)
df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df
| | place | count |
|---|---|---|
| 0 | 非洲 | 1 |
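Filtering on “LOC” alone misses 中国, which the model labeled GPE in the output earlier in this lesson. A possible variation, sketched here with plain (text, label) pairs copied from that output rather than a live spaCy document, checks membership in a set of labels. Treating GPE and LOC together as “places” is an assumption of this example, not something the lesson prescribes.

```python
from collections import Counter

# A subset of the (text, label) pairs printed earlier in this lesson.
entity_pairs = [
    ("秋瑾", "PERSON"), ("中国", "GPE"), ("老子", "PERSON"),
    ("三年", "DATE"), ("非洲", "LOC"),
]

# Labels that this example chooses to count as places.
place_labels = {"GPE", "LOC"}
place_names = [
    entity_text
    for entity_text, entity_label in entity_pairs
    if entity_label in place_labels
]

place_tally = Counter(place_names)
print(place_tally.most_common())
```

With a spaCy document, the same idea would amount to replacing the equality check in the if statement with `named_entity.label_ in place_labels`.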
Get NER in Context#
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    # Default to every label the NER component can predict
    if desired_ner_labels is None:
        desired_ner_labels = list(nlp.get_pipe('ner').labels)

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Process each sentence on its own
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity (ignoring capitalization) and the label is wanted
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks and bold the entity; re.escape guards against regex metacharacters
                sentence_text = re.sub('\n', ' ', sentence.text)
                sentence_text = re.sub(re.escape(named_entity.text), f"**{named_entity.text}**", sentence_text, flags=re.IGNORECASE)
                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

for document in chunked_documents:
    get_ner_in_context("非洲", document)
---
LOC
要和平的”,又为什么把女子当作非洲的
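The bolding step in the function above relies on a regular-expression substitution. As a small illustration that runs without spaCy, the hypothetical helper below (`bold_keyword` is not part of any library) escapes the keyword with `re.escape` so that any regex metacharacters in it are matched literally, then wraps each match in Markdown bold markers.

```python
import re


def bold_keyword(sentence_text, keyword):
    """Wrap every occurrence of keyword in Markdown bold markers."""
    pattern = re.escape(keyword)
    # A function replacement keeps the matched text exactly as it appeared.
    return re.sub(
        pattern,
        lambda match: f"**{match.group(0)}**",
        sentence_text,
        flags=re.IGNORECASE,
    )


print(bold_keyword("又为什么把女子当作非洲的黑奴一样看待。", "非洲"))
```

Using a function as the replacement, rather than a formatted string, also avoids surprises if the matched text ever contains a backslash.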