Named Entity Recognition for Chinese

Note

This section, “Working in Languages Beyond English,” is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I’m grateful to Quinn for helping expand this textbook to serve languages beyond English.

In this lesson, we’re going to learn about a text analysis method called Named Entity Recognition (NER) as applied to Chinese. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.


Dataset

The example text for Chinese is 敬告中国二万万女同胞 by 秋瑾. (Thanks to Paul Vierthaler for selecting and finding the text.)

Here’s a preview of spaCy’s NER tagging of 敬告中国二万万女同胞.

If you compare the results to the English example, you’ll notice that the Chinese NER model recognizes entities far less reliably, and is especially bad at distinguishing different kinds of entities, like ORG vs. LOC. In the preview below, for example, 二万万 (“two hundred million”) is tagged PERCENT, and the verb phrase 进学堂 (“enter school”) is tagged LOC. You need a lot of examples to train a model to distinguish different entity types; currently, the English model is the only one that does a decent job of it.

You can read more about the data sources used to train Chinese on the spaCy model page.

秋瑾《敬告 中国 GPE 二万万女同胞》

唉!世界上最不平的事,就是我们 二万万 PERCENT 女同胞了。从小生下来,遇着好老子,还说得过 ;遇 PERSON 着脾气 杂冒 WORK_OF_ART 、不讲情理的,满嘴连说:“晦气,又是一个没用的。”恨不得拿起来摔死。总抱着“将来是别人家的人”这句话,冷一眼、白一眼地看待;没到 几岁 DATE ,也不问好歹,就把一双雪白粉嫩的天足脚,用白布缠着,连睡觉的时候,也不许放松一点,到了后来肉也烂尽了,骨也折断了,不过讨亲戚、朋友、邻居们一声“某人家姑娘脚小”罢了。这还不说,到了择亲的时光,只凭着 两 CARDINAL 个不要脸媒人的话,只要男家有钱有势,不问身家清白,男人的性情好坏、学问高低,就不知不觉应了。到了过门的时候,用一顶红红绿绿的花轿,坐在里面,连气也不能出。到了那边,要是遇着男人虽不怎么样,却还安分,这就算前生有福今生受了。遇着不好的,总不是说“前生作了孽”,就是说“运气不好”。要是说 一二 CARDINAL 句抱怨的话,或是劝了男人几句,反了腔,就打骂俱下;别人听见还要说:“不贤惠,不晓得妇道呢!”诸位听听,这不是有冤没处诉么?还有一桩不公的事:男子死了,女子就要带 三年 DATE 孝,不许二嫁。女子死了,男人只带几根蓝辫线,有嫌难看的,连带也不带;人死还没 三天 DATE ,就出去偷鸡摸狗;七还未尽,新娘子早已进门了。上天生人,男女原没有分别。试问天下没有女人,就生出这些人来么?为什么这样不公道呢?那些男子,天天说“心是公的,待人是要和平的”,又为什么把女子当作 非洲 LOC 的黑奴一样看待。不公不平,直到这步田地呢?
  诸位,你要知道天下事靠人是不行的,总要求己为是。当初那些腐儒说什么“男尊女卑”、“女子无才便是德”、“夫为妻纲”这些胡说,我们女子要是有志气的,就应当号召同志与他反对, 陈后 PERSON 主兴了这缠足的例子,我们要是有羞耻的,就应当兴师问罪;即不然,难道他捆着我的腿?我不会不缠的么?男子怕我们有知识、有学问、爬上他们的头,不准我们求学,我们难道不会和他分辨,就应了么?这总是我们女子自己放弃责任,样样事体一见男子做了,自己就乐得偷懒, 图安乐 ORG 。男子说我没用,我就没用;说我不行,只要保着眼前舒服,就作奴隶也不问了。自己又看看无功受禄,恐怕行不长久,一听见男子喜欢脚小,就急急忙忙把它缠了,使男人看见喜欢,庶可以藉此吃白饭。至于不叫我们读书、习字,这更是求之不得的,有甚么不赞成呢?诸位想想,天下有享现成福的么?自然是有学问、有见识、出力作事的男人得了权利,我们作他的奴隶了。既作了他的奴隶,怎么不受压制呢?自作自受,又怎么怨得人呢?这些事情,提起来,我也觉得难过,诸位想想总是个中人,亦不必用我细说。
  但是从此以后,我还望我们姐妹们,把从前事情,一概搁开,把以后事情,尽力作去,譬如从前死了,现在又转世为人了。老的呢,不要说“老而无用”,遇见丈夫好的要开学堂,不要阻他;儿子好的,要出洋留学,不要阻他。中年作媳妇的,总不要拖着丈夫的腿,使他气短志颓,功不成、名不就;生了儿子,就要送他 进学堂 LOC ,女儿也是如此,千万不要替他缠足。幼年姑娘的呢,若能够 进学堂 LOC 更好;就不进学堂,在家里也要常看书、习字。有钱作官的呢,就要劝丈夫 开学堂 ORG 、兴工厂,作那些与百姓有益的事情。无钱的呢,就要帮着丈夫 苦作 PERSON ,不要偷懒吃闲饭。这就是我的望头了。诸位晓得国是要亡的了,男人自己也不保,我们还想靠他么?我们自己要不振作,到国亡的时候,那就迟了。诸位!诸位!须不可以打断我的念头才好呢!

NER with spaCy

If you’ve already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

Install spaCy

!pip install -U spacy

Import Libraries

We’re going to import spacy and displacy, a special spaCy module for visualization.

import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We’re also importing the Counter class from the collections module for counting people, places, and things, and the pandas library for organizing and displaying data. The two pd.options lines raise the pandas display defaults for the maximum number of rows and the maximum column width.

Download Language Model

Next we need to download the Chinese-language model (zh_core_web_md), which will be processing and making predictions about our texts. You can read more about the data sources used to train Chinese on the spaCy model page.

!python -m spacy download zh_core_web_md

Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

import zh_core_web_md
nlp = zh_core_web_md.load()

2. We can load the model by name.

#nlp = spacy.load('zh_core_web_md')

If you just downloaded the model for the first time, it’s advisable to use Option 1, which lets you use the model immediately. With Option 2, you’ll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).
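
If you are unsure whether the download succeeded, you can check whether Python can find the model package before trying to load it. The sketch below uses only the standard library. The helper name model_is_importable is our own, not part of spaCy, and it assumes the model was installed as a regular Python package under the name zh_core_web_md.

```python
import importlib
import importlib.util

def model_is_importable(model_name):
    """Return True if a package called `model_name` can be found by Python."""
    # Refresh import caches so a package installed moments ago is noticed
    importlib.invalidate_caches()
    return importlib.util.find_spec(model_name) is not None

# Hypothetical usage: report whether the Chinese model package can be found
if model_is_importable("zh_core_web_md"):
    print("Model package found; loading it should work.")
else:
    print("Model package not found; re-run the download cell or restart the kernel.")
```

If the check still fails right after a successful download, restarting the kernel is the usual next step.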

Process Document

We first need to process our document with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run nlp() on the text and create our document.

filepath = '../texts/zh.txt'
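# Open the file in a with block so that it is closed automatically after reading
with open(filepath, encoding='utf-8') as file:
    text = file.read()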
document = nlp(text)

Get Named Entities

All the named entities in our document can be found in the document.ents property. If we check out document.ents, we can see all the entities from the example document.

document.ents
(中国, 二万万, ;遇, 杂冒, 几岁, 两, 一二, 三年, 三天, 非洲, 陈后, 图安乐, 进学堂, 进学堂, 开学堂, 苦作)

Each of the named entities in document.ents contains more information about itself, which we can access by iterating through the document.ents with a simple for loop.

For each named_entity in document.ents, we will extract the named_entity and its corresponding named_entity.label_.

for named_entity in document.ents:
    print(named_entity, named_entity.label_)
中国 GPE
二万万 PERCENT
;遇 PERSON
杂冒 WORK_OF_ART
几岁 DATE
两 CARDINAL
一二 CARDINAL
三年 DATE
三天 DATE
非洲 LOC
陈后 PERSON
图安乐 ORG
进学堂 LOC
进学堂 LOC
开学堂 ORG
苦作 PERSON

To extract just the named entities that have been identified as PERSON (person), we can add a simple if statement into the mix:

for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)
;遇
陈后
苦作
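
Because every entity carries a label, the same filtering idea extends to grouping all of the entities by type at once. Here is a minimal sketch, assuming the (text, label) pairs are typed in by hand from the output above rather than read from a live spaCy document. The variable names are our own.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above. With a real spaCy document
# you could build the same list with:
#     [(ent.text, ent.label_) for ent in document.ents]
entity_pairs = [
    ("中国", "GPE"), ("二万万", "PERCENT"), (";遇", "PERSON"),
    ("杂冒", "WORK_OF_ART"), ("几岁", "DATE"), ("两", "CARDINAL"),
    ("一二", "CARDINAL"), ("三年", "DATE"), ("三天", "DATE"),
    ("非洲", "LOC"), ("陈后", "PERSON"), ("图安乐", "ORG"),
    ("进学堂", "LOC"), ("进学堂", "LOC"), ("开学堂", "ORG"), ("苦作", "PERSON"),
]

# Collect the entity texts under their label, keeping document order
entities_by_label = defaultdict(list)
for entity_text, label in entity_pairs:
    entities_by_label[label].append(entity_text)

for label, entity_texts in sorted(entities_by_label.items()):
    print(label, entity_texts)
```

Seeing all the labels side by side makes it easier to spot which categories the model is getting wrong for this text.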

NER with Long Texts or Many Texts

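When a text is very long, or when you have many texts, you can split the material into smaller chunks and process them as a batch with nlp.pipe(), which is more efficient than calling nlp() on each chunk separately. Below, we split the example text into 80 chunks of roughly equal length.

import math

# Split the text into roughly equal chunks of characters
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []
for number in range(0, len(text), chunk_size):
    # Slice out the next chunk_size characters
    # (chunks are cut by character count, so a chunk can end mid-sentence or mid-entity)
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

# Process all the chunks as a batch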
chunked_documents = list(nlp.pipe(text_chunks))
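
Cutting a text by character count can split a sentence, or even an entity, across two chunks. The sketch below contrasts that approach with a sentence-aware alternative that only breaks after sentence-final punctuation. It is an illustration under our own assumptions, not part of spaCy: the function names, the sample string, and the max_chars value are all hypothetical.

```python
import math
import re

def chunk_by_characters(text, number_of_chunks):
    """Cut text into pieces of equal character length (the approach used above)."""
    chunk_size = math.ceil(len(text) / number_of_chunks)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def chunk_by_sentences(text, max_chars):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split right after sentence-final punctuation, keeping the punctuation
    sentences = [s for s in re.split(r"(?<=[。!?!?])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the limit
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

# Hypothetical sample adapted from the example text
sample = "生了儿子,就要送他进学堂。女儿也是如此!千万不要替他缠足。"

char_chunks = chunk_by_characters(sample, 3)
sentence_chunks = chunk_by_sentences(sample, 15)

print(char_chunks)      # the first cut falls inside 进学堂
print(sentence_chunks)  # each chunk ends at a sentence boundary
```

Either list of chunks can be passed to nlp.pipe() in the same way as text_chunks. The sentence-aware version trades evenly sized chunks for intact sentences.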

Get People

To extract and count the people, we will use an if statement that will pull out words only if their “ent” label matches “PERSON.”

people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
character count
0 ;遇 1
1 1
2 苦作 1

Get Places

To extract and count places, we can follow the same model as above, except we will change our if statement to check for “ent” labels that match “LOC.”

places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df
place count
0 进学堂 2
1 非洲 1
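
The people and places cells differ only in the label they check, so the pattern can be wrapped in one helper that accepts several labels at once. This matters for places: the preview above shows 中国 tagged as GPE rather than LOC, so a count that checks only LOC will miss it. The sketch below is a hedged illustration. The names tally_entities and fake_doc are our own, and the demo uses stand-in objects that only mimic the .ents, .text, and .label_ attributes of spaCy documents.

```python
from collections import Counter
from types import SimpleNamespace

def tally_entities(documents, labels):
    """Count entity texts whose label is in `labels` across many documents."""
    wanted = set(labels)
    return Counter(
        named_entity.text
        for document in documents
        for named_entity in document.ents
        if named_entity.label_ in wanted
    )

def fake_doc(*pairs):
    """Build a stand-in for a spaCy document from (text, label) pairs (demo only)."""
    return SimpleNamespace(
        ents=[SimpleNamespace(text=text, label_=label) for text, label in pairs]
    )

# Stand-in documents using entities from the output above
demo_documents = [
    fake_doc(("中国", "GPE"), ("非洲", "LOC")),
    fake_doc(("进学堂", "LOC"), ("进学堂", "LOC"), ("苦作", "PERSON")),
]

# Count both geopolitical entities and other locations as places
place_tally = tally_entities(demo_documents, ["GPE", "LOC"])
print(place_tally.most_common())
```

With real data, the same call would look like tally_entities(chunked_documents, ["GPE", "LOC"]), and the result can be passed to pd.DataFrame just like places_tally above.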

Get NER in Context

To see each entity in its surrounding sentence, the function below displays every sentence that contains a matching entity, with the entity in bold.

from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    # Default to people, organizations, and locations when no labels are given
    if desired_ner_labels is None:
        desired_ner_labels = ['PERSON', 'ORG', 'LOC']

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity (ignoring capitalization) and the label is one we want
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces
                sentence_text = re.sub('\n', ' ', sentence.text)
                # Bold the entity, escaping it so that punctuation is not read as regex syntax
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

for document in chunked_documents:
    get_ner_in_context('进学堂', document)

LOC

成、名不就;生了儿子,就要送他进学堂


LOC

年姑娘的呢,若能够进学堂更好;就不进
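
The bolding step relies on a regular-expression substitution, which can misfire when an entity contains characters that have a special meaning in regex syntax, such as parentheses. Here is a minimal, standalone sketch of that step. The helper name bold_entity is our own, and the second example string is invented for illustration.

```python
import re

def bold_entity(sentence_text, entity_text):
    """Wrap every occurrence of entity_text in Markdown bold markers."""
    # Escape the entity so characters like ( ) . ? are matched literally
    pattern = re.escape(entity_text)
    # Use a function as the replacement so the matched text is reused as-is
    return re.sub(
        pattern,
        lambda match: f"**{match.group(0)}**",
        sentence_text,
        flags=re.IGNORECASE,
    )

print(bold_entity("就要送他进学堂", "进学堂"))
print(bold_entity("Price (USD) rose", "(usd)"))
```

Without re.escape, the parentheses in the second example would be read as a regex group rather than as literal characters, and the original parentheses would be left outside the bold markers.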