Web NER ToolKit

Project Leader: Prof. Chia-Hui Chang

Team members: Chien-Lung Chou, Yuan-Hao Lin, Kuo-Chun Chien, Ya-Yun Huang

Abstract

Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Most current NER models are trained on journalistic documents such as news articles to recognize person, location, and organization names. Because such models are trained on formal documents, their performance drops on Web documents, which are less structured and contain noise. Users who want to recognize named entities in Web documents therefore have to train a new model, which is labor-intensive and time-consuming. The preparatory work includes collecting a large set of training data, labeling named entities, choosing an appropriate segmentation, unifying symbols, normalizing text, designing features, preparing dictionaries, and so on, and all of it must be repeated for each new language or entity type. In this research, we propose an NER model generation tool for effective Web entity extraction. It uses a semi-supervised learning approach based on automatic labeling and tri-training, which makes use of unlabeled data and structured resources containing known named entities. Experiments confirm that the tool can be applied to different languages and to various types of named entities.

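In the past, named entity recognition (NER) research focused mainly on person, location, and organization names in formal text such as news reports, and paid comparatively little attention to informal text on the Web. As a result, existing recognition models perform poorly on Web page content, and a model has to be retrained whenever named entities in Web pages need to be recognized. Training a model, however, is very costly in time and labor: it involves preparing a large amount of training data in advance and collecting and labeling answers by hand. To improve recognition performance, the data must also be segmented appropriately, symbols unified, and text normalized, features must be designed, and a dictionary of known keywords prepared. This work is tedious and complex, and it must be repeated for each language or recognition topic. This thesis aims to reduce that effort by automatically labeling training data from the search results of given known entity names and combining this with tri-training, a semi-supervised method, to generate NER models. Experiments confirm that the tool can be applied to named entity recognition in different languages and for different entity types: performance reaches 86.1% on Chinese organization names, 80.3% on Japanese organization names, 83.2% on English organization names, and 84.5% on Chinese location names, a different topic. For longer named entities such as Chinese and English addresses, performance reaches 97.2% and 94.8%, respectively.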
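
As a rough illustration of the two ideas named in the abstract, automatic labeling and tri-training, the following Python sketch first tags tokens with B/I/O labels by matching a list of known entity names, then shows the agreement step of tri-training, in which an unlabeled example is handed to one classifier only when the other two agree on its label. It is not the DS4NER implementation: the function names `auto_label` and `agreed_pseudo_labels`, the exact-match rule, the example-level (rather than token-level) classifiers, and the toy data are all assumptions made for illustration, and the error-rate checks of full tri-training are omitted.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a classifier maps one text example to one label.
Classifier = Callable[[str], str]


def auto_label(tokens: Sequence[str],
               known_entities: Sequence[Sequence[str]]) -> List[str]:
    """Tag tokens with B/I/O by exact match against known entity names."""
    labels = ["O"] * len(tokens)
    # Try longer names first so a long name is not split by a shorter one.
    for entity in sorted(known_entities, key=len, reverse=True):
        n = len(entity)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window = list(tokens[start:start + n])
            # Only label spans that no earlier (longer) match has claimed.
            free = all(label == "O" for label in labels[start:start + n])
            if free and window == list(entity):
                labels[start] = "B"
                labels[start + 1:start + n] = ["I"] * (n - 1)
    return labels


def agreed_pseudo_labels(target: int,
                         classifiers: Sequence[Classifier],
                         unlabeled: Sequence[str]) -> List[Tuple[str, str]]:
    """Collect examples on which the two classifiers other than `target` agree.

    In tri-training these pairs would be added to the training set of
    classifier `target` before it is retrained.
    """
    if len(classifiers) != 3:
        raise ValueError("tri-training uses exactly three classifiers")
    first, second = [c for i, c in enumerate(classifiers) if i != target]
    picked = []
    for example in unlabeled:
        label_a, label_b = first(example), second(example)
        if label_a == label_b:
            picked.append((example, label_a))
    return picked


# Toy data (assumed, not from the project).
tokens = "Acme Corp opened an office near Acme".split()
bio = auto_label(tokens, [["Acme", "Corp"]])

toy_classifiers = [
    lambda s: "ORG" if s.endswith("Corp") else "O",
    lambda s: "ORG" if "Corp" in s else "O",
    lambda s: "ORG" if s[:1].isupper() else "O",
]
new_for_first = agreed_pseudo_labels(0, toy_classifiers,
                                     ["Acme Corp", "Paris", "hello"])
```

Here `bio` marks only the full name "Acme Corp" as an entity, and `new_for_first` keeps "Acme Corp" and "hello" while dropping "Paris", on which the second and third toy classifiers disagree. A real pipeline would replace the toy classifiers with trained sequence labelers and apply a selection criterion before accepting pseudo-labels.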

DS4NER Package Download

Publication

  • Chien-Lung Chou, Chia-Hui Chang, Yuan-Hao Lin, Kuo-Chun Chien: On the Construction of Web NER Model Training Tool based on Distant Supervision, Transactions on Asian and Low-Resource Language Information Processing, 2020 (minor revision).

  • Kuo-Chun Chien and Chia-Hui Chang: Leveraging Memory-Enhanced Conditional Random Fields with Convolutional and Automatic Lexical Features for Chinese Named Entity Recognition. In International Journal of Computational Linguistics & Chinese Language Processing, Vol.24. 1–14. http://www.aclclp.org.tw/clclp/v24n1/v24n1a1.pdf

  • Chien-Lung Chou, Chia-Hui Chang, Ya-Yun Huang: Boosted Web Named Entities Recognition via Tri-Training, Transactions on Asian and Low-Resource Language Information Processing, Volume 16 Issue 2, November 2016.

  • Ya-Yun Huang, Chia-Hui Chang, Chien-Lung Chou: A Tool for Web NER Model Generation Using Search Snippets of Known Entities (基於已知名稱搜尋結果的網路實體辨識模型建立工具), ROCLING 2015.

  • Chien-Lung Chou, Chia-Hui Chang: Named Entity Extraction via Automatic Labeling and Tri-training: Comparison of Selection Methods. AIRS 2014: 244-25.

Datasets