BCCWJ Basic Japanese NE corpus

BCCWJ Basic Japanese Named Entity (NE) corpus (Japanese page)

This is a corpus for Japanese Named Entity Recognition (NER).

The core data of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), 136 documents in total, were annotated with the eight types of NE tags defined by IREX. The NE corpus covers six document genres, including blogs, magazines, and white papers, and contains 2,464 NE tags in total.

How to Use This

(1) Requirements

- BCCWJ
- Perl

(2) Download the following package.

BCCWJ Basic NE corpus (Feb. 1st, 2016 version) (download)

(3) Reproduce the corpus as follows:

(3-1) Untar the downloaded package.

- tar xfz bccwjne-yyyymmdd.tgz

(3-2) Move into the extracted package directory.

- cd bccwjne-yyyymmdd

(3-3) Run the following command, where core_M-XML is the path to the core_M-XML data directory of BCCWJ.

The command creates files ending in ".ne" in the "nedata" directory.

- On Unix:
- % perl tools/gendata.prl -d core_M-XML
- On Windows:
- % perl tools\gendata.prl -w -d core_M-XML
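
The Unix steps (3-1) to (3-3) can be combined into a small script. The following is a minimal sketch, not part of the distributed package: the function names build_ne_corpus and count_ne_files are hypothetical, and it assumes that the archive unpacks into a directory named after the archive file and that the core_M-XML path is given as an absolute path.

```shell
#!/bin/sh
# Sketch only: function names and argument layout are illustrative
# assumptions, not part of the BCCWJ Basic NE corpus package.

# Count the ".ne" files directly inside a directory (default: nedata).
count_ne_files() {
    dir="${1:-nedata}"
    count=0
    for f in "$dir"/*.ne; do
        # If nothing matches, the pattern stays literal, so check existence.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Run steps (3-1) to (3-3) on Unix.
#   $1: path to the downloaded bccwjne-yyyymmdd.tgz
#   $2: absolute path to the BCCWJ core_M-XML directory
build_ne_corpus() {
    archive="$1"
    xml_dir="$2"
    # Assumption: the archive extracts to a directory named after the file.
    pkg_dir="$(basename "$archive" .tgz)"
    tar xfz "$archive" || return 1
    (
        cd "$pkg_dir" || exit 1
        perl tools/gendata.prl -d "$xml_dir" || exit 1
        echo "Generated $(count_ne_files nedata) .ne files"
    )
}
```

After sourcing the script, a call such as `build_ne_corpus bccwjne-yyyymmdd.tgz /path/to/core_M-XML` (both arguments are placeholders) would unpack the package, run gendata.prl, and report how many ".ne" files ended up in "nedata".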

Link

Paper

Constructing a Japanese Basic Named Entity Corpus of Various Genres

@inproceedings{iwakura-etal-2016-constructing,
    title = "Constructing a {J}apanese Basic Named Entity Corpus of Various Genres",
    author = "Iwakura, Tomoya and Komiya, Kanako and Tachibana, Ryuichi",
    booktitle = "Proceedings of the Sixth Named Entity Workshop",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W16-2706",
    doi = "10.18653/v1/W16-2706",
    pages = "41--46",
}
