This is a corpus for Japanese Named Entity Recognition (NER).
The core data (136 documents) of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) were annotated with the eight types of named entity (NE) tags defined by IREX. The NE corpus covers six document genres, including blogs, magazines, and white papers, and contains 2,464 NE tags in total.
(1) Requirements
Perl
(2) Download the following package.
BCCWJ Basic NE corpus (Feb. 1st, 2016 version)
(3) Reproduce the corpus as follows:
(3-1) Untar the downloaded package.
tar xfz bccwjne-yyyymmdd.tgz
(3-2) Move into the extracted package directory.
cd bccwjne-yyyymmdd
(3-3) Run the following command, where core_M-XML is the path to the BCCWJ's core_M-XML data directory.
After the command finishes, files ending in ".ne" are created in the "nedata" directory.
On Unix:
% perl tools/gendata.prl -d core_M-XML
On Windows:
% perl tools\gendata.prl -w -d core_M-XML
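As a quick sanity check after step (3-3), it can help to count the ".ne" files that were generated. The following is a minimal sketch for a Unix shell. The helper name count_ne_files is hypothetical and not part of the package, and the sketch assumes only that the output files end in ".ne" and live under the "nedata" directory, as described above.

```shell
# count_ne_files: hypothetical helper, not shipped with the package.
# Prints the number of files ending in ".ne" under the given directory
# (defaults to "nedata", the output directory named in step (3-3)).
count_ne_files() {
    dir="${1:-nedata}"
    # find prints one path per matching regular file; wc -l counts the lines.
    # tr strips the padding that some wc implementations add.
    find "$dir" -type f -name '*.ne' | wc -l | tr -d ' '
}
```

Running count_ne_files nedata from the package directory prints the number of generated files. If the tool writes one ".ne" file per document (an assumption, not stated above), the count should match the 136 annotated documents.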
Paper
@inproceedings{iwakura-etal-2016-constructing,
    title = "Constructing a {J}apanese Basic Named Entity Corpus of Various Genres",
    author = "Iwakura, Tomoya and Komiya, Kanako and Tachibana, Ryuichi",
    booktitle = "Proceedings of the Sixth Named Entity Workshop",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W16-2706",
    doi = "10.18653/v1/W16-2706",
    pages = "41--46",
}