A Korean-specific, small-scale BERT model with comparable or better performance, developed by the Computational Linguistics Lab at Seoul National University.
Korean text is mainly written in Hangul syllable characters, which can be decomposed into sub-characters, or graphemes. To accommodate this characteristic, we trained a new vocabulary and BERT model on two different representations of the corpus: syllable characters and sub-characters.
If you use our sub-character model, you should preprocess your data with the code below.
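A minimal sketch of that preprocessing, assuming the sub-character (grapheme) decomposition corresponds to Unicode NFKD normalization of Hangul syllables:

```python
from unicodedata import normalize

def to_subchar(string):
    # Decompose precomposed Hangul syllables into their component
    # graphemes (jamo) using Unicode NFKD normalization.
    return normalize('NFKD', string)

# The decomposed string is what the sub-character vocabulary
# and tokenizer expect as input.
sentence = '토크나이저 예시입니다.'
print(to_subchar(sentence))
```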
BidirectionalWordPiece Tokenizer
We use the BidirectionalWordPiece model to reduce search costs while maintaining the possibility of choice. It applies BPE in both forward and backward directions to obtain two candidates and chooses the one with the higher frequency.
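As an illustration only (not the released tokenizer), the selection step can be sketched as follows. The sketch assumes greedy longest-match segmentation run left-to-right and right-to-left, omits WordPiece '##' continuation markers, and uses a hypothetical vocab_freq dictionary mapping vocabulary pieces to corpus frequencies.

```python
def greedy_segment(word, vocab, backward=False):
    """Greedy longest-match WordPiece-style segmentation.
    Scans left-to-right, or right-to-left when backward=True.
    Continuation markers ('##') are omitted for simplicity."""
    pieces, text = [], word
    while text:
        if backward:
            # longest suffix of the remaining text found in the vocabulary
            match = next((text[i:] for i in range(len(text)) if text[i:] in vocab), None)
        else:
            # longest prefix of the remaining text found in the vocabulary
            match = next((text[:i] for i in range(len(text), 0, -1) if text[:i] in vocab), None)
        if match is None:
            return None  # the word cannot be segmented with this vocabulary
        if backward:
            pieces.insert(0, match)
            text = text[:-len(match)]
        else:
            pieces.append(match)
            text = text[len(match):]
    return pieces

def bidirectional_wordpiece(word, vocab_freq):
    """Segment in both directions and keep the candidate whose pieces
    are more frequent in the training corpus."""
    vocab = set(vocab_freq)
    candidates = [c for c in (greedy_segment(word, vocab),
                              greedy_segment(word, vocab, backward=True)) if c]
    if not candidates:
        return None
    return max(candidates, key=lambda c: sum(vocab_freq[p] for p in c))

# Hypothetical toy vocabulary with corpus frequencies.
vocab_freq = {'ab': 3, 'c': 5, 'a': 2, 'bc': 10}
print(bidirectional_wordpiece('abc', vocab_freq))  # ['a', 'bc']: backward wins, 2 + 10 > 3 + 5
```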
tensorflow

- BERT tokenizer, character model (download)
- BidirectionalWordPiece tokenizer, character model (download)
- BERT tokenizer, sub-character model (download)
- BidirectionalWordPiece tokenizer, sub-character model (download)

pytorch

- BERT tokenizer, character model (download)
- BidirectionalWordPiece tokenizer, character model (download)
- BERT tokenizer, sub-character model (download)
- BidirectionalWordPiece tokenizer, sub-character model (download)
transformers == 2.1.1
tensorflow < 2.0
To use the sub-character version of our models, set the subchar argument to True.
You can use the original BERT WordPiece tokenizer by passing bert as the tokenizer argument, or our BidirectionalWordPiece tokenizer by passing ranked.
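For example, a run might look like the following; the script name train.py is only a placeholder for whichever entry point you use in the TensorFlow or PyTorch directory, while the subchar and tokenizer arguments work as described above.

```
python train.py --subchar True --tokenizer ranked   # sub-character model, BidirectionalWordPiece tokenizer
python train.py --subchar False --tokenizer bert    # character model, original BERT WordPiece tokenizer
```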
tensorflow: After downloading our pretrained models, put them in a models directory in the krbert_tensorflow directory.
pytorch: After downloading our pretrained models, put them in a pretrained directory in the krbert_pytorch directory.
Details: https://github.com/snunlp/KR-BERT