A pretrained Korean-specific BERT model that incorporates sentiment features to perform better on sentiment-related tasks, developed by the Computational Linguistics Lab at Seoul National University.
It is based on our character-level KR-BERT models, which use the WordPiece and BidirectionalWordPiece tokenizers.
We use the predefined sentiment lexicon of the Korean Sentiment Analysis Corpus (KOSAC) to construct sentiment features. The corpus contains 17,582 annotated sentiment expressions drawn from 332 documents and 7,744 sentences taken from the Sejong Corpus and news articles. Each sentiment expression is annotated with values such as subjectivity, polarity, intensity, and manner of expression.
Among the KOSAC annotations, we use the polarity and intensity values in our models. There are five polarity classes: None (no polarity value), POS (positive), NEUT (neutral), NEG (negative), and COMP (complex).
The four intensity classes are None (no intensity value), High, Medium, and Low; they indicate how strong the sentiment of a token is.
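As a concrete illustration, the sketch below maps these polarity and intensity labels to per-token feature IDs, analogous to how BERT maps tokens to vocabulary IDs. This is not the repository's actual code; the lexicon format and the function name `sentiment_feature_ids` are hypothetical.

```python
# Label inventories as described above.
POLARITY_LABELS = ["None", "POS", "NEUT", "NEG", "COMP"]
INTENSITY_LABELS = ["None", "High", "Medium", "Low"]

POLARITY_TO_ID = {label: i for i, label in enumerate(POLARITY_LABELS)}
INTENSITY_TO_ID = {label: i for i, label in enumerate(INTENSITY_LABELS)}

def sentiment_feature_ids(tokens, lexicon):
    """Look up each token in a sentiment lexicon (here a plain dict
    mapping a token to a (polarity, intensity) label pair) and return
    parallel lists of polarity and intensity IDs. Tokens not covered
    by the lexicon fall back to the "None" labels."""
    polarity_ids, intensity_ids = [], []
    for token in tokens:
        polarity, intensity = lexicon.get(token, ("None", "None"))
        polarity_ids.append(POLARITY_TO_ID[polarity])
        intensity_ids.append(INTENSITY_TO_ID[intensity])
    return polarity_ids, intensity_ids
```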
The polarity and intensity embeddings are simply added to BERT's token, position, and segment embeddings, and the whole model is trained just like the original BERT.
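The following is a minimal PyTorch sketch of that embedding sum, for illustration only; it is not the repository's implementation, and the class name and default hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class SentimentBertEmbeddings(nn.Module):
    """BERT-style input embeddings extended with KOSAC polarity and
    intensity embeddings. All five tables are summed element-wise, so
    the sentiment features are trained jointly with the token,
    position, and segment embeddings."""

    def __init__(self, vocab_size, hidden_size=768, max_position=512,
                 num_segments=2, num_polarity=5, num_intensity=4):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.position = nn.Embedding(max_position, hidden_size)
        self.segment = nn.Embedding(num_segments, hidden_size)
        self.polarity = nn.Embedding(num_polarity, hidden_size)    # None/POS/NEUT/NEG/COMP
        self.intensity = nn.Embedding(num_intensity, hidden_size)  # None/High/Medium/Low
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids, polarity_ids, intensity_ids):
        # Position indices 0..seq_len-1, broadcast over the batch.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        embeddings = (self.token(token_ids)
                      + self.position(positions)
                      + self.segment(segment_ids)
                      + self.polarity(polarity_ids)
                      + self.intensity(intensity_ids))
        return self.dropout(self.norm(embeddings))
```

Because all five embedding tables share the hidden size and are summed element-wise, the Transformer layers above need no changes; the sentiment tables simply receive gradients like any other embedding.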
A model using the WordPiece tokenizer (download)
A model using the BidirectionalWordPiece tokenizer (download)
Pass bert as the tokenizer argument to use the original BERT WordPiece tokenizer, or pass ranked to use our BidirectionalWordPiece tokenizer.
Download the model checkpoint and pass its path as init_checkpoint.
Download the NSMC data and pass its path as data_dir.
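Putting these settings together, a fine-tuning run might look like the following. The script name run_classifier.py and the output_dir flag are assumptions modeled on the standard BERT fine-tuning scripts; tokenizer, init_checkpoint, and data_dir are the arguments described above.

```sh
python run_classifier.py \
    --tokenizer=ranked \
    --init_checkpoint=/path/to/downloaded/checkpoint \
    --data_dir=/path/to/nsmc \
    --output_dir=/path/to/output
```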
Details: https://github.com/snunlp/KR-KOSAC-BERT