KCDD Dataset
This page hosts the dataset and benchmark code for review.
Dataset Card
Data Summary
This paper presents the Korean Crime Dialogue Dataset (KCDD), the first Korean dialogue dataset for classifying violence that occurs in offline settings. KCDD contains 22,249 dialogues covering four criminal classes that follow the international legal standard (ICCS), plus one clean class: Serious Threats, Extortion or Blackmail, Harassment in the Workplace, Other Harassment, and Clean Dialogue.
Download Link : KCDD
Languages : Korean
Dataset Splits : Train (17,799), Valid (2,225), Test (2,225)
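The split sizes above correspond to an 80/10/10 partition of the 22,249 dialogues, which can be checked directly:

```python
# Verify that the reported KCDD splits form an 80/10/10 partition.
train, valid, test = 17_799, 2_225, 2_225
total = train + valid + test

assert total == 22_249
print(round(train / total, 2))  # 0.8
print(round(valid / total, 2))  # 0.1
```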
Dataset Example & Components:
id : Unique number for each data
Text : The data takes the form of a conversation; each dialogue consists of at least 5 turns and at least 2 speakers. Each speaker is distinguished by a unique alphabet letter.
Label : Serious Threats(020121), Extortion or Blackmail(02051), Harassment in the Workplace(20811), Other Harassment(020819), and Clean Dialogue(000001)
Speaker_label : Each speaker is labeled with a speaker type: n (normal person), g (perpetrator), p (victim)
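Assuming the dataset is distributed as a JSON file whose records carry the fields listed above (the exact file format and field names in the download may differ; this is a hypothetical layout), a minimal loading sketch:

```python
import json

# Label codes from the dataset card, mapped to class names.
LABELS = {
    "020121": "Serious Threats",
    "02051": "Extortion or Blackmail",
    "20811": "Harassment in the Workplace",
    "020819": "Other Harassment",
    "000001": "Clean Dialogue",
}

def load_kcdd(path):
    """Load KCDD records and map label codes to class names.

    Assumes a JSON array of records with the fields id, text,
    label, and speaker_label (hypothetical layout).
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [
        {
            "id": rec["id"],
            "text": rec["text"],                # multi-turn dialogue, >= 5 turns
            "label": LABELS[rec["label"]],
            "speakers": rec["speaker_label"],   # n / g / p per speaker
        }
        for rec in data
    ]
```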
KCDD is released under the CC BY-NC 4.0 license for noncommercial use.
Benchmark Description
We propose Relationship-Aware BERT (RABERT) as a strong baseline for the proposed dataset. The model shows that understanding the varying relationships among interlocutors improves crime dialogue classification performance.
Download Link (Code) : RABERT
How to Use
1) Requirements
pip install -r requirements.txt
2) Train & Eval
You can adjust the hyper-parameters via the config file inside the config folder.
When you run the code, it automatically evaluates each model on the dev and test sets.
The trained models and evaluation results are saved in the ./ckpt folder.
python3 Relationship_aware_BERT.py --config_file Relationship_aware_BERT.json
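The exact keys in Relationship_aware_BERT.json depend on the released code; a config of this kind might look like the following (all field names here are hypothetical — check the file shipped in the config folder):

```json
{
    "model_name": "bert-base-multilingual-cased",
    "max_seq_length": 256,
    "batch_size": 32,
    "learning_rate": 2e-5,
    "num_epochs": 5,
    "ckpt_dir": "./ckpt"
}
```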