Publication
Publications cover all aspects of computational approaches to natural language and music, including but not limited to:
Sentence modeling and classification
Intention identification and slot-filling
Abstractive summarization of documents
Contextual spacing and word segmentation
Figurative language analysis
Automatic music generation
International: Journal, Proceedings & Preprint
Prosody, Speech Act, Intent Argument
Y. Song and W. I. Cho, "Study on the domain adaption of Korean speech act using daily conversation dataset and petition corpus," in Proc. NLP4DH & IWCLUL, Dec. 2023, pp. 229-234. (Oral only)
W. I. Cho and N. S. Kim, "Text implicates prosodic ambiguity: A corpus for intention identification of the Korean spoken language," ACM TALLIP, vol. 22, no. 1, Nov. 2022.
W. I. Cho, S. Moon, J. I. Kim, S. M. Kim, and N. S. Kim, "StyleKQC: A style-variant paraphrase corpus for Korean questions and commands," in Proc. LREC, Jun. 2022, pp. 7122-7128. [arXiv] [Github] (Poster, Virtual)
W. I. Cho, Y. K. Moon*, S. Moon*, S. M. Kim, and N. S. Kim, "Machines getting with the program: Understanding intent arguments of non-canonical directives," in Findings of ACL: EMNLP 2020, Nov. 2020, pp. 329–339. [arXiv] [Github]
W. I. Cho, J. Cho, W. H. Kang, and N. S. Kim, "Text matters but speech influences: A computational analysis of syntactic ambiguity resolution," in Proc. CogSci, Jul. 2020, pp. 1953-1959. [arXiv] [Slide] [Github] (Poster, Virtual)
W. I. Cho, J. I. Kim, Y. K. Moon, and N. S. Kim, "Discourse component to sentence (DC2S): An efficient human-aided construction of paraphrase and sentence similarity dataset," in Proc. LREC, May 2020, pp. 6819-6826. [Github] (Canceled due to COVID-19 outbreak)
W. I. Cho, Y. K. Moon, and N. S. Kim, "Discourse component-based argument extraction of Seoul Korean directives," in Proc. JK 27, Oct. 2019. [Abstract] [Slide] [Github] (Poster)
W. I. Cho*, J. Cho*, J. Kang, and N. S. Kim, "Prosody-semantics interface in Seoul Korean: Corpus for a disambiguation of wh- intervention," in Proc. ICPhS, Aug. 2019, pp. 3902-3906. [Slide] [Github] (Poster)
W. I. Cho, H. S. Lee, J. W. Yoon, S. M. Kim, and N. S. Kim, "Speech intention understanding in a head-final language: A disambiguation utilizing intonation-dependency," arXiv preprint arXiv:1811.04231, Nov. 2018. [Github]
W. I. Cho, Y. K. Moon, W. H. Kang, and N. S. Kim, "Extracting arguments from Korean question and command: An annotated corpus for structured paraphrasing," arXiv preprint arXiv:1810.04631, Oct. 2018. [Github]
Dialogue and Conversational Agents
W. I. Cho, S. Kim, E. Choi, and Y. Jeong, "Assessing how users display self-disclosure and authenticity in conversation with human-like agents: A case study of Luda Lee," in Findings of ACL: AACL-IJCNLP 2022, Nov. 2022, pp. 145-152.
W. I. Cho*, Y. K. Lee*, S. Bae, J. Kim, S. Park, M. Kim, S. Hahn, and N. S. Kim, "When crowd meets persona: Creating a large-scale open-domain persona dialogue corpus," HCOMP WiP, Nov. 2022. (Poster) [Github]
W. I. Cho, S. Kim, E. Choi, and Y. Jeong, "Evaluating how users game and display conversation with human-like agents," in Proc. CODI, Oct. 2022, pp. 19-27. (Oral only)
Y. K. Lee*, W. I. Cho*, S. Bae, H. Choi, J. S. Park, N. S. Kim, and S. Hahn, "'Feels like I've known you forever': Empathy and self-awareness in human open-domain dialogs," in Proc. CogSci, Jul. 2022. (Poster with abstract) [PsyArXiv]
Computational Linguistics and Figurative Language
W. I. Cho, E. Chersoni, Y. Hsu, and C. Huang, "Modeling the influence of verb aspect on the activation of typical event locations with BERT," in Findings of ACL: ACL 2021, Aug. 2021, pp. 2922-2929. [Github]
W. I. Cho and N. S. Kim, "Pay attention to categories: Syntax-based sentence modeling with metadata projection matrix," in Proc. PACLIC 34, Oct. 2020, pp. 51-60. (Oral, Virtual)
W. I. Cho, W. H. Kang, and N. S. Kim, "HashCount at SemEval-2018 Task 3: Concatenative featurization of Tweet and hashtags for irony detection," in Proc. SemEval, Jun. 2018, pp. 546-552. [Slide] (Poster)
W. I. Cho, W. H. Kang, H. S. Lee, and N. S. Kim, "Detecting oxymoron in a single statement," in Proc. O-COCOSDA, Nov. 2017, pp. 48-52. [Slide] (Oral)
AI for Social Good, Computational Social Science
W. I. Cho*, E. Cho*, and H. Shin*, "Three disclaimers for safe disclosure: A cardwriter for reporting the use of generative AI in writing process," arXiv preprint arXiv:2404.09041, Apr. 2024.
W. I. Cho*, E. Cho*, and K. Cho, "PaperCard: Towards explainable machine assistance in academic writing," EAAMO 2023, Oct. 2023. [arXiv] [hal] (Poster, non-archival)
S. Min, D. Shin, S. J. Rhee, C. H. K. Park, J. H. Yang, Y. Song, M. J. Kim, K. Kim, W. I. Cho, O. C. Kwon, and Y. M. Ahn, "Acoustic analysis of speech for screening for suicide risk: Machine learning classifiers for between- and within-person evaluation of suicidality," Journal of Medical Internet Research, vol. 25, p. e45456, 2023.
K. Yang*, W. Jang*, and W. I. Cho*, "APEACH: Attacking pejorative expressions with analysis on crowd-generated hate speech evaluation datasets," in Findings of ACL: EMNLP 2022, Dec. 2022, pp. 7076–7086. [Github] [arXiv]
D. Shin*, K. Kim*, S.-B. Lee, C. Lee, Y. S. Bae, W. I. Cho, M. J. Kim, C. H. K. Park, E. K. Chie, N. S. Kim, and Y. M. Ahn, "Detection of depression and suicide risk based on text from clinical interviews using machine learning: Possibility of a new objective diagnostic marker," Frontiers in Psychiatry, vol. 13, 2022.
W. I. Cho and J. Moon, "How does the hate speech corpus concern sociolinguistic discussions? A case study on Korean online news comments," in Proc. NLP4DH, ICON Workshop, Dec. 2021, pp. 13-22. (Oral, Virtual)
W. I. Cho and S. Kim, "Google-trickers, Yaminjeongeum, and Leetspeak: An empirical taxonomy for intentionally noisy user-generated text," in Proc. W-NUT, EMNLP Workshop, Nov. 2021, pp. 56-61. (Oral, Virtual)
S. Kim, C. Oh, W. I. Cho, D. Shin, B. Suh, and J. Lee, "Trkic G00gle: Why and how users game translation algorithms," PACM HCI, vol. 5, no. CSCW2, Oct. 2021.
D. Shin, W. I. Cho, C. H. K. Park, S. J. Rhee, M. J. Kim, H. Lee, N. S. Kim, and Y. M. Ahn, "Detection of minor and major depression through voice as a biomarker using machine learning," Journal of Clinical Medicine, vol. 10, no. 14, 3046, 2021.
W. I. Cho, S. J. Cheon, W. H. Kang, J. W. Kim, and N. S. Kim, "Giving space to your message: Assistive word segmentation for the electronic typing of digital minorities," in Proc. ACM DIS, Jun. 2021, pp. 1739–1747. [arXiv] [Github] [Video] (Non-ference)
W. I. Cho*, J. W. Kim*, J. Yang, and N. S. Kim, "Towards cross-lingual generalization of translation gender bias," in Proc. ACM FAccT, Mar. 2021, pp. 449-457. (Oral only, Virtual)
J. Moon*, W. I. Cho*, and J. Lee, "BEEP! Korean corpus of online news comments for toxic speech detection," in Proc. SocialNLP, ACL Workshop, Jul. 2020, pp. 25-31. [arXiv] [Github] (Oral only, Virtual)
W. I. Cho, J. W. Kim, S. M. Kim, and N. S. Kim, "On measuring gender bias in translation of gender-neutral pronouns," in Proc. GeBNLP, ACL Workshop, Aug. 2019, pp. 173-181. [Slide] [arXiv] [Github] (Oral)
Works on Korean NLP
W. I. Cho, S. Moon, and Y. Song, "Revisiting Korean corpus studies through technological advances," in Proc. PACLIC, Dec. 2023. (Poster)
S. Moon, W. I. Cho, H. J. Han, N. Okazaki, and N. S. Kim, "OpenKorPOS: Democratizing Korean tokenization with voting-based open corpus annotation," in Proc. LREC, Jun. 2022, pp. 4975-4983. (Poster, Virtual)
S. Park*, J. Moon*, S. Kim*, W. I. Cho*, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh, J. Lee, J. Oh, S. Lyu, Y. Jeong, I. Lee, S. Seo, D. Lee, H. Kim, M. Lee, S. Jang, S. Do, S. Kim, K. Lim, J. Lee, K. Park, J. Shin, S. Kim, L. Park, A. Oh, J. Ha, and K. Cho, "KLUE: Korean language understanding evaluation," in Proc. NeurIPS, Track on Datasets and Benchmarks, Dec. 2021. [arXiv] (Poster, Virtual)
W. I. Cho, S. M. Kim, H. Cho, and N. S. Kim, "kosp2e: Korean speech to English translation corpus," in Proc. Interspeech, Aug. 2021, pp. 3705-3709. [arXiv] [Github] (Virtual)
W. I. Cho, S. Moon, and Y. Song, "Open Korean corpora: A practical report," in Proc. NLP-OSS, EMNLP Workshop, Nov. 2020, pp. 85–93. [arXiv] [Github] (Oral only, Virtual)
W. I. Cho, S. M. Kim, and N. S. Kim, "Towards an efficient code-mixed grapheme-to-phoneme conversion in an agglutinative language: A case study on to-Korean transliteration," in Proc. W-CALCS, LREC Workshop, May 2020, pp. 65-70. [Github] (Canceled due to COVID-19 outbreak)
W. I. Cho, S. M. Kim, and N. S. Kim, "Investigating an effective character-level embedding in Korean sentence classification," in Proc. PACLIC 33, Sep. 2019, pp. 10-18. [Slide] [arXiv] [Github] (Oral)
Speech and Acoustics
W. I. Cho, J. Kim, and N. S. Kim, "Cross-modal knowledge distillation with dropout-based confidence," in Proc. APSIPA ASC, Nov. 2022, pp. 653-657. (Oral only)
H. Y. Kim, J. W. Yoon, W. I. Cho, and N. S. Kim, "Neurally optimized decoder for low bitrate speech codec," IEEE Signal Processing Letters, Dec. 2021.
Y. R. Jo, Y. K. Moon, M. Jung, J. Choi, J. Moon, and W. I. Cho, "VUS at IWSLT 2021: A finetuned pipeline for offline speech translation," in Proc. IWSLT, Aug. 2021, pp. 120-124. (Virtual)
Y. R. Jo*, Y. K. Moon*, W. I. Cho, and G. S. Jo, "Self-attentive VAD: Context-aware detection of voice from noise," in Proc. ICASSP, Jun. 2021, pp. 6808-6812. (Oral only, Virtual)
J. W. Yoon, H. Lee, H. Y. Kim, W. I. Cho, and N. S. Kim, "TutorNet: Towards flexible knowledge distillation for end-to-end speech recognition," IEEE/ACM TASLP, vol. 29, pp. 1626-1638, 2021. [arXiv]
W. I. Cho, D. Kwak, J. W. Yoon, and N. S. Kim, "Speech to text adaptation: Towards an efficient cross-modal distillation," in Proc. Interspeech, Oct. 2020, pp. 896-900. [arXiv] (Oral only, Virtual)
I. K. Choi, S. H. Bae, S. J. Cheon, W. I. Cho, and N. S. Kim, "Weakly labeled acoustic event detection using local detector and global classifier," in Proc. APSIPA ASC, Dec. 2017, pp. 1735-1738. (Oral)
W. H. Kang, W. I. Cho, S. Y. Jang, H. S. Lee, and N. S. Kim, "I-vector extraction using speaker relevancy for short duration speaker recognition," in Proc. ICITCS 2017, Aug. 2017, pp. 79-87. (Oral)
Resources (Datasets & Toolkits)
Datasets
OPELA [Github] (Dataset)
Open-domain conversations by personas with empathy, long-term memory, and attractive personality
APEACH [Github] (Dataset)
Attacking pejorative expressions with analysis on crowd-generated hate speech evaluation datasets
KLUE [Github] [Leaderboard] (Dataset)
Korean Language Understanding Evaluation
kosp2e [Github] (Dataset)
Korean Speech to English Translation Corpus
BERT-for-Surprisal [Github]
Python Implementation of "Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT" (Findings of ACL: ACL 2021)
TGBI-X [Github] (Dataset)
Dataset for evaluating translation gender bias in four different language pairs [Paper]
StyleKQC [Github] (Dataset)
Style-variant paraphrase corpus for Korean Questions and Commands [Paper]
BEEP! [Github] (Dataset)
Korean Hate Speech Dataset (online news comments) [Paper]
ParaKQC [Github] (Dataset & Toolkit)
Parallel dataset of Korean Questions and Commands for sentence similarity test and paraphrase detection [Paper]
sae4K [Github] (Dataset)
A parallel corpus in Korean for a natural language format intent extraction [Paper]
TGBI [Github] (Dataset)
Dataset for evaluating translation gender bias index in ko-en [Paper]
ProSem [Github] (Dataset)
Korean speech corpus of syntactically ambiguous sentences [Paper]
3i4K [Github] (Dataset & Toolkit)
7-class intention identification for Korean conversation-style utterances [Paper]
Toolkits
Cardwriter [Github] (Toolkit)
Simple tool for declaring the usage or non-usage of generative AI in the writing process [Demo]
translit2k [Github] (Toolkit)
PPAP [Github] (Toolkit)
Patent process accelerating program
As a finalist of KPMG Ideation challenge Korea, w/ Ayoung Byun, Jungyeon Lee, and Hosoo Cho
RAWS [Github] (Toolkit)
KorEmo: 5-class Korean speech emotion classifier [Github] (Toolkit)
KorInto: 5-class sentence-final intonation classifier for a syllable-timed and head-final language (Korean) [Github] (Toolkit)
Articles & Tutorials
Open Korean Corpora [Github]
A Living Document for Korean NLP Dataset Curation
Hate speech corpus construction [Slide] [Video] (Tutorial)
Building a Dataset to Measure Toxicity and Social Bias within Language: A Low-Resource Perspective
StorylineNLP [Github] (Tutorial)
A Curriculum-Style Introduction on Computational Linguistics
sentComp [Github] (Tutorial)
Materials for 'Automaton Theories of Human Sentence Comprehension' (Hale, 2014), w/ Kihyo Bhak
OmniKSA [Github] (Article in Korean)
Speech act and its analysis for the (spoken) Korean language: An omnibus description
CoAudioText [Github] (Tutorial)
Tutorial for the audio-text co-utilization (multimodal analysis) regarding the disambiguation of syntactically ambiguous sentences in Korean [Paper]
KCharEmb [Github] (Tutorial)
DLK2NLP [Github] (Tutorial)
Day-by-day Line-by-line Keras-based Korean NLP
Domestic Publications
Journal & Proceedings (Selected)
송영숙, 조원익, "A study on directions for improving speech act annotation between humans and AI models," Eomun Yeongu, vol. 52, no. 1, pp. 71-92, 2024.
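송영숙, 조원익, "A study on speech act-based analysis of daily conversation and natural language generation," 243rd Meeting of the Society for Korean Language and Literary Research, 2023. (Oral)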
송영숙, 조원익, 박장원, 김성동, "A study on how cataphoric expressions convey meaning in social media news headlines," Korean Semantics, vol. 71, pp. 75-92, 2021.
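조원익, 문지형, "A study on the methodology for constructing a Korean hate speech corpus: Focusing on the characteristics of malicious online comments," in Proc. 32nd Annual Conference on Human and Language Technology, 2020, pp. 298-303. [Slide] [Github] (Oral, Virtual)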
문영기, 조용래, 조원익, 조근식, "Joint CTC/attention Korean speech recognition using CTC ratio scheduling," in Proc. 32nd Annual Conference on Human and Language Technology, 2020, pp. 37-41. (Oral, Virtual)
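김석민, 김정훈, 김형용, 조원익, 김남수, "Preprocessing techniques for Transformer-based speech translation," in Proc. Korean Institute of Electromagnetic Engineering and Science Summer Conference, 2020, p. 276. (Oral)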
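조원익, 문영기, 김종인, 김남수, "Key phrase extraction from directive utterances using discourse components: Korean parallel corpus construction and data augmentation methodology," in Proc. 31st Annual Conference on Human and Language Technology, 2019, pp. 241-245. [Slide] [Github] (Oral)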
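조원익, 김남수, "Resolving ambiguity in text intention identification through discourse component-based Korean speech act classification: A computational linguistic approach," Discourse and Cognition, vol. 26, no. 3, pp. 227-247, 2019. [Github]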
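김석민, 조원익, 김형주, 김남수, "Deep learning-based emotion recognition algorithm using a multimodal approach," in Proc. Korean Institute of Communications and Information Sciences Summer Conference, 2019, pp. 1225-1226. (Poster)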
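조원익, 천성준, 김지원, 김남수, "Concept and application of deep learning-based automatic word spacing considering sentence information," in Proc. 30th Annual Conference on Human and Language Technology, 2018, pp. 181-184. [Slide] [Github] (Oral)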
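조원익, 이강현, 김정훈, 유주현, 김지환, 김남수, "An algorithm for detecting self-contradiction in sentences," in Proc. Korean Institute of Communications and Information Sciences Summer Conference, 2017, pp. 1688-1689. (Oral)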
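조원익, 김정훈, 천성준, 김남수, "Rule-based four-part chorale music generation with a chord progression learning model," The Journal of Korean Institute of Communications and Information Sciences, vol. 41, no. 11, pp. 1456-1462, 2016.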
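강우현, 조원익, 강태균, 김남수, "Multiple discriminative DNNs for i-vector based open-set language recognition," The Journal of Korean Institute of Communications and Information Sciences, vol. 41, no. 8, pp. 958-964, 2016.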
조원익, 이철민, 김형용, 장세영, 김남수, "Rule-based four-part chorale score generation," in Proc. Korean Institute of Communications and Information Sciences Summer Conference, 2016, pp. 1492-1494. (Poster)
Patents
조원익, 김남수, 김종인, "Human-friendly goal-oriented dialogue system and method," application no. 10-2020-0107917, filed Aug. 26, 2020
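김남수, 조원익, 곽동현, "Method, system, and computer-readable recording medium for knowledge distillation of end-to-end spoken language understanding using a text-based pre-trained model," application no. 10-2020-0106719, filed Aug. 25, 2020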
정민화, 이규환, 조원익, 김종인, 정지오, "Speech recognition method for determining whether to respond to a natural language sentence using acoustic and textual information," application no. 10-2019-0165579, filed Dec. 12, 2019
김남수, 조원익, "System and method for structured paraphrasing of non-canonical question or request utterances," application no. 10-2019-0134120, filed Oct. 25, 2019
김남수, 조원익, "Multi-hot vector embedding method and system for concise representation of Hangul syllables," application no. 10-2018-0167960, filed Dec. 21, 2018 [Kipris]
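김남수, 조원익, "Deep learning-based method and system for word spacing of conversational sentences using contextual information," application no. 10-2018-0108009, filed Sep. 10, 2018, registration no. 10-20866040000, registered Mar. 3, 2020 [Kipris]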
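김남수, 조원익, "Method and system for classifying Korean conversational corpora considering discourse components and speech acts," application no. 10-2018-0093966, filed Aug. 10, 2018, registration no. 10-20200018121, registered Feb. 19, 2020 [Kipris]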
김남수, 조원익, "Method and system for generating four-part chorale scores using language modeling," application no. 10-2017-0114664, filed Sep. 7, 2017, registration no. 10-1900020, registered Sep. 12, 2018 [Kipris]