Sentence End and Punctuation Prediction in NLG Text

1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) held at SwissText 2021

Introduction

Punctuation marks in automatically generated texts, such as translations or transcriptions, may be misplaced for several reasons. Detecting the end of a sentence and placing an appropriate punctuation mark improves the quality of such texts, not only by preserving the original meaning but also by enhancing their readability.

The goal of the shared task is to build models that identify sentence ends and the positions at which a punctuation mark should be placed. Specifically, we offer the following subtasks:

Subtask 1 (fully unpunctuated sentences, full stop detection): Given the textual content of an utterance from which all full stops have been removed, correctly detect the sentence ends by placing a full stop in the appropriate positions.

Subtask 2 (fully unpunctuated sentences, full punctuation marks): Given the textual content of an utterance from which all punctuation marks have been removed, correctly predict all punctuation marks.

Participants may choose to participate in one or both of the subtasks.

Proceedings

The proceedings of the shared task (task description paper and system description papers) are published: http://ceur-ws.org/Vol-2957/

Participation

All interested researchers are invited to register by filling out a registration form. Registered participants will receive a link to the data as well as access to a leaderboard to track their progress on all subtasks. All participants will be invited to submit a paper to the shared task proceedings at the Swiss Text Analytics Conference 2021 (SwissText 2021).

Data

Ultimately, the goal of SEPP-NLG is to predict sentence ends and punctuation in NLG texts. However, there are no corpora that feature NLG texts together with their manually transcribed and corrected versions. Therefore, we approximate the setting by a) using transcripts of spoken texts and b) lower-casing the input and removing all punctuation. While there are many corpora of transcribed spoken language, we chose the Europarl corpus as the source for our data, as it features transcripts in multiple languages.

Participants are free to choose the languages for which they make a submission, but are encouraged to participate in all of them. The data covers English, German, French, and Italian and is available here: https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf

The data format is as follows: The lower-cased tokens of each file are listed vertically, one per line, and the labels for subtask 1 (binary classification) and subtask 2 (multiclass classification) are appended horizontally, separated by tabs (see picture below). The labels encode whether a token is followed by a sentence end (subtask 1) or by a punctuation symbol (subtask 2).

NOTE: Excel/LibreOffice might incorrectly display/align the label columns. Try viewing the TSV files in a text editor if the labels seem off.
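The format described above can be read with a few lines of Python. This is a minimal sketch, not part of the official task scripts: the function name `read_sepp_tsv` is hypothetical, and the column order (token, subtask 1 label, subtask 2 label) as well as the integer encoding of the subtask 1 label are assumptions based on the description above.

```python
import io


def read_sepp_tsv(handle):
    """Read rows of (token, subtask1_label, subtask2_label) from a TSV stream.

    Assumed layout (not confirmed by the task files): three tab-separated
    columns per line, in the order token, subtask 1 label, subtask 2 label.
    """
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        token, sentence_end, punctuation = line.split("\t")
        rows.append((token, int(sentence_end), punctuation))
    return rows


# Made-up sample content, for illustration only.
sample = "hello\t0\t,\nhow\t0\t0\nare\t0\t0\nyou\t1\t?\n"
rows = read_sepp_tsv(io.StringIO(sample))
for token, sentence_end, punctuation in rows:
    print(token, sentence_end, punctuation)
```

For a real file, pass a handle opened with `open(path, encoding="utf-8")` instead of the in-memory sample.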

UPDATE (14.4.2021): We have received detailed feedback and comments on the initial data release (thank you!). Based on this feedback, we applied changes to the data that affect both subtasks. Most importantly, we consolidated the selection of punctuation symbols for subtask 2 to “: - , ? . 0” (0 indicating no punctuation) and mapped “! ;” to “.”. We believe this to be a more realistic setting when processing NLG texts such as STT output. We also removed all sentences that contain other punctuation symbols such as parentheses, as there is no straightforward way to remove such punctuation without interfering with the naturalness of a sentence. This removed less than 10% of the data per language. In addition, we removed HTML artifacts and special (non-visible) characters (zero width space, soft hyphen). Please use the new version (sepp_nlg_2021_train_dev_data_v5.zip) for your experiments and submissions. It can be found on the task drive: https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf
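To illustrate the consolidated label set, the following sketch derives subtask 2 labels from a punctuated sentence. It is an illustration only and not the organizers' preprocessing: the function name `derive_subtask2_labels`, the whitespace tokenization, and the rule that a standalone punctuation token is attached to the preceding word are all assumptions, and only a single trailing symbol per token is handled.

```python
# Label set and mapping as stated in the update above.
KEPT_SYMBOLS = {":", "-", ",", "?", "."}
MAPPED_SYMBOLS = {"!": ".", ";": "."}
NO_PUNCTUATION = "0"


def derive_subtask2_labels(text):
    """Return (lower-cased token, subtask 2 label) pairs for a punctuated text.

    Assumption: tokens are split on whitespace and carry at most one
    trailing punctuation symbol.
    """
    rows = []
    for raw in text.split():
        symbol = MAPPED_SYMBOLS.get(raw[-1], raw[-1])
        if symbol in KEPT_SYMBOLS:
            word, label = raw[:-1], symbol
        else:
            word, label = raw, NO_PUNCTUATION
        if not word:
            # Standalone punctuation (e.g. a dash between words):
            # attach its label to the previous token, if there is one.
            if rows:
                rows[-1] = (rows[-1][0], label)
            continue
        rows.append((word.lower(), label))
    return rows


labelled = derive_subtask2_labels("Hello, world! Is it late; yes - no?")
for token, label in labelled:
    print(f"{token}\t{label}")
```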

UPDATE (8.6.2021): The full task data has been uploaded to the shared task drive (sepp_nlg_2021_data_v5.zip).

Evaluation Metrics

Subtask 1: F1 score of the positive class (sentence end)
Subtask 2: Macro F1
The evaluation scripts can be downloaded here: https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf
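For orientation, both metrics can be sketched in plain Python. This is not the official evaluation script, which remains authoritative. The function names are hypothetical, and averaging over the classes that occur in the gold or predicted labels is an assumption about how the macro average is formed.

```python
def f1_for_class(gold, pred, cls):
    """F1 score of a single class, computed from token-level labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == cls and p == cls)
    fp = sum(1 for g, p in pairs if g != cls and p == cls)
    fn = sum(1 for g, p in pairs if g == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=None):
    """Unweighted mean of the per-class F1 scores.

    Assumption: if no class list is given, average over every class
    that occurs in the gold or predicted labels.
    """
    if classes is None:
        classes = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Subtask 1: F1 of the positive class (1 = sentence end). Made-up labels.
gold_1 = [0, 0, 1, 0, 1]
pred_1 = [0, 1, 1, 0, 0]
subtask1_score = f1_for_class(gold_1, pred_1, 1)

# Subtask 2: macro F1 over the punctuation classes. Made-up labels.
gold_2 = [",", "0", "0", ".", "0", "?"]
pred_2 = [",", "0", ".", ".", "0", "."]
subtask2_score = macro_f1(gold_2, pred_2)

print(subtask1_score, subtask2_score)
```

Because the macro average weights all classes equally, rare symbols such as “?” influence the subtask 2 score as much as the frequent “0” class.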

Submission guidelines

To submit your solution, create a zip archive containing your predictions for the dev and test sets per language, i.e. a zip containing en/dev/*.tsv, fr/test/*.tsv, etc. Please verify that the evaluation scripts are able to process your dev set predictions correctly (the evaluation scripts are available in the task drive). Partial submissions for only one language and/or subtask are welcome, but we encourage you to participate in all languages.

Name your submission zip file after your team (e.g. ‘ZHAW_sepp_nlg_2021_submission_3.zip’), put it on a server or drive, and send us an email containing the link through which we can download it. If you do not have a server where you can put the files, consider using services such as Dropbox, Google Drive, or WeTransfer.

Please provide a technical description of your system using the ACL template: http://acl2020.org/downloads/acl2020-templates.zip The description can be up to 8 pages long (excluding references). Please send us your PDF via email (see schedule).
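The archive layout described above can be produced with Python's standard library. This is a minimal sketch under stated assumptions: the function name `build_submission_zip`, the local `predictions` folder, the team prefix `TEAM`, and the content of the example files are all hypothetical placeholders. Only the `<language>/<split>/*.tsv` layout and the file name pattern come from the guidelines.

```python
import tempfile
import zipfile
from pathlib import Path


def build_submission_zip(predictions_root, zip_path):
    """Pack <language>/<split>/*.tsv files into a zip, keeping relative paths."""
    root = Path(predictions_root)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for tsv in sorted(root.glob("*/*/*.tsv")):
            # Store e.g. "en/dev/example.tsv" rather than the absolute path.
            archive.write(tsv, arcname=tsv.relative_to(root).as_posix())
    return zip_path


# Demonstration with placeholder prediction files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "predictions"
    for lang, split in [("en", "dev"), ("fr", "test")]:
        folder = root / lang / split
        folder.mkdir(parents=True)
        (folder / "example.tsv").write_text("hello\t0\t0\n", encoding="utf-8")

    zip_path = Path(tmp) / "TEAM_sepp_nlg_2021_submission_1.zip"
    build_submission_zip(root, zip_path)

    with zipfile.ZipFile(zip_path) as archive:
        archive_names = sorted(archive.namelist())

print(archive_names)
```

Listing the archive contents before sending the link is a cheap way to check that the language and split folders are at the top level of the zip.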

Data format example

Results

We provide a spaCy-based baseline for subtask 1 and a BERT-based baseline covering both subtasks. The scripts can be downloaded here: https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf

SEPP-NLG 2021 Evaluation results

Schedule

Registration: March 01, 2021

Training and development sets released: March 01, 2021

Test set evaluation starts: May 03, 2021

Registration closes: May 08, 2021

Test set evaluation ends: May 10, 2021

Paper submission deadline: May 17, 2021

Notification to authors: June 01, 2021

Camera-ready papers due: June 10, 2021

Program (Monday, 14.6)

9:30 - 9:40: HTW+t2k (Winner)

9:40 - 9:50: OnPoint (Winner)

9:50 - 10:00: Unbabel-INESC-ID (Winner)

10:00 - 10:10: OneNLP

10:10 - 10:20: UR-mSBD

10:20 - 10:30: HULAT_UC3M

Organizers

Don Tuggener, ZHAW, InIT (tuge@zhaw.ch)

Ahmad Aghaebrahimian, ZHAW, InIT (agha@zhaw.ch)
