FinSBD-2019 Shared Task

Sentence Boundary Detection in PDF Noisy Text in the Financial Domain

Introduction

Sentences are basic units of written language, and detecting the beginnings and ends of sentences, or sentence boundary detection (SBD), is a foundational first step in many Natural Language Processing (NLP) applications, such as POS tagging; syntactic, semantic, and discourse parsing; information extraction; or machine translation.

Despite its important role in NLP, sentence boundary detection has so far not received enough attention. Previous research in the area has been confined to formal texts (news, European Parliament proceedings, etc.), where existing rule-based and machine learning approaches are extremely accurate, provided the data is perfectly clean. No sentence boundary detection research to date has addressed the problem in noisy texts extracted automatically from machine-readable files (generally in PDF format) such as financial documents.

In this shared task, we focus on extracting well segmented sentences from financial prospectuses by detecting their beginning and ending boundaries. These are official PDF documents in which investment funds precisely describe their characteristics and investment modalities. The most important step in extracting any information from these files is to parse them to obtain noisy unstructured text, clean it, format the information (by adding several tags) and, finally, transform it into semi-structured text where sentence boundaries are well marked.

Shared Task

News

  • Registration will be open until May 13, 2019, the day on which systems' outputs will be collected
  • We encourage all participants to send a description of their methods for the shared task peer review process (see Important Dates for more information). Accepted papers will be published alongside the other workshop papers in the ACL proceedings and will have either oral or poster presentations during the workshop.

Task Description

As part of the FinNLP workshop, we present a shared task on sentence boundary detection in noisy text extracted from financial prospectuses, in two languages: English and French.

Systems participating in this shared task will be given a set of textual documents extracted from PDF files, which are to be automatically segmented into a set of well delimited (clean) sentences.

Participants can choose to work on both languages, or submit systems for one language only.

In addition to the textual version of the documents, we will provide their PDF original files. Recommendations of additional language resources will also be listed/provided for some languages by the organizers.

The task is open to everyone. The only exception are the co-chairs of the organizing team, who cannot submit a system, and who will serve as an authority to resolve any disputes concerning ethical issues or completeness of system descriptions.

Data Format

In the provided dataset, participants will get a JSON file containing a "text" field, which corresponds to the text to segment; "begin_sentence" and "end_sentence" list the indexes of the tokens marking the beginning and the end of well-formed sentences in the text. Note that the provided text was already word-tokenized using NLTK; participants should keep this tokenization as is, since all token indexes are based on it. The first token in the text thus has index 0.

[{
  'text': " UFF Sélection Alpha AINFORMATIONS CLÉS POUR L' INVESTISSEUR « Ce document fournit des informations essentielles aux investisseurs de cet OPCVM . Il ne s' agit pas d' un document promotionnel . Les informations qu ' il contient vous sont fournies conformément à une obligation légale , afin de vous aider à comprendre en quoi consiste un investissement dans ce fonds et quels risques y sont associés . ...",
  'begin_sentence': [8, 21, 31, ...],
  'end_sentence': [20, 30, 66, ...]
}]
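To make the format concrete, here is a minimal sketch (the record below is a toy example, not real task data) of how sentences can be reconstructed from the index lists. Since the text is pre-tokenized, splitting on whitespace recovers the token sequence the indexes refer to, and in the sample above each begin/end pair appears to delimit a sentence inclusively.

```python
# Toy record in the task's format (contents are illustrative only).
record = {
    "text": "Header tokens here . This is a sentence . Another one follows .",
    "begin_sentence": [4, 9],
    "end_sentence": [8, 12],
}

def extract_sentences(record):
    # The text is already NLTK-tokenized, so whitespace splitting
    # recovers the token list (the first token has index 0).
    tokens = record["text"].split()
    sentences = []
    for b, e in zip(record["begin_sentence"], record["end_sentence"]):
        # End indexes appear to be inclusive, matching the sample record.
        sentences.append(" ".join(tokens[b:e + 1]))
    return sentences

print(extract_sentences(record))
# → ['This is a sentence .', 'Another one follows .']
```

Note that the leading header tokens (indexes 0-3) belong to no sentence, which is exactly the kind of noise the task asks systems to skip over.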

All of the input text will be preprocessed in a common way, so that all participants have access to the same features with no additional preprocessing overhead. Rule-based, machine learning, deep learning, and hybrid techniques are all allowed.

Participants will get annotated training/dev data, and later blind test data in the same JSON format but containing only the text. They should then predict the begin_sentence and end_sentence lists and submit the result in the same JSON format as the training data.
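A submission can then be produced by filling in the predicted index lists and dumping the records back to JSON. In this sketch, dummy_predict is a deliberately trivial placeholder (it treats every "." token as a sentence end) standing in for whatever system a team builds:

```python
import json

def dummy_predict(text):
    # Placeholder predictor: every "." token closes a sentence and the
    # next token opens a new one. Real systems replace this function.
    tokens = text.split()
    begins, ends = [], []
    start = 0
    for i, tok in enumerate(tokens):
        if tok == ".":
            begins.append(start)
            ends.append(i)
            start = i + 1
    return begins, ends

def make_submission(records):
    # Return records in the same JSON format as the training data.
    out = []
    for rec in records:
        begins, ends = dummy_predict(rec["text"])
        out.append({"text": rec["text"],
                    "begin_sentence": begins,
                    "end_sentence": ends})
    return out

records = [{"text": "Hello world . Bye ."}]
submission = make_submission(records)
print(json.dumps(submission, ensure_ascii=False))
```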

Evaluation

The evaluation metrics will include Precision, Recall, and F-score of the predicted sentence beginnings and endings. The F-score will be the official metric.

An evaluation script will be provided to all the teams.
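The official scoring script is not reproduced here, but boundary detection is typically scored by comparing the predicted and gold index lists as sets of positions. A sketch under that assumption, for one boundary type (begin or end):

```python
def prf(gold, pred):
    """Precision, recall, and F1 for one boundary type,
    treating the index lists as sets of token positions."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # correctly predicted boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 3 gold end-boundaries, 2 of 4 predictions are correct.
p, r, f = prf(gold=[20, 30, 66], pred=[20, 30, 40, 50])
print(p, r, f)  # → 0.5, 0.666..., 0.571...
```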

Important Dates

  • February 28, 2019: First announcement of the shared task and beginning of registration
  • March 7, 2019: Release of training data and scoring script
  • April 29, 2019: Registration deadline
  • May 6, 2019: Test set made available
  • May 13, 2019: Systems' outputs collected
  • May 27, 2019: Shared task system paper submissions due
  • June 17, 2019: Notification of acceptance
  • June 24, 2019: Camera-ready version of shared task system papers due
  • August 10-12, 2019: FinNLP 2019 Workshop in Macau

Submission Details

Each team is allowed to submit up to two runs for each language. In other words, a team can test several methods or parameter settings and submit the two they prefer.

Please structure your test results as follows:

  • One file per language, named <team><N>.<language>.test, where
    • <team> stands for your team name (please use only ASCII letters, digits and “-” or “_”)
    • <N> (1 or 2) is the run number
    • <language> stands for the language (en or fr)
  • The file contents and format should be the same as the gold standard files provided with the sample and training data.
  • Put all files in one directory called <team>
  • Create an archive with the contents of this directory (either <team>.tar.bz2, <team>.tar.gz, or <team>.zip)
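For a hypothetical team named "acme" submitting two English runs, the packaging steps above might look like this (the file names and contents are placeholders; your actual prediction files go in their place):

```shell
#!/bin/sh
# Collect result files in a directory named after the team.
mkdir -p acme
# In practice these would be your systems' JSON prediction files:
echo '[]' > acme/acme1.en.test
echo '[]' > acme/acme2.en.test

# Create one archive containing the whole team directory.
tar czf acme.tar.gz acme

# List the archive contents to double-check before sending.
tar tzf acme.tar.gz
```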

BEFORE MAY 13, please send the archive as an attachment, together with a factual summary of your team and method:

To: FinSBD2019.shared.task@gmail.com

Shared Task Co-organizers - Fortia Financial Solutions

  • Sira Ferradans sira.ferradans@fortia.fr
  • Abderrahim Ait-Azzi abderrahim.aitazzi@fortia.fr
  • Guillaume Hubert guillaume.hubert@fortia.fr
  • Houda Bouamor hbouamor@qatar.cmu.edu

Registration

Participants need to register using the registration form below.

Once registered, all participating teams will be provided with a common training set for each language (English and French), which includes common input (noisy text extracted from PDF) and a corresponding output where sentence boundaries are marked (well segmented sentences). A common development set will also be provided. A blind test set will be used to evaluate the output of the participating teams.

Contact

Questions about the FinSBD-2019 shared task can be sent to the organizers directly using the following email address: FinSBD2019.shared.task@gmail.com