Data
Data sets
Licensing and attribution
Our training corpora are distributed under different licenses, which are summarized here. In addition, we ask that you cite the shared task overview paper once it can be referenced (a BibTeX entry does not exist yet).
FocusNews corpus
This dataset can be used only for the WMT-SLT shared task. Other uses of the data require express permission from the data owners. Contact us for further information.
SRF corpus
This dataset can be used for non-commercial research under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).
Overview of data sets
We provide separate training, development and test data. The training data is available right away. The development and test data will be released in several stages, starting with a release of the development sources only.
The training data comprises two corpora, FocusNews and SRF, which are described in more detail below. The linguistic domain of both corpora is general news, and both contain parallel data between Swiss German Sign Language (DSGS) and German. The corpora are distributed through Zenodo.
Training corpora statistics
Accessing the corpora
Direct download links:
Zenodo can also be accessed programmatically via its API. Here is an example showing how to download the FocusNews data. The first step is to "request access" on Zenodo. You will then receive an email with a personalized link of the form https://zenodo.org/record/6621480?token=<PERSONAL_TOKEN>.
If you would like to use our baseline code, there is no need to download the training, development or test data manually. The code downloads the data automatically.
# store cookie file
$ curl --cookie-jar zenodo-cookies.txt "https://zenodo.org/record/6621480?token=<PERSONAL_TOKEN>"
# find direct link to focusnews.zip in JSON response
$ curl --cookie zenodo-cookies.txt "https://zenodo.org/api/records/6621480"
...
"links":{"self":"https://zenodo.org/api/files/123456789-1234-1234-1234-ef92d689cc3c3b859/focusnews.zip"}
...
# download zip file
$ curl --cookie zenodo-cookies.txt "https://zenodo.org/api/files/123456789-1234-1234-1234-ef92d689cc3c3b859/focusnews.zip" > focusnews.zip
API access to Zenodo corpora example
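The same flow can also be scripted, for example in Python. The sketch below mirrors the curl commands above; the layout of the record JSON (here assumed to be a list of files, each with a links/self download URL, as in the response excerpt) may differ, so adapt the lookup if needed.

# Python sketch of the same Zenodo access flow (mirrors the curl example above).
# Assumes the record JSON lists files with a "links"/"self" download URL; adapt if needed.
import requests

TOKEN = "<PERSONAL_TOKEN>"  # personalized token received by email
RECORD_ID = "6621480"       # FocusNews record

session = requests.Session()

# Step 1: open the record page with the token so the access cookie is stored in the session
session.get(f"https://zenodo.org/record/{RECORD_ID}", params={"token": TOKEN})

# Step 2: query the record metadata and find the direct link to focusnews.zip
record = session.get(f"https://zenodo.org/api/records/{RECORD_ID}").json()
download_url = next(
    (f["links"]["self"] for f in record.get("files", [])
     if f["links"]["self"].endswith("focusnews.zip")),
    None,
)
assert download_url is not None, "focusnews.zip not found in record metadata"

# Step 3: download the zip file
with session.get(download_url, stream=True) as response:
    response.raise_for_status()
    with open("focusnews.zip", "wb") as outfile:
        for chunk in response.iter_content(chunk_size=1 << 20):
            outfile.write(chunk)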
Training corpus 1: FocusNews
(This text currently describes the following Zenodo release version: 1.3)
This data originates from FocusFive, a former Deaf online TV channel. We provide the news episodes (FocusNews), as opposed to other programs. The data consists of 197 videos of approximately 5 minutes each, with associated subtitles. The videos feature Deaf signers of DSGS and represent the source for translation. The German subtitles were created post-hoc by hearing sign language interpreters.
We provide episodes within the time range of 2008 (starting with episode 43) to 2014 (up to episode 278). The videos were recorded with different framerates, either 25, 30 or 50 fps. The video resolution is 1280 x 720.
While this data set is small by today's standards in spoken language machine translation, we emphasize the importance of using data from Deaf signers for shared tasks like ours: there are crucial differences between the signing of hearing interpreters and that of Deaf signers.
Training corpus 2: SRF
(This text currently describes the following Zenodo release version: 1.2)
These are daily national news and weather forecast episodes broadcast by Swiss national television (Schweizerisches Radio und Fernsehen, SRF). The episodes are narrated in the Standard German of Switzerland (which differs both from the Standard German of Germany and from Swiss German dialects) and interpreted into Swiss German Sign Language (DSGS). The interpreters are hearing individuals, some of them children of Deaf adults (CODAs).
The subtitles are partly preproduced and partly created live via respeaking based on automatic speech recognition.
While both the subtitles and the signing are based on the original speech (audio), the live subtitling and live interpreting scenario inevitably introduces a temporal offset between the audio and the subtitles, as well as between the audio and the signing. This is visualized in the figure below:
In our training corpus, the offset between the signing and the subtitles was manually corrected by Deaf signers with a good command of German. The live interview and weather forecast parts of each episode were ignored, as the quality of the subtitles tends to be noticeably lower for these parts.
Parallel data
The parallel data comprises 29 episodes of approximately 30 minutes each with the sign language videos (without audio track) and the corresponding subtitles.
We selected episodes from two time spans, 13/03/2020 to 19/06/2020 and 04/01/2021 to 26/02/2021, featuring three different sign language interpreters. (These three interpreters consented to having their likeness used for this shared task.)
The videos have a framerate of 25 fps and a resolution of 1280 x 720.
Monolingual data
In addition, as monolingual data, we provide all available German subtitles from 2014 to 2021. In total, there are 1949 subtitle files with a total of 570k sentences (after automatic segmentation).
Earlier release of a similar dataset
The data provided here is an extended version of the dataset published as part of the Content4All project (EU Horizon 2020, grant agreement no. 762021).
Development data
(This text currently describes the following release version: 3.0)
The development data consists of segments extracted from undisclosed SRF and FocusNews episodes (see above for a general description).
This data was also manually aligned, and the signer is a "known" person who also appears in the training set. The framerate of the development videos is 25 fps for SRF segments and 50 fps for FocusNews segments.
Test data
(This text currently describes the following release version: 3.0. Currently, only the test sources are released.)
We distribute separate test data for our two translation directions.
DSGS-to-German
Additional, undisclosed SRF and FocusNews episodes that are manually aligned. As with the development data, the signers are known persons, and the framerate of the videos is 25 fps for SRF and 50 fps for FocusNews.
German-to-DSGS
This subset of the test data has two distinct parts:
Additional, undisclosed FocusNews episodes that are manually aligned. As with the development data, the signers are known persons and the framerate of the videos is 50 fps.
New translations created specifically for this shared task. The domain is identical to the training data (general news). In this case, the German subtitles are the source for human translation and the DSGS videos are the target. The human translator is Deaf (in contrast to all of the SRF data, where the signers are hearing interpreters). The framerate of these videos is 50 fps and they are recorded in front of a green screen.
For German-to-DSGS translation we consider it important that the reference translations are created by Deaf signers rather than hearing interpreters.
Preprocessing
For each data set described above we provide videos and corresponding subtitles. In addition, we include pose estimates (location of body keypoints in each frame) as a convenience for participants.
Video processing
Videos are re-encoded with lossless H264 and use an mp4 container. The framerate of the videos is unchanged, i.e. either 25, 30 or 50 fps.
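Purely as an illustration of this step (the exact settings used to produce the released videos are not documented here), a lossless H264 re-encoding could look roughly as follows:

# Illustrative sketch only: lossless H264 re-encoding with ffmpeg, called from Python.
# The specific ffmpeg settings shown here are an assumption, not the released pipeline.
import subprocess

def reencode_lossless(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c:v", "libx264", "-crf", "0",  # CRF 0 selects lossless H264 encoding
            "-an",                           # the released sign language videos carry no audio track
            dst,                             # a .mp4 extension selects the mp4 container
        ],
        check=True,
    )

reencode_lossless("episode_original.mp4", "episode_lossless.mp4")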
We are not distributing the original videos, but versions preprocessed so that they only show the part of each frame where the signer is located (cropping), with the background replaced by a monochrome color (signer masking):
Original frame
Cropped
Masked
Cropping
We identify a rectangle (bounding box) where the signer is located in each frame, then crop the video to this region.
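As a rough sketch of what the cropping step amounts to (not the actual preprocessing code; the bounding box detection is not shown and the coordinates below are hypothetical):

# Minimal cropping sketch with OpenCV: keep only the region of each frame inside a
# given bounding box. How the signer bounding box is detected is not shown here.
import cv2

def crop_frame(frame, box):
    """Crop a frame (NumPy array) to a bounding box given as (x, y, width, height)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

video = cv2.VideoCapture("focusnews_episode.mp4")  # hypothetical file name
signer_box = (300, 0, 640, 720)                    # hypothetical signer region
ok, frame = video.read()
if ok:
    cv2.imwrite("cropped_frame.png", crop_frame(frame, signer_box))
video.release()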
Signer segmentation and masking
To the cropped video we apply an instance segmentation model, SOLOv2 (Wang et al., 2020), to separate the signer from the background. This produces a mask that can be superimposed on the cropped video to replace each background pixel in a frame with a grey color.
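Once a segmentation mask is available, the masking step itself is straightforward. A minimal sketch, assuming a binary mask with 1 for signer pixels (the SOLOv2 inference is not shown, and the grey value is hypothetical):

# Minimal masking sketch: replace background pixels with a grey colour, given a binary
# segmentation mask (1 = signer, 0 = background). Producing the mask with SOLOv2 is not shown.
import numpy as np

GREY = np.array([128, 128, 128], dtype=np.uint8)  # hypothetical grey value

def mask_background(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """frame: H x W x 3 uint8 image; mask: H x W array with 1 for signer pixels."""
    masked = frame.copy()
    masked[mask == 0] = GREY
    return masked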
Subtitle processing
For subtitles that are not manually aligned (all of FocusNews and monolingual SRF data), automatic sentence segmentation is used to re-distribute text across subtitle segments.
This process also adjusts timecodes heuristically where needed. For instance, if automatic sentence segmentation detects that a well-formed sentence ends in the middle of a subtitle, a new end time is computed. The new end time is interpolated within the subtitle's time span, proportionally to the position of the sentence's last character relative to the total length of the subtitle text. See Example 2 below for an illustration of this case.
81
00:05:22,607 --> 00:05:24,687
Die Jury war beeindruckt
82
00:05:24,687 --> 00:05:28,127
und begeistert von dieser gehörlosen Frau.
Original subtitle example 1
48
00:05:22,607 --> 00:05:28,127
Die Jury war beeindruckt und begeistert von dieser gehörlosen Frau.
After automatic segmentation example 1
7
00:00:24,708 --> 00:00:27,268
Die Invalidenversicherung Region Bern startete
8
00:00:27,268 --> 00:00:29,860
dieses Pilotprojekt und will herausfinden, ob man es
9
00:00:29,860 --> 00:00:33,460
zukünftig umsetzen kann. Es geht um die Umsetzung
Original subtitle example 2
4
00:00:24,708 --> 00:00:31,720
Die Invalidenversicherung Region Bern startete dieses Pilotprojekt und will herausfinden, ob man es zukünftig umsetzen kann.
After automatic segmentation example 2
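For concreteness, the heuristic can be written down as a short function. The sketch below is our reading of the description above (interpolating the end time within the subtitle's time span by character position); the actual segmentation code may differ in small details, as the worked example shows.

# Sketch of the proportional end-time heuristic described above (an approximation
# of the released preprocessing, not the actual code).
def adjusted_end_time(start: float, end: float, text: str, sentence_end: int) -> float:
    """Interpolate a new end time when a sentence ends inside a subtitle.

    start, end: original subtitle timecodes in seconds
    text: full subtitle text
    sentence_end: 1-based index of the sentence's last character in text
    """
    fraction = sentence_end / len(text)
    return start + fraction * (end - start)

# Example 2 above: the sentence ends after "kann." inside subtitle 9
# (00:00:29,860 --> 00:00:33,460, "zukünftig umsetzen kann. Es geht um die Umsetzung").
text = "zukünftig umsetzen kann. Es geht um die Umsetzung"
print(adjusted_end_time(29.860, 33.460, text, len("zukünftig umsetzen kann.")))
# ~31.62 s; the released file shows 00:00:31,720, so the exact computation differs slightly.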
Pose processing
"Poses" are an estimate of the location of body keypoints in video frames. The exact set of keypoints depends on the pose estimation system, well known ones are OpenPose and Mediapipe Holistic. Usually such a system provides 2D or 3D coordinates of keypoints in each frame, plus a confidence value for each keypoint.
The input for pose processing is the cropped and masked videos (see above). Like any machine learning system, pose estimation is not perfectly accurate and is expected to fail in some instances.
OpenPose
We use the OpenPose 135-keypoint model (as opposed to the 137-keypoint model, which is also widely used).
OpenPose often detects several people in our videos, even though only a single person is present. We distribute the original predictions, which contain all people that OpenPose detected.
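If you want a single detection per frame, one simple strategy is to keep the person with the highest average keypoint confidence. The sketch below assumes the standard OpenPose JSON layout (a "people" list with flat [x, y, confidence] triples); adjust the key names if the released files differ.

# Sketch: pick the most confident person from an OpenPose frame file.
# Assumes the standard OpenPose JSON layout ("people" -> "pose_keypoints_2d" as a flat
# [x, y, confidence, ...] list); key names may differ in the released files.
import json

def most_confident_person(frame_json_path):
    with open(frame_json_path) as f:
        frame = json.load(f)

    def mean_confidence(person):
        confidences = person["pose_keypoints_2d"][2::3]  # every third value is a confidence
        return sum(confidences) / len(confidences) if confidences else 0.0

    people = frame.get("people", [])
    return max(people, key=mean_confidence) if people else None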
MediaPipe Holistic
As an alternative, we also predict poses with the MediaPipe Holistic system developed by Google. Unlike our OpenPose model, it is a regression model and outputs 3D (X, Y, Z) coordinates. Values from Holistic are normalized between 0 and 1, rather than referring to actual video coordinates.
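To relate Holistic keypoints back to pixel positions in a frame, multiply the normalized x and y values by the frame width and height. A minimal sketch (attribute names follow MediaPipe's landmark objects; the usage line is hypothetical):

# Sketch: convert MediaPipe Holistic landmarks (normalized to [0, 1]) into pixel
# coordinates for a frame of known size, e.g. 1280 x 720.
def to_pixel_coordinates(landmarks, frame_width, frame_height):
    """landmarks: iterable of objects with .x and .y in [0, 1] (a MediaPipe landmark list)."""
    return [(lm.x * frame_width, lm.y * frame_height) for lm in landmarks]

# Usage (hypothetical): pixels = to_pixel_coordinates(results.pose_landmarks.landmark, 1280, 720)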