2012-02-01 (Revised)

The purpose of this document is to specify a mechanism for embedding WebVTT in a WebM file.
WebVTT is a standard for subtitles, captions, and related timed metadata. A web media text track comprises a set of cues, each of which has an optional identifier, timestamp, optional settings, and the actual payload text. The cues are listed in a dedicated WebVTT file (having the .vtt file extension by convention) that is associated with a web media file using the src and kind attributes of the HTML5 track element.
WebM is a media standard for web video. Its container format is based on Matroska and there are separate tracks for video and audio. There is interest in embedding the contents of a WebVTT file inside a WebM file, so that the video text track does not have to be carried out-of-band, separate from the video itself.
Our goal is to embed the contents of a WebVTT file in a WebM file, in a way that preserves the information from each cue, and without too much disruption to the container standard.
A WebVTT cue is a set of lines of text comprising an optional identifier, a timestamp and optional settings, followed by the payload. The payload is the text of the subtitle or caption, chapter title, or metadata.
This format is actually very similar to the SubRip file format [SRT]. Matroska already supports embedded SRT subtitles as a track (see [MKVSRT]), by embedding just the SRT payload as the data portion of a block. However, this exact approach is not suitable for WebVTT, because a WebVTT cue carries information beyond the payload text (an optional cue identifier and optional cue settings) that payload-only embedding would discard.
The WebVTT cues are stored as the data portion of Block elements in the track, per the formatting described below. All WebVTT data stored within a WebM Block must be encoded as UTF-8. The timestamp of the WebM block and its duration are synthesized from the start and end times specified on the timestamp of the WebVTT cue. A BlockGroup element (not a SimpleBlock) must be used to contain the Block element, in order to also use a BlockDuration element, which is necessary to losslessly encode the original timestamp of the WebVTT cue.
If the WebVTT cue includes a WebVTT cue identifier then the WebVTT cue identifier is written to the WebM Block followed by a WebVTT line terminator. If the WebVTT cue does not have a WebVTT cue identifier then a WebVTT line terminator is written to the WebM Block. The empty line is used to distinguish that there was no WebVTT cue identifier in the original WebVTT cue.
The WebVTT cue timings are not written to the WebM Block. The start and end times of the WebVTT cue are synthesized from the start time and duration of the WebM Block.
If the WebVTT cue includes WebVTT cue settings then the WebVTT cue settings are written to the WebM Block followed by a WebVTT line terminator. If the WebVTT cue does not have WebVTT cue settings then a WebVTT line terminator is written to the WebM Block.
The cue payload is then written to the WebM Block.
Note that no WebVTT data is stored in the CodecPrivate element of the WebM Track header. All WebVTT cues are stored as Block elements for the track.
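A minimal sketch of the block formatting rules above (a hypothetical helper, not part of any WebM library; it works in plain milliseconds, whereas a real muxer must also map times onto the cluster timecode and TimecodeScale):

```python
def cue_to_block(identifier, settings, payload, start_ms, end_ms):
    """Encode one WebVTT cue as a WebM Block payload plus timing.

    Per the rules above: line 1 is the cue identifier (empty if absent),
    line 2 is the cue settings (empty if absent), then the payload text.
    The cue timings are NOT written into the payload; they become the
    block timestamp and the BlockDuration instead.
    """
    lines = [identifier or "", settings or "", payload]
    data = "\n".join(lines).encode("utf-8")  # WebM requires UTF-8
    block_timestamp = start_ms
    block_duration = end_ms - start_ms      # stored as BlockDuration
    return data, block_timestamp, block_duration

# A cue with both an identifier and settings:
data, ts, dur = cue_to_block("cue1", "align:start", "Hello", 15000, 17950)
```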
The timestamps for WebVTT cues can overlap in time. This is how roll-up captions work: multiple cues are rendered simultaneously, and when the top cue expires, the remaining cues move up and a new cue appears at the bottom. The WebM block timestamps must therefore be allowed to be monotonically increasing rather than strictly increasing (a relaxation already needed for the WebM container to support VP8 alt-ref frames), and the duration of a block must be allowed to overlap the start time of the next block.
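The relaxed ordering rule can be illustrated with a small check (an illustrative sketch; blocks are modeled as (timestamp, duration) pairs in milliseconds):

```python
def valid_block_timestamps(blocks):
    """Check the ordering rule for WebVTT blocks in a WebM track.

    Blocks are (timestamp, duration) pairs in stream order. Timestamps
    must be monotonically (non-strictly) increasing, but a block's
    duration MAY extend past the next block's start time; that is
    exactly how overlapping roll-up cues are represented.
    """
    return all(a[0] <= b[0] for a, b in zip(blocks, blocks[1:]))

# Two roll-up cues: the second starts before the first expires.
assert valid_block_timestamps([(0, 4000), (2000, 4000)])
```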
WebVTT chapter cues are used for navigation and so they are handled differently, because they must all be together and immediately available. For this reason, WebVTT chapter cues should not be embedded in the same way as ordinary timed cues (a representation that would vitiate their use for navigation); instead they should be converted to Matroska chapters (see [MKVCHAP]) and embedded that way. Matroska chapters are a superset of WebVTT chapter cues and therefore the conversion is lossless.
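Such a conversion might be sketched as follows (a hypothetical dict-based representation; a real muxer emits the corresponding EBML elements, and Matroska chapter times are expressed in nanoseconds):

```python
def chapter_cues_to_matroska(cues):
    """Map WebVTT chapter cues to Matroska ChapterAtom entries (sketch).

    Each cue is (start_ms, end_ms, title). Each becomes a ChapterAtom
    with ChapterTimeStart/ChapterTimeEnd in nanoseconds and a
    ChapterDisplay holding the chapter title.
    """
    atoms = []
    for start_ms, end_ms, title in cues:
        atoms.append({
            "ChapterTimeStart": start_ms * 1_000_000,  # ms -> ns
            "ChapterTimeEnd": end_ms * 1_000_000,
            "ChapterDisplay": {"ChapString": title, "ChapLanguage": "und"},
        })
    return atoms

atoms = chapter_cues_to_matroska([(0, 61000, "Introduction")])
```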
Representation of Block Payload
The simplest method for storing the WebVTT cue would be to embed the entire cue as the data portion of the Block element. However, we cannot do this, because it would then be impossible to change the block timestamps when the track is edited, as this would cause the timestamp of the cue (embedded as the block payload) to get out of sync with the timestamp of its enclosing block. We must satisfy the design constraint that video editors be allowed to manipulate WebM tracks without necessarily having intimate knowledge of the track’s codec (here, WebVTT).
For this reason, the start and end timestamps of the cue need to be stripped from the timestamp line when embedding the WebVTT cue in the WebM block. The arrow symbol remains (as it must, to determine whether a cue identifier is present), but there are no actual time values on the timestamp line.
This should not cause too much pain for muxers, because they must parse WebVTT cues anyway in order to synthesize the time and duration of the WebM block.
Demuxers will have to hand the timestamp and block payload off to some WebVTT-aware component in order for it to be "decoded", but this is no different from what any demuxer must do; the only difference is the codec.
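The demuxer-side synthesis can be sketched as follows (a hypothetical helper; it assumes the three-line block payload format described above and millisecond-resolution block times):

```python
def ms_to_vtt(ms):
    """Format a millisecond count as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, frac = divmod(rem, 1000)
    return "%02d:%02d:%02d.%03d" % (h, m, s, frac)

def block_to_cue(data, block_timestamp_ms, block_duration_ms):
    """Reconstitute a WebVTT cue from a WebM block (hypothetical demux step).

    The cue timings line is synthesized from the block timestamp and
    BlockDuration; the identifier and settings come from the payload.
    """
    identifier, settings, payload = data.decode("utf-8").split("\n", 2)
    timings = "%s --> %s" % (
        ms_to_vtt(block_timestamp_ms),
        ms_to_vtt(block_timestamp_ms + block_duration_ms))
    line = timings + " " + settings if settings else timings
    lines = [identifier] if identifier else []
    return "\n".join(lines + [line, payload])
```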
Placement of WebVTT Cues
The WebVTT file content could be stored in the CodecPrivate element of the track. This could be useful for situations that require all of the cues to be together, such as for chapter cues. This storage location could also be used for the file-wide metadata that precedes the actual WebVTT cues (see below).
However, putting any payload in the CodecPrivate area would make the payload very brittle, in the sense that it can break too easily during editing or remuxing.
SRT-style Embedding of Cue Payload
An alternative is for the muxer to fully parse each WebVTT cue, embed the payload the same as for SRT (the cue payload only as the data part of the Block element), and use the BlockAdditions element to store the cue identifier and cue settings.
The advantage of this approach is that the information associated with a cue would already be in binary form, so in principle this would make it simpler for parsers or other downstream clients that must also parse the WebVTT cues. (But then again, they might also prefer to use the text as is, so perhaps this is not much of an advantage.) There might be a storage penalty however, because Matroska elements do have a certain amount of overhead.
The disadvantage is that this ties WebM more closely to WebVTT, since any changes to the WebVTT standard would have to be matched with concomitant changes to the WebM standard; blob-style embedding avoids this.
Live Chapter Cues
Live WebVTT chapter cues are used to mark the place in the stream where a special event has occurred (e.g. this is a live presentation of a football match, and the striker has just scored a goal, and so a chapter cue is inserted in the stream).
This may be better handled with standardized temporal metadata that would add a chapter point that references time in the past. This way no latency would have to be added for the purpose of a human adding a live chapter cue.
It has been proposed that file-wide metadata (see [DEV] or [CHANGE]) be stored at the top of the WebVTT file, and formatted as UNIX-style name-value pairs, e.g.:

WEBVTT
Kind: Captions
Language: en-us

00:00:15.000 --> 00:00:17.950
...
File-wide metadata does not have a timestamp, so all the text (up to and excluding the linefeed separator that demarcates the file-wide metadata and the first cue) could be stored in the CodecPrivate sub-element of the Track element.
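This split can be sketched as follows (an illustrative helper; the "Kind"/"Language" names are hypothetical examples, and real WebVTT line terminators may be CRLF, CR, or LF rather than the bare LF assumed here):

```python
def split_file_wide_metadata(vtt_text):
    """Split a WebVTT file into its file-wide header and the cue body.

    Everything up to (and excluding) the blank-line separator before the
    first cue would be stored in the track's CodecPrivate element; the
    remainder is muxed as Block elements.
    """
    header, _, cues = vtt_text.partition("\n\n")
    return header, cues

header, cues = split_file_wide_metadata(
    "WEBVTT\nKind: Captions\nLanguage: en-us\n\n"
    "00:00:15.000 --> 00:00:17.950\nHello\n")
```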
Note that a metadata cue [META] is the same as other WebVTT cues, with the difference that the text has no particular interpretation, except as generic text.
Default Cue Settings
The cue settings are attached to the timestamp line, and it has been suggested (see [DEV] or [CHANGE]) that the syntax be modified to allow for default cue settings to be specified. The timestamp line retains the distinguished arrow symbol ("-->"), but the actual timestamps are omitted:
DEFAULTS --> D:vertical A:end
00:00.000 --> 00:02.000
This is vertical and end-aligned.
Block elements must have a timestamp value, so it’s not clear how this cue should be embedded in the WebM stream. One idea is to use the start time of the following cue, and embed the cue as a Block either within a BlockGroup that omits the BlockDuration element, or simply as a SimpleBlock element instead (possible here because no BlockDuration need be present).
Another idea is to embed the default settings cue in the same block as the following (normal) cue. This might complicate the demuxer, though, since there is no longer a one-to-one correspondence between a WebVTT cue and a WebM block.
Inline CSS and Comments could be handled similarly.
The presence of default cue settings implies that the state of the subtitle rendering system depends on everything that has come before (at least, what default cue settings have come before). One problem is that seeking into the middle of the stream will break the rendering, because the seek will skip over cues that potentially specify new default settings.
Assuming that seeking is desired, there are at least a couple of ways to handle this. One way is to simply not write any default settings cues into the WebM track. Instead, write the settings explicitly, in the normal way, on the same line as the timestamp. If a cue overrides the default, then that value would be preserved; otherwise, write the current default value.
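The first approach can be sketched as follows (a hypothetical helper; it assumes the space-separated "name:value" settings syntax shown in the DEFAULTS example earlier):

```python
def flatten_settings(default_settings, cue_settings):
    """Merge the current default cue settings into a cue's own settings.

    A setting specified on the cue itself wins; any default the cue does
    not override is written out explicitly, so no DEFAULTS cue needs to
    appear in the WebM track at all.
    """
    merged = dict(s.split(":", 1) for s in default_settings.split())
    merged.update(s.split(":", 1) for s in cue_settings.split())
    return " ".join("%s:%s" % kv for kv in merged.items())

# The cue overrides the default alignment but inherits the direction.
print(flatten_settings("D:vertical A:end", "A:start"))
```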
Another way is to write a default settings cue that is the union of all current defaults whenever you write a block that is the target of a WebM Cue (typically a video keyframe). A muxer will have to make similar arrangements anyway, to ensure that the WebVTT cues associated with a video frame are placed in the same cluster as the frame itself (similar to the rule we already have for audio).
In both cases, the muxer will have to be a WebVTT interpreter too, since each cue will need to be parsed to determine whether it is a default settings cue, and then the actual default settings will need to be parsed.
Yet another possibility is that the default settings cues could be embedded separately, in a different part of the file, say in the CodecPrivate area. During the demuxing phase, the original stream of cues could be reconstituted from the set of settings cues and the normal cues. This still means more work for the muxer, however, because it must parse enough of the cue to determine what kind of cue it is, and then parse the settings too. (No matter what, the muxer will need to parse the timestamp of the cue, in order to synthesize the timestamp of the block, so perhaps the extra parsing is not too much of a greater burden.)
WebVTT has explicit support for a metadata track, but it’s not clear exactly what WebVTT metadata looks like. Is it name/value pairs? In any event, WebM will probably have to standardize a few of the names, no matter how it is formatted.
There might also be interest in supporting XMP (see [XMP1CORE] and [XMP2PROP]), RDF (see for example [RDF]), JSON, or some other stylized form of metadata as the payload of a metadata cue. For example, it might be possible to embed, say, the GPS coordinates of the video track using XMP as follows:
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 5.1.2">
  ...
</x:xmpmeta>
(In this example, it’s not clear whether the markup needs to be escaped using the standard ampersand sequence.)
Of course if you’re only interest is in GPS data, then an XMP representation might have too much storage overhead, and a plain WebVTT metadata cue, using name/value pairs, might be adequate. Following a Vorbis comment example (see "Geo Location fields" in [OGGCOMM]), it would look something like:
See [GEO] for more information about GPS microformat.
[WEBVTT] WebVTT Living Standard (accessed 12 Jan 2012)
[MKVSRT] SRT Subtitles
[MKVCODECID] Matroska Codec Specs
[XMP1CORE] XMP Specification, Part 1, Data Model, Serialization, and Core Properties
[XMP2PROP] XMP Specification, Part 2, Additional Properties
[RDF] Embedding RDF in WebVTT