Red Hen corpus data format
Introduction
The Red Hen data format has certain drawbacks when used as a corpus, which the Red Hen corpus data format is meant to remedy. The corpus format follows a relatively standard and easy-to-process pattern called "vertical format", which many corpus managers use as their input format, e.g. CQP (the backend to CQPweb) or manatee (the backend to Sketch Engine and NoSketch Engine).
Related
- Overview of research (with dataset description)
- Red Hen data format
- Edge Search Engine Documentation (lists searchable tags)
- How to use the Edge2 search engine
- Current state of text tagging
Basic Concepts
There are two levels of representation in the Red Hen corpus data format:
- Token-level annotation, i.e. every word/punctuation mark has annotation, e.g. Part-of-Speech, lemma, etc. These are called p[ositional]-attributes in CQP.
- Annotation potentially spanning multiple words or not directly related to individual words, such as texts, sentences, pauses, gestures, etc. These are called s[tructural]-attributes in CQP.
Thus, in the following example, the lines containing the actual text have the p-attribute word in the first column, the p-attribute Part-of-Speech (in the Lancaster CLAWS5 tagset) in the second and the p-attribute lemma (i.e. the base form of the word) in the third column. The columns are not labeled in the file itself. The s-attributes in this snippet are text and s (s-unit, "a sentence-like division of a text", TEI). They can have multiple attributes (things like id, title, author, publisher) themselves.
<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s id="1">
The AT0 the
cat NN1 cat
sat VVD sit
on PRP on
the AT0 the
mat NN1 mat
. PUN .
</s>
<s id="2">
Its DPS its
name NN1 name
is VBZ be
Pi NP0 pi
. PUN .
</s>
</text>
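To make the two levels of representation concrete, the following is a minimal Python sketch of a reader for a file like the one above. It is not part of any Red Hen tooling; the function name parse_vertical is hypothetical, and the sketch assumes exactly three whitespace-separated columns per token line (vertical files are typically tab-separated) and that no token line starts with "<".

```python
import re

# Matches name="value" pairs inside a structural tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_vertical(lines):
    """Yield (text_attrs, sentence_id, tokens) for every <s> unit.

    Assumption: each token line has exactly three columns
    (word, part of speech, lemma) and never starts with '<'.
    """
    text_attrs = {}
    sent_id = None
    tokens = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<text"):
            # s-attribute "text": keep its attributes (id, title, ...)
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<s ") or line == "<s>":
            # s-attribute "s": start a new sentence
            sent_id = dict(ATTR_RE.findall(line)).get("id")
            tokens = []
        elif line == "</s>":
            yield text_attrs, sent_id, tokens
        elif line.startswith("<"):
            continue  # any other structural tag, e.g. </text>
        else:
            # p-attributes: one token per line
            word, pos, lemma = line.split()
            tokens.append((word, pos, lemma))

SAMPLE = """\
<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s id="1">
The AT0 the
cat NN1 cat
sat VVD sit
on PRP on
the AT0 the
mat NN1 mat
. PUN .
</s>
<s id="2">
Its DPS its
name NN1 name
is VBZ be
Pi NP0 pi
. PUN .
</s>
</text>
"""

sentences = list(parse_vertical(SAMPLE.splitlines()))
```

Each element of sentences pairs the attributes of the enclosing text with one sentence id and its list of (word, POS, lemma) triples.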
File-level metadata
The attributes within the <text> s-attribute correspond to the header information in the Red Hen data format. However, since every attribute can only have one value, fields with multiple values are distributed over multiple attributes (see the example of VID below).
[TODO: ADD EXAMPLE HERE]
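As a rough illustration of how a multi-valued header field could be distributed over several attributes, here is a minimal Python sketch. The naming scheme (a numeric suffix such as vid_1, vid_2), the function name text_tag, and the placeholder values are all assumptions made for illustration; the attribute names actually used in the corpus may differ.

```python
from xml.sax.saxutils import quoteattr

def text_tag(fields):
    """Build a <text ...> opening tag from header fields.

    `fields` maps a header field name to a list of its values.
    Assumption (hypothetical scheme): a field with several values
    becomes name_1, name_2, ... so each attribute has one value.
    """
    parts = []
    for name, values in fields.items():
        key = name.lower()
        if len(values) == 1:
            parts.append(f"{key}={quoteattr(values[0])}")
        else:
            for i, value in enumerate(values, start=1):
                parts.append(f"{key}_{i}={quoteattr(value)}")
    return "<text " + " ".join(parts) + ">"

# Placeholder values only; these are not real header contents.
tag = text_tag({"ID": ["file0001"], "VID": ["first value", "second value"]})
```

quoteattr adds the surrounding quotes and escapes characters that would otherwise break the tag.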
Sentence-level metadata
Example:
<s id="s__d3fd32de_e3e5_11e3_857a_001fc65c7848__1" starttime="20040301140315.450" reltime="195">
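The attributes of such a tag can be read with a few lines of Python. The sketch below assumes that starttime uses the same YYYYMMDDhhmmss.mmm layout as the timestamps in the caption files; the function name parse_s_tag is hypothetical, no time zone is attached because none is given here, and reltime is left as a plain string.

```python
import re
from datetime import datetime

# Matches name="value" pairs inside a structural tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_s_tag(tag):
    """Return the attributes of an <s ...> tag as a dict.

    Assumption: starttime is formatted as YYYYMMDDhhmmss.mmm.
    It is converted to a naive datetime; other values stay strings.
    """
    attrs = dict(ATTR_RE.findall(tag))
    if "starttime" in attrs:
        attrs["starttime"] = datetime.strptime(
            attrs["starttime"], "%Y%m%d%H%M%S.%f"
        )
    return attrs

attrs = parse_s_tag(
    '<s id="s__d3fd32de_e3e5_11e3_857a_001fc65c7848__1" '
    'starttime="20040301140315.450" reltime="195">'
)
```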
Further annotation with s-attributes
Story segmentation
[Detailed version may come in the future.]
There are various types of story segmentation in the archive, produced by different people on the basis of different cues. They are currently not consistent enough to be included in the corpus. Nonetheless, the sentence tagger uses the SEG tags to make sure that no sentence crosses a SEG boundary. This practice may have to be revised, since some of the SEG tags for commercials enclose only parts of sentences, as in the following example from the file 2008-05-05_2230_US_KCAL_Inside_Edition.txt.
20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT
20080505225946.000|20080505225946.714|CCO|
20080505225946.714|20080505225947.429|CCO|4:00.
20080505225947.429|20080505225948.143|CCO|
20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON
20080505225948.857|20080505225951.000|SEG_01|Type=Commercial
20080505225948.857|20080505225949.571|CCO|
20080505225949.571|20080505225950.286|CCO|www.kcal9.com.
20080505225950.286|20080505225951.000|CCO|
20080505225951.000|20080505225952.250|SEG_01|Type=Story start
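The problem in the example above can be made visible with a short Python sketch that lists the caption lines lying inside a commercial segment. It assumes the pipe-separated layout shown above (start, end, tag, content); the names CaptionLine, parse_line and captions_in_segments are hypothetical and not part of the sentence tagger.

```python
from typing import NamedTuple

class CaptionLine(NamedTuple):
    start: str
    end: str
    tag: str
    text: str

def parse_line(raw):
    """Split one line into start, end, tag and content."""
    start, end, tag, text = raw.rstrip("\n").split("|", 3)
    return CaptionLine(start, end, tag, text.strip())

def captions_in_segments(lines, seg_type="Type=Commercial"):
    """Return the non-empty caption texts inside segments of seg_type."""
    parsed = [parse_line(l) for l in lines if l.strip()]
    segs = [p for p in parsed
            if p.tag.startswith("SEG") and p.text == seg_type]
    hits = []
    for p in parsed:
        if p.tag.startswith("SEG") or not p.text:
            continue
        # Timestamps have a fixed width, so comparing them as
        # strings orders them chronologically.
        if any(s.start <= p.start and p.end <= s.end for s in segs):
            hits.append(p.text)
    return hits

EXAMPLE = [
    '20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT',
    "20080505225946.000|20080505225946.714|CCO|",
    "20080505225946.714|20080505225947.429|CCO|4:00.",
    "20080505225947.429|20080505225948.143|CCO|",
    "20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON",
    "20080505225948.857|20080505225951.000|SEG_01|Type=Commercial",
    "20080505225948.857|20080505225949.571|CCO|",
    "20080505225949.571|20080505225950.286|CCO|www.kcal9.com.",
    "20080505225950.286|20080505225951.000|CCO|",
    "20080505225951.000|20080505225952.250|SEG_01|Type=Story start",
]

inside = captions_in_segments(EXAMPLE)
```

Here only "www.kcal9.com." falls inside the commercial segment, while the beginning of the same sentence ("ALSO STREAMING LIVE ON") lies before it.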
For the moment, the various chevrons in the captions (such as >>) are each represented by a standalone tag, i.e. <storyboundary /> or <turnboundary />.
Speaker identification
It is quite common to identify the speaker in closed captions, as in the following example from 2012-05-05_0000_US_KNBC_Channel_4_News.txt:
20120505000037.000|20120505000041.000|CC1|>> I DON'T WANT YOU TO HURT
20120505000041.000|20120505000041.000|CC1|YOURSELF.
20120505000041.000|20120505000043.000|CC1|>> Reporter: FOUR MINUTES LATER,
20120505000043.000|20120505000046.000|CC1|PARAMEDICS ARRIVED.
Currently we detect a range of these speaker labels with a whitelist that includes, for example, "Reporter:" and "Translator:". For now, each detected label is stored in a stand-alone tag such as <speakeridentification speaker="Translator" />.
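Whitelist-based detection of this kind can be sketched in Python as follows. This is not the actual tagger code: the whitelist here contains only the two labels named above, the function name speaker_tag is hypothetical, and the sketch assumes that the label stands at the beginning of the caption text, optionally preceded by chevrons.

```python
import re

# Only the two labels mentioned in the text; the real list is longer.
SPEAKER_WHITELIST = {"Reporter", "Translator"}

# Optional chevrons, then a single word followed by a colon.
SPEAKER_RE = re.compile(r"^(?:>>+\s*)?([A-Za-z]+):")

def speaker_tag(caption):
    """Return a <speakeridentification ... /> tag, or None.

    Assumption: only labels on the whitelist count as speakers,
    so ordinary text containing a colon is not mistaken for one.
    """
    match = SPEAKER_RE.match(caption)
    if match and match.group(1) in SPEAKER_WHITELIST:
        return f'<speakeridentification speaker="{match.group(1)}" />'
    return None

tag_a = speaker_tag(">> Reporter: FOUR MINUTES LATER,")
tag_b = speaker_tag(">> I DON'T WANT YOU TO HURT")
```

The first call returns a tag for "Reporter"; the second returns None because no whitelisted label is present.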
Named Entity Recognition
[or should we do this as p-attributes, possibly similar to ditto tags?]