The Red Hen data format has certain drawbacks when it is used as a corpus; the Red Hen corpus data format is meant to remedy them. It follows a relatively standard and easy-to-process pattern called "vertical format", which many corpus managers accept as input, e.g. CQP (the backend to CQPweb) or manatee (the backend to Sketch Engine and NoSketch Engine).
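There are two levels of representation in the Red Hen corpus data format: positional attributes (p-attributes), which annotate each token and are written as tab-separated columns with one token per line, and structural attributes (s-attributes), which mark regions of text with XML-like tags on lines of their own.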
Thus, in the following example, the lines containing the actual text have the p-attribute word in the first column, the p-attribute part of speech (in the Lancaster CLAWS5 tagset) in the second column and the p-attribute lemma (i.e. the base form of the word) in the third column. The columns are not labeled in the file itself. The s-attributes in this snippet are text and s (s-unit, "a sentence-like division of a text", TEI). S-attributes can themselves carry attributes (things like id, title, author, publisher).
<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s id="1">
The	AT0	the
cat	NN1	cat
sat	VVD	sit
on	PRP	on
the	AT0	the
mat	NN1	mat
.	PUN	.
</s>
<s id="2">
Its	DPS	its
name	NN1	name
is	VBZ	be
Pi	NP0	pi
.	PUN	.
</s>
</text>

The attributes within the <text> s-attribute correspond to the header information in the Red Hen data format. However, since every attribute can only have one value, fields with multiple values are distributed over multiple attributes (see the example of VID below).
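To illustrate how easily the vertical format can be processed, here is a minimal Python sketch that reads the cat example above. It assumes tab-separated columns in the order word, part of speech, lemma, and tags that sit on lines of their own; the function name parse_vertical and the "regions" key are made up for this illustration and are not part of any Red Hen or CWB tool.

```python
import re

# Opening tag such as <s id="1">, or standalone tag such as <storyboundary />
OPEN_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*(/?)>$')
CLOSE_RE = re.compile(r'</(\w+)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines, p_attributes=("word", "pos", "lemma")):
    """Yield one dict per token with its p-attributes and enclosing s-attributes.

    The column order in p_attributes is an assumption taken from the example;
    the file itself does not label its columns.
    """
    open_regions = {}  # s-attribute name -> attributes of the currently open region
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        opening = OPEN_RE.match(line)
        if opening:
            if not opening.group(3):  # skip standalone tags, they enclose nothing
                open_regions[opening.group(1)] = dict(ATTR_RE.findall(opening.group(2)))
            continue
        closing = CLOSE_RE.match(line)
        if closing:
            open_regions.pop(closing.group(1), None)
            continue
        token = dict(zip(p_attributes, line.split("\t")))
        token["regions"] = {name: dict(attrs) for name, attrs in open_regions.items()}
        yield token


SAMPLE = """<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s id="1">
The\tAT0\tthe
cat\tNN1\tcat
sat\tVVD\tsit
on\tPRP\ton
the\tAT0\tthe
mat\tNN1\tmat
.\tPUN\t.
</s>
<s id="2">
Its\tDPS\tits
name\tNN1\tname
is\tVBZ\tbe
Pi\tNP0\tpi
.\tPUN\t.
</s>
</text>""".splitlines()

tokens = list(parse_vertical(SAMPLE))
for token in tokens:
    print(token["regions"]["s"]["id"], token["word"], token["pos"], token["lemma"])
```

Each token carries a copy of the attributes of the regions that enclose it, so header-like information from <text> stays available at token level without being repeated in the file.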
[TODO: ADD EXAMPLE HERE]
Example:
<s id="s__d3fd32de_e3e5_11e3_857a_001fc65c7848__1" starttime="20040301140315.450" reltime="195">

[Detailed version may come in the future.]
There are various types of story segmentation in the archive, produced by different people on the basis of different cues. They are currently not consistent enough to be included in the corpus. Nonetheless, the sentence tagger uses the SEG tags to make sure that no sentence crosses a SEG boundary. This practice may have to be revised in the future, given that some of the SEG tags for commercials enclose only parts of sentences, as in the following example from the file 2008-05-05_2230_US_KCAL_Inside_Edition.txt.
20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT
20080505225946.000|20080505225946.714|CCO|
20080505225946.714|20080505225947.429|CCO|4:00.
20080505225947.429|20080505225948.143|CCO|
20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON
20080505225948.857|20080505225951.000|SEG_01|Type=Commercial
20080505225948.857|20080505225949.571|CCO|
20080505225949.571|20080505225950.286|CCO|www.kcal9.com.
20080505225950.286|20080505225951.000|CCO|
20080505225951.000|20080505225952.250|SEG_01|Type=Story start

For the moment the various chevrons are indicated formally by a standalone tag, i.e. <storyboundary /> and <turnboundary />.
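The problem with SEG tags that cut into a sentence, as in the KCAL lines above, can be made visible with a small Python sketch. It assumes that every line consists of four pipe-separated fields (start time, end time, tag, content) and that timestamps have the form YYYYMMDDHHMMSS.fff, as in the example. The check itself is only a rough heuristic invented for this illustration, not the method used by the sentence tagger: a SEG line is flagged when the last non-empty caption text before it does not end in sentence-final punctuation.

```python
from datetime import datetime
from typing import NamedTuple

TIME_FORMAT = "%Y%m%d%H%M%S.%f"  # assumed from the timestamps in the example


class Line(NamedTuple):
    start: datetime
    end: datetime
    tag: str
    content: str


def parse_line(raw):
    """Split one pipe-separated line into start, end, tag and content."""
    start, end, tag, content = raw.rstrip("\n").split("|", 3)
    return Line(datetime.strptime(start, TIME_FORMAT),
                datetime.strptime(end, TIME_FORMAT), tag, content)


def mid_sentence_segments(lines):
    """Return SEG lines that follow caption text not ending in '.', '?' or '!'."""
    flagged = []
    last_text = ""
    for line in lines:
        if line.tag.startswith("SEG"):
            if last_text and not last_text.endswith((".", "?", "!")):
                flagged.append(line)
        elif line.content.strip():
            last_text = line.content.strip()
    return flagged


RAW = """20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT
20080505225946.000|20080505225946.714|CCO|
20080505225946.714|20080505225947.429|CCO|4:00.
20080505225947.429|20080505225948.143|CCO|
20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON
20080505225948.857|20080505225951.000|SEG_01|Type=Commercial
20080505225948.857|20080505225949.571|CCO|
20080505225949.571|20080505225950.286|CCO|www.kcal9.com.
20080505225950.286|20080505225951.000|CCO|
20080505225951.000|20080505225952.250|SEG_01|Type=Story start""".splitlines()

parsed = [parse_line(raw) for raw in RAW]
flagged = mid_sentence_segments(parsed)
for line in flagged:
    print(line.start.isoformat(), line.tag, line.content)
```

On these ten lines only the Type=Commercial boundary is flagged, because it follows "ALSO STREAMING LIVE ON", while the Type=Story start boundary follows the completed "www.kcal9.com.".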
It is quite common to identify the speaker in closed captions, as in the following example from 2012-05-05_0000_US_KNBC_Channel_4_News.txt:
20120505000037.000|20120505000041.000|CC1|>> I DON'T WANT YOU TO HURT
20120505000041.000|20120505000041.000|CC1|YOURSELF.
20120505000041.000|20120505000043.000|CC1|>> Reporter: FOUR MINUTES LATER,
20120505000043.000|20120505000046.000|CC1|PARAMEDICS ARRIVED.

Currently we detect a range of these with a whitelist, including "Reporter:" or "Translator:". These are stored in a stand-alone tag <speakeridentification speaker="Translator" /> etc. for now.
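A whitelist lookup of this kind could look roughly like the following Python sketch. The whitelist contains only the two labels named above; the regular expression, the function names detect_speaker and speaker_tag, and the decision to accept a label only at the start of a caption line are assumptions made for this illustration and do not describe the actual implementation.

```python
import re
from xml.sax.saxutils import quoteattr

# Only the two labels mentioned in the text; the real whitelist is longer.
SPEAKER_WHITELIST = {"Reporter", "Translator"}

# Optional chevrons, then a label terminated by a colon (assumed pattern).
SPEAKER_RE = re.compile(r"^\s*(?:>>+\s*)?([^:|]+):")


def detect_speaker(caption):
    """Return the whitelisted speaker label at the start of a caption, or None."""
    match = SPEAKER_RE.match(caption)
    if match and match.group(1).strip() in SPEAKER_WHITELIST:
        return match.group(1).strip()
    return None


def speaker_tag(speaker):
    """Build the stand-alone tag for a detected speaker label."""
    return "<speakeridentification speaker=%s />" % quoteattr(speaker)


captions = [
    ">> I DON'T WANT YOU TO HURT",
    "YOURSELF.",
    ">> Reporter: FOUR MINUTES LATER,",
    "PARAMEDICS ARRIVED.",
]
for caption in captions:
    speaker = detect_speaker(caption)
    if speaker is not None:
        print(speaker_tag(speaker))
```

Restricting matches to a whitelist keeps ordinary text that happens to contain a colon, such as a time like "4:00.", from being mistaken for a speaker label.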
[or should we do this as p-attributes, possibly similar to ditto tags?]