Red Hen Command-Line Tools

Red Hen uses the Linux operating system (Debian) and in some cases the Mac OS X shell with the standard GNU utilities from MacPorts.

While we work continually to expose more of the Red Hen dataset to the search engine interfaces, some data types and certain operations of search and analysis can be accessed only from the command line. Much of the cutting-edge work is done directly on the servers, and graduate students and researchers may have projects that benefit from command line access.

Navigation

The Cartago server keeps two versions of NewsScape -- a text-only tree and a full tree with video and images. For text only, you navigate using the command "dday":

dday 2013-05-03

dday 5 (for five days ago)

dday - 1 (for one day earlier -- spaces on both sides of the minus sign)

dday + 4 (for four days later)

dday 2015-04-01_1600_US_CNN_Legal_View_With_Ashleigh_Banfield (the day of a file)

Similarly, for the full tree with video and images, you navigate using the command "day":

day 2013-05-03

The letter l (lowercase L) is an alias for ls -Ll to list files.
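
For example, to go to a particular day in the text tree and list its CNN files (the file pattern is just an illustration):

dday 2013-05-03

l *CNN*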

Standard tools

On the command line, you can use GNU core utilities like grep, sed, awk, find, cut, xargs, and tr, together with regular expressions, to search the text. In the default bash shell, you also have some additional functionality, such as string manipulation. For instance, to collect all the Named Entities in the annotated CNN files for one day, issue

grep '|NER' *CNN*seg > ~/${PWD##*/}_CNN_NER.txt

The redirect symbol is '>' -- it sends the output to a file that you name. To indicate your home directory, use '~/'. To include the date of the files you are grepping, you can use the parameter expansion ${PWD##*/} -- try issuing this in any directory:

echo $PWD

echo ${PWD##*/}

The variable $PWD is a so-called environment variable, which contains information about your context -- in this case the full path of the present working directory, the same output you get if you issue

pwd

Then ${PWD##*/} chops off the path, leaving just the present directory name. You can also save the output -- say of all frames in a file -- to an identically named file with a new extension, as in this for loop:

for FIL in *CNN*seg ; do grep '|FRM' "$FIL" > ~/"${FIL%.*}".frames ; done

where ${FIL%.*} chops the extension off of $FIL.
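
A quick illustration of both kinds of chopping, using a hypothetical file name:

FIL=2015-04-01_1600_US_CNN_Newsroom.seg

echo ${FIL%.*} (prints 2015-04-01_1600_US_CNN_Newsroom)

echo ${FIL##*.} (prints seg)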

Video and image tools

  • clip creates a video clip from a given start and end time
  • peck-clip creates a video clip of each hit in a peck search result (see peck in the next section)
  • extract-frames extracts images at a requested interval from a video created by clip or peck-clip
  • gifmaker creates an animated gif from a video clip

Search tools

In addition to the core GNU utilities, Red Hen has developed the dedicated search utilities peck, peck-seg, peck-filter, peck-intersect, and peck-clip; a union operation is also available through standard Unix commands. These search utilities perform searches within the several forms of annotation present in the NewsScape corpus; see Current state of text tagging.

  • Peck searches files for regex patterns by primary tag. The output files (which we call seeds) present one hit, or seed, per line. Each seed is a sentence with metadata, written to a csv file with the pipe character | as delimiter. Accordingly, seeds can be imported into a statistical software package such as R.
  • Peck-seg searches for a regex pattern within a segment type, such as commercials.
  • Peck-intersect finds the intersection of different seeds produced by peck and/or peck-seg.
  • Peck-filter filters seeds by a second set of search criteria, and can be run before or after peck-intersect. It simply removes search results that don't satisfy the second set of criteria -- like passing your search through a strainer.
  • Peck-clip creates a video clip of each seed.
  • Union finds the union of different seeds, such as seeds that draw on different primary tags. Union is not a separate script but a recipe built from standard Unix commands: to find the union of seeds, concatenate and sort them with cat and sort. Filenames for news broadcasts in Red Hen follow a specific order, including date, time, country, name of show, etc. A line in a seed has the filename in its first field and the start-timestamp for the beginning of the utterance in its fourth (see the line format below). Since sort operates on the entire line, by default it sorts all lines in a seed first by filename; to sort within filename by timestamp, force the keys explicitly: cat <seed files> | sort -t '|' -k 1,1 -k 4,4 > <output file>. See the sketch after this list.
  • A line in a seed has the following form:
    • Filename | expression | hot-link to the moment in the broadcast when the expression is uttered, so one can see the full audiovisual presentation and performance | start-timestamp | end-timestamp | primary tag on which the search was conducted | output of the hit. Example below:
    • 2014-10-12_0000_US_KNBC_NBC_Nightly_News.seg|>> I THINK YOU GOT TO HAVE THE BAD DAYS SO YOU CAN LOVE THE GOOD DAYS EVEN MORE.|https://tvnews.sscnet.ucla.edu/edge/video,7625e7bc-51ab-11e4-b579-089e01ba0326,3338|20141012005538.967|20141012005543.038|POS_02|I/PRP|THINK/VBP|YOU/PRP|GOT/VBD|TO/TO|HAVE/VB|THE/DT|BAD/JJ|DAYS/NNS|SO/IN|YOU/PRP|CAN/MD|LOVE/VB|THE/DT|GOOD/JJ|DAYS/NNS|EVEN/RB|MORE./VBP|
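
A minimal sketch of the union recipe, assuming two seed files produced by earlier pecks (the file names are illustrative): the first sort removes exact duplicate lines, and the second orders the hits by filename and start-timestamp:

sort -u ~/Out1.csv ~/Out2.csv | sort -t '|' -k 1,1 -k 4,4 > ~/Union.csv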

For example, use peck to search for the construction "ProperNoun is the (title)(ProperNounString) of":

turnerstudents@cartago:~$ dday 730 ; for D in {730..3} ; do DAY=$(pwd | grep -o '..........$') ; peck seg '\/NNP\|is\/VBZ\|the\/DT(\|Mister|\|Mr.|\|Mrs.|\|Miss|\|Ms.|\|Lord|\|Lady|\|Professor|\|Prof.|\|Senator|\|President|\|Governor|\|King|\|Queen|\|General|\|Colonel|\|Pope)?(\/\w+)?(\|\w+\/NNP)+(\|the great|\|Junior|\|Senior|\|III|\|Ph.D.|\|PhD)?(\/\w+)?\|of\/' POS_02 ~/turner/NN_$DAY ; unset DAY; dday + 1 ; done

which produces a massive file of hits like this:

2013-04-04_0630_US_KOCE_Charlie_Rose.seg|>> HANKS IS THE BRUCE SPRINGSTEEN OF ACTORS.|https://tvnews.sscnet.ucla.edu/edge/video,707ede62-9cf9-11e2-a48c-001517add720,3253|20130404072413.000|20130404072416.000|POS_02|HANKS/NNP|IS/VBZ|THE/DT|BRUCE/NNP|SPRINGSTEEN/NNP|OF/IN|ACTORS./NNP|

Each command operates on a day's worth of files, typically around a hundred news shows. Let's say we start with two peck searches. First we look for instances of the word "time" in the frame annotations:

peck seg "time" FRM_01 ~/time-frames.csv

Instead of an open-ended search for the word time, we can also force just the frame TIME:

peck seg "FRM_01\|TIME\|" FRM_01 ~/TIME-frame.csv

or "TIME" as a semantic role:

peck seg "SRL\|TIME\|" FRM_01 ~/TIME-SRL.csv

or some relevant frame element:

peck seg "\|Measure_duration" FRM_01 ~/Measure_duration.csv

Then we look in the parts-of-speech annotations for a particular construction, say the indefinite article (a or an) followed by an adjective:

peck seg "an?\|[a-zA-Z]+/JJ" POS_01 ~/a-JJ.csv

Once we have these seeds, we can use peck-intersect to find temporal expressions in sentences that contain this particular construction:

peck-intersect ~/TIME-frame.csv ~/a-JJ.csv ~/a-JJ-TIME.csv

This file can be read into R for statistical analysis. For some purposes, we'd want to remove multiple annotations of the same caption line. Each line contains a link to the location in the video where the sentence was spoken; this allows us to do multimodal research on constructions.
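
One hedged way to remove those duplicates is to keep a single hit per caption line, using the filename (field 1) and start-timestamp (field 4) of the seed format as sort keys (the file names are illustrative); with -u, sort keeps one line per unique key combination:

sort -t '|' -u -k 1,1 -k 4,4 ~/a-JJ-TIME.csv > ~/a-JJ-TIME-unique.csv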

The very simple logical architecture of peck and peck-intersect is quite powerful. Peck-intersect can be used recursively, so we can build a complex search in multiple steps, starting with peck:

peck seg "search 1" FRM_01 ~/Out1.csv

peck seg "search 2" POS_01 ~/Out2.csv

peck seg "search 3" NER_03 ~/Out3.csv

and so on

and then get the intersection of all of them through pairwise recursion (peck-intersect takes exactly two input files at a time):

peck-intersect ~/Out1.csv ~/Out2.csv ~/Out12.csv

peck-intersect ~/Out3.csv ~/Out12.csv ~/Out123.csv

and so on

These tools are sufficiently powerful to allow us to research linguistic constructions and their multimodal dimensions in ways that have not been possible before.

peck

$ peck -h

* * * Red Hen Commandline Search Widget * * *

Search for annotations in the NewsScape corpus (2004-03-01 to present).

The corpus is annotated along multiple dimensions in different file types:

seg: Sentiment (SMT_01 and SMT_02), CLiPS MBSP (POS_01), Stanford Parts of Speech (POS_02), Named Entities (NER_03), and FrameNet (FRM_01).

You can also search within the unannotated caption text directly (CC).

The parts-of-speech annotations use the Penn Treebank II tag set; see:

  1. http://www.clips.ua.ac.be/pages/mbsp-tags,
  2. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
  3. https://www.cis.upenn.edu/~treebank/

ocr: On-screen text (OCR1)

tpt: Segment boundaries (SEG), Named Entities (NER_01), some others.

tag: Selectively hand-annotated Gestures (GES), Causal reasoning stages (CAU), and others; see How to use the online tagging interface.

The script searches the requested file type in the current directory.

To search files on a particular day, first go to that directory. Navigate like this:

tvnews 2014-07-22 or tvnews 5 (five days ago); tvnews + 3 or tvnews - 698 (relative dates).

Syntax (put the search phrase or regular expression inside double quotes):

peck <file name or type> <"regex search terms"> <primary tag> <output file> [clobber]

Examples (clobber to overwrite existing file -- note that peck is not case sensitive):

peck seg "time" FRM_01 ~/MyResults.csv clobber (any mention of time in a frame annotation)

peck ocr "Obama" OCR1 ~/MyOCRResults.csv (any mention of Obama in on-screen text)

For more search examples and advanced operations, see peck -h2.

$ peck -h2

* * * Red Hen Commandline Search Widget -- Advanced * * *

Search for annotations in the NewsScape corpus (2004-06-01 to present).

Syntax (put the search phrase or regular expression inside double quotes):

peck <file name or type> <"regex search terms"> <primary tag> <output file> [clobber]

Examples (clobber to overwrite existing file -- note that peck is not case sensitive):

peck seg "time" FRM_01 ~/MyResults.csv clobber (any mention of time in a frame annotation)

peck seg "FRM_01\|TIME\|" FRM_01 ~/MyResults.csv (only the frame name -- escape the pipe symbol |)

peck seg "SRL\|TIME\|" FRM_01 ~/MyResults.csv (only the semantic role)

peck seg "\|Measure_duration" FRM_01 ~/MyResults.csv (only a known frame element

peck seg "a\|[a-zA-Z]+/JJ" POS_01 ~/MyResults.csv (the indefinite article followed by an adjective

peck seg "\/NNP\|is\/VBZ\|the\/DT\|[a-zA-Z]+\/NNP\|[a-zA-Z]+\/NNP\|of\/IN\|" POS_02 ~/MyResults.csv

The output is a comma-separated values file that can be read straight into R:

MyR <- read.csv("~/MyResults.csv", sep="|", quote=NULL)

You can also combine searches in different tags (union of searches, with duplicates removed):

peck seg "tornado" POS_01 ~/Out1 ; peck seg "likelihood" FRM_01 ~/Out2 ; sort -u ~/Out1 ~/Out2 > ~/Out3

Get the intersection of two searches (can be applied recursively):

peck seg "tornado" POS_01 ~/Out1 ; peck seg "likelihood" FRM_01 ~/Out2 ; intersect ~/Out1 ~/Out2 ~/Out3

To search a series of days, walk through a number of days from some starting point and add up the results (union):

dday 365 ; for D in {365..1} ; do peck seg "as.*if.*PP" POS_01 ~/A ; cat ~/A >> ~/MyResults ; rm ~/A ; dday + 1 ; done

If you do not clobber, peck will rename the output file (OUTFIL), which also allows you to concatenate and intersect the results afterwards:

dday 23 ; for D in {23..2} ; do peck seg "as.*if.*PP" POS_01 ~/MyResults ; dday + 1 ; done

In conjunction with intersect, the script can be used to successively refine a construction.

The script will add the current directory name and a timestamp to avoid clobbering.

To create clips from the search results, use peck-clip.

To search within a certain type of segment only, such as Commercials, use peck-seg.

peck-seg

Search for text or constructions within a segment type, such as commercials.

$ peck-seg -h

* * * Red Hen Commandline Segment Search Widget * * *

Search for text or annotations within a segment type in the NewsScape corpus (2004-03-01 to present).

The corpus is annotated along multiple dimensions in different file types:

seg: Sentiment (SMT_01 and SMT_02), CLiPS MBSP (POS_01), Stanford Parts of Speech (POS_02), Named Entities (NER_03), and FrameNet (FRM_01).

You can also search within the unannotated caption text directly (CC).

The parts-of-speech annotations use the Penn Treebank II tag set; see http://www.clips.ua.ac.be/pages/mbsp-tags.

ocr: On-screen text (OCR1)

tpt: Segment boundaries (SEG), Named Entities (NER_01), some others.

tag: Selectively hand-annotated Gestures (GES), Causal reasoning stages (CAU), and others; see How to use the online tagging interface.

The script searches the requested file type in the current directory.

To search files on a particular day, first go to that directory. Navigate like this:

tvnews 2014-07-22 or tvnews 5 (five days ago); tvnews + 3 or tvnews - 698 (relative dates).

Syntax (put the search phrase or regular expression inside double quotes):

peck-seg <file type> <segment type> <"regex search terms"> <primary tag> <output file> [clobber]

Examples (clobber to overwrite existing file -- note that peck is not case sensitive):

peck-seg txt Commercial "calamari" CC ~/MyResults.csv clobber (any mention of calamari in a commercial)

If you do not clobber, peck-seg will rename the output file (OUTFIL), which also allows you to concatenate and intersect the results afterwards.

For example, go back to June 2007 and look for the iPhone calamari ads for the next 70 days:

dday 2007-06-01 ; for D in {1..70} ; do peck-seg txt Commercial "calamari" CC ~/peck-seg/calamari-ads.csv ; dday + 1 ; done

See also peck and intersect.

peck-filter

The peck-filter script takes an existing seed file -- the csv output of a peck or peck-seg search -- and applies a second set of search criteria. The output is a file of seeds that meet both sets of criteria -- a strict subset of the original search. Peck-filter can be run before or after peck-intersect, and allows you to define complex combinations of search conditions.

$ peck-filter -h

* * * Red Hen Commandline Search Filter * * *

Filter a peck search result by a second set of search criteria.

See peck -h and peck -h2 for basic instructions.

Syntax:

peck-filter <peck search result csv> <file type> <"regex search terms"> <primary tag> <output file> [clobber]

Examples (start with a peck search result and refine it):

peck-filter ~/MyResults1.csv seg "a\|[a-zA-Z]+/JJ" POS_01 ~/FilteredResults.csv

The script produces results that match both sets of criteria.
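
For instance, a two-step sketch: collect sentences mentioning tornadoes, then keep only those whose Named Entity annotation also mentions a location. The file names are illustrative, and the second pattern assumes the NER_03 annotations contain the label LOCATION:

peck seg "tornado" POS_01 ~/Tornado.csv

peck-filter ~/Tornado.csv seg "LOCATION" NER_03 ~/Tornado-location.csv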

See also peck, peck-clip, peck-intersect, and peck-seg.

peck-intersect

The intersect script operates on two peck results, and can be used recursively:

$ intersect -h

* * * Red Hen Commandline Search Intersection * * *

Generate the intersection of two peck search results.

Syntax:

intersect <input file #1> <input file #2> <output file> [clobber]

Examples (clobber to overwrite existing output file):

intersect ~/MyResults1.csv ~/MyResults2.csv ~/Intersection.csv

The script also handles iterative intersections.

We can now do repeated peck searches and combine the results (per-show OR), or intersect two searches (per-show AND). Peck-filter is a new mix-and-match module that handles sentence-level AND conditionality: it takes the timestamps from a peck result and looks for patterns only under each timestamp -- that is, within a single sentence.


The scripts peck, peck-seg, peck-filter, and peck-intersect are designed for automation and can be run incrementally. We could write scripts that call them repeatedly for different dates and create visualizations on the fly: the morning news in a new form.

Let's say we create two or more peck searches that run on every day of the corpus, and we use peck-intersect to locate some complex construction. We aggregate the results from every day into a single csv file. Then we set up a crontab that runs the same pecks with peck-intersect at 2am on incoming files. We add the output to this single csv file, pipe it to R, generate a graph, and post it online: instant construction updates, or even construction discovery. For instance, we could run a monitor for "because NOUN" and see when the media start using it. You could get an e-mail when it's spotted.
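
A minimal sketch of such a nightly job, with a crontab entry that runs a small wrapper script every day at 2am. The script path, the search pattern, and the output file are all hypothetical, and the sketch assumes dday can be called from a script:

0 2 * * * /home/me/scripts/nightly-peck.sh

#!/bin/bash
# nightly-peck.sh -- an untested sketch
dday 1                                          # go to yesterday's directory
peck seg "because\/IN\|\w+\/NN" POS_02 ~/A.csv  # hypothetical "because NOUN" pattern
cat ~/A.csv >> ~/because-NOUN.csv               # append the night's hits to the running aggregate
rm ~/A.csv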

We might even be able to create clickable graphs that allow people to access the underlying communicative act from the graph, as an access interface.

Using Regex to Locate a Pattern in the Tagged Data

The possibilities for tagging and search in Red Hen are unlimited. To begin to use the command-line tools for search, it is indispensable to

  1. Acquire a rudimentary familiarity with regular expressions (regex). There are many gentle introductions and tutorials, such as http://www.regular-expressions.info/tutorialcnt.html. Learning regex is a little like learning long division or how to factor a quadratic: it is for the most part easy, but it takes some study and practice. Basic work can be done in the Edge Search Engine, but advanced work requires regex.
  2. Acquire a rudimentary familiarity with the Red Hen tagging schemes. Study these pages: Current state of text tagging and http://www.clips.ua.ac.be/pages/mbsp-tags. When data in a file are tagged, the tagging is kept in its paired .seg file. Become familiar with the structure and contents of a typical .seg file. See Examples of .seg files.
  3. Begin to use regex to match tagging. Finding a simple alphanumeric string, e.g. "Napoleon," can be done directly in the Edge Search Engine. One can also use the Edge Search Engine for some basic boolean searches and some basic regex searches; see How to use the Edge search engine. But advanced work usually requires working from the command line to conduct a regex search on a .seg file (or an .ocr file if you want to search on-screen text). For example, one could use the Edge Search Engine to find the strings "a dog" or "the dog" or even EITHER "a dog" OR "the dog." But let us take a trivial example of a search for something other than words: to find examples of a determiner (a or an or the or these, etc.) followed by a singular noun, you would need to know that determiners are tagged DT and that singular nouns are tagged NN. You would also need to know that MBSP (POS_01) tags "a dog" as |a/DT/I-NP/O/a|dog/NN/I-NP/O/dog|, or, somewhat easier in this case, that the Stanford Part-of-Speech tagger (POS_02) tags "a dog" as |A/DT|DOG/NN|. So you would use regex to search for '\|[a-zA-Z]+\/DT\|[a-zA-Z]+\/NN\|', or the equivalent pattern '\|\w+\/DT\|\w+\/NN\|'. Using peck, the command would be peck seg '\|\w+\/DT\|\w+\/NN\|' POS_02 <output file>. This says: go pecking through the seg files in a directory for the following pattern -- a pipe (\|), a word (\w+), the determiner tag (\/DT), a pipe (\|), a word (\w+), the singular-noun tag (\/NN), and a pipe (\|); do this in lines whose primary tag is POS_02 (that is, lines tagged by the Stanford POS tagger); and put all the hits in the named output file. (By the way, please do not run this search as practice; it will produce very many hits.) It produces hits like the following example, which contains two stretches that match the pattern, AN/DT|SUV/NN and THE/DT|TRACKS?/NN; a grep equivalent follows the example:
2015-02-05_1200_US_KNBC_KNBC_Early_Today.seg|WHY WAS AN SUV STOPPED ON THE TRACKS?|https://tvnews.sscnet.ucla.edu/edge/video,ca43190e-ad32-11e4-ac58-089e01ba0326,22|20150205120022.388|20150205120025.190|POS_02|WHY/WRB|WAS/VBD|AN/DT|SUV/NN|STOPPED/VBD|ON/RP|THE/DT|TRACKS?/NN
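
To test the same pattern directly with GNU grep before running a full peck -- a rough approximation, since peck adds metadata and output handling -- first select the Stanford POS lines, then extract the matching stretches (head simply truncates the flood of hits):

grep -h '|POS_02|' *.seg | grep -Eo '\|\w+/DT\|\w+/NN\|' | head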

Note also that peck and our other command-line tools deliver character-separated value files (where the separator is a pipe |), so they can be imported directly into the statistical software package R for analysis and graphic presentation.

Further Reading