Red Hen uses the Linux operating system (Debian) and in some cases the Mac OS X shell with the standard GNU utilities from MacPorts.
While we work continually to expose more of the Red Hen dataset to the search engine interfaces, some data types and certain operations of search and analysis can be accessed only from the command line. Much of the cutting-edge work is done directly on the servers, and graduate students and researchers may have projects that benefit from command line access.
The Cartago server keeps two versions of NewsScape -- a text-only tree and a full tree with video and images. For text only, you navigate using the command "dday":
dday 2013-05-03
dday 5 (for five days ago)
dday - 1 (for one day earlier -- spaces on both sides of the minus sign)
dday + 4 (for four days later)
dday 2015-04-01_1600_US_CNN_Legal_View_With_Ashleigh_Banfield (the day of a file)
Similarly, for the full tree with video and images, you navigate using the command "sweep":
sweep 2013-05-03
The letter l (lowercase L) is an alias for ls -Ll to list files.
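Such an alias is typically defined in the shell startup file; a minimal sketch of the definition (the exact form on the server is an assumption):
alias l='ls -Ll'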
On the command line, you can use GNU core utilities such as grep, sed, awk, find, cut, xargs, and tr, together with regular expressions, to search the text. In the default bash shell, you also have some additional functionality, such as string manipulation. For instance, to examine all the Named Entities in the annotated CNN files for one day, issue
grep '|NER' *CNN*seg > ~/${PWD##*/}_CNN_NER.txt
The redirect symbol is '>' -- it sends the output to a file that you name. To indicate your home directory, use '~/'. To include the date of the files you are grepping, you can use the parameter expansion ${PWD##*/} -- try issuing this in any directory:
echo $PWD
echo ${PWD##*/}
The variable $PWD is a so-called environment variable, which contains local information about your context, in this case the name of the present working directory -- the same as if you issue
pwd
Then ${PWD##*/} chops off the path, leaving just the present directory name. You can also save the output -- say of all frames in a file -- to an identically named file with a new extension, as in this for loop:
for FIL in `ls -1 *CNN*seg` ; do grep '|FRM' $FIL > ~/${FIL%.*}.frames ; done
where ${FIL%.*} chops the extension off of $FIL.
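To see what this expansion does, try it on a single file name (the name below is just a hypothetical example):
FIL=2013-05-03_0100_US_CNN_Newsroom.seg
echo ${FIL%.*} (prints 2013-05-03_0100_US_CNN_Newsroom)
echo ~/${FIL%.*}.frames (the output path used in the loop above)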
In addition to the core GNU utilities, Red Hen has developed the dedicated search utilities peck, peck-seg, peck-filter, peck-intersect, and peck-clip. The union of searches can also be obtained with standard Unix commands (see the examples below). These search utilities operate on the several forms of annotation present in the NewsScape corpus; see Current state of text tagging.
For example, use peck to search for the construction "ProperNoun is the (title)(ProperNounString) of":
turnerstudents@cartago:~$dday 730 ; for D in {730..3} ; do DAY=$(pwd | grep -o '..........$') ; peck seg '\/NNP\|is\/VBZ\|the\/DT(\|Mister|\|Mr.|\|Mrs.|\|Miss|\|Ms.|\|Lord|\|Lady|\|Professor|\|Prof.|\|Senator|\|President|\|Governor|\|King|\|Queen|\|General|\|Colonel|\|Pope)?(\/\w+)?(\|\w+\/NNP)+(\|the great|\|Junior|\|Senior|\|III|\|Ph.D.|\|PhD)?(\/\w+)?\|of\/' POS_02 ~/turner/NN_$DAY ; unset DAY; dday + 1 ; done
which produces a massive file of hits like this:
2013-04-04_0630_US_KOCE_Charlie_Rose.seg|>> HANKS IS THE BRUCE SPRINGSTEEN OF ACTORS.|https://tvnews.sscnet.ucla.edu/edge/video,707ede62-9cf9-11e2-a48c-001517add720,3253|20130404072413.000|20130404072416.000|POS_02|HANKS/NNP|IS/VBZ|THE/DT|BRUCE/NNP|SPRINGSTEEN/NNP|OF/IN|ACTORS./NNP|
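To get a quick overview of such a result file -- say, the number of hits per show -- you can count the first pipe-separated field (the file name below is one of the per-day outputs produced by the loop; adjust it to your own):
cut -d'|' -f1 ~/turner/NN_2013-04-04 | sort | uniq -c | sort -rn | head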
Each command operates on a day's worth of files, typically around a hundred news shows. Let's say we start with two peck searches. First we look for instances of the word "time" in the frame annotations:
peck seg "time" FRM_01 ~/time-frames.csv
Instead of an open-ended search for the word time, we can also force just the frame TIME:
peck seg "FRM_01\|TIME\|" FRM_01 ~/TIME-frame.csv
or "TIME" as a semantic role:
peck seg "SRL\|TIME\|" FRM_01 ~/TIME-SRL.csv
or some relevant frame element:
peck seg "\|Measure_duration" FRM_01 ~/Measure_duration.csv
Then we look in the parts-of-speech annotations for a particular construction, say the indefinite article (a or an) followed by an adjective:
peck seg "an?\|[a-zA-Z]+/JJ" POS_01 ~/a-JJ.csv
Once we have these seeds, we can use intersect to find temporal expressions in sentences that contain this particular construction:
peck-intersect ~/TIME-frame.csv ~/a-JJ.csv ~/a-JJ-TIME.csv
This file can be read into R for statistical analysis. For some purposes, we would want to remove multiple annotations of the same caption line (see the sketch below). Each line contains a link to the location in the video where the sentence was spoken; this allows us to do multimodal research on constructions.
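One way to remove duplicate annotations of the same caption line is to keep a single line per caption, using the file name and start time as the key -- a sketch, assuming the field layout shown in the sample hit above (file name in field 1, start time in field 4):
sort -t'|' -u -k1,1 -k4,4 ~/a-JJ-TIME.csv > ~/a-JJ-TIME-unique.csv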
The very simple logical architecture of peck and intersect is quite powerful. Intersect can be used recursively, so we can build a complex search in multiple steps, starting with peck:
peck seg "search 1" FRM_01 ~/Out1.csv
peck seg "search 2" POS_01 ~/Out2.csv
peck seg "search 3" NER_03 ~/Out3.csv
and so on
and then get the intersection of all of them through pairwise recursion (peck-intersect takes only two input files at a time, so the calls are chained):
peck-intersect ~/Out1.csv ~/Out2.csv ~/Out12.csv
peck-intersect ~/Out3.csv ~/Out12.csv ~/Out123.csv
and so on
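If there are more than a handful of result files, the pairwise step can be wrapped in a loop -- a minimal sketch using the placeholder names above (assuming peck-intersect behaves as described in intersect -h):
cp ~/Out1.csv ~/Combined.csv
for F in ~/Out2.csv ~/Out3.csv ; do peck-intersect $F ~/Combined.csv ~/Combined.tmp ; mv ~/Combined.tmp ~/Combined.csv ; done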
These tools are sufficiently powerful to allow us to research linguistic constructions and their multimodal dimensions in ways that have not been possible before.
$ peck -h
* * * Red Hen Commandline Search Widget * * *
Search for annotations in the NewsScape corpus (2004-03-01 to present).
The corpus is annotated along multiple dimensions in different file types:
seg: Sentiment (SMT_01 and SMT_02), CLiPS MBSP (POS_01),
Stanford Parts of Speech (POS_02), Named Entities (NER_03), and FrameNet (FRM_01).
You can also search within the unannotated caption text directly (CC).
The parts-of-speech annotations use the Penn Treebank II tag set,
see http://www.clips.ua.ac.be/pages/mbsp-tags.
ocr: On-screen text (OCR1)
tpt: Segment boundaries (SEG), Named Entities (NER_01), some others.
tag: Selectively hand-annotated Gestures (GES), Causal reasoning stages (CAU), and others;
see How to use the online tagging interface
The script searches the requested file type in the current directory.
To search files on a particular day, first go to that directory. Navigate like this:
tvnews 2014-07-22 or tvnews 5 (five days ago); tvnews + 3 or tvnews - 698 (relative dates).
Syntax (put the search phrase or regular expression inside double quotes):
peck <file name or type> <"regex search terms"> <primary tag> <output file> [clobber]
Examples (clobber to overwrite existing file -- note that peck is not case sensitive):
peck seg "time" FRM_01 ~/MyResults.csv clobber (any mention of time in a frame annotation)
peck ocr "Obama" OCR1 ~/MyOCRResults.csv (any mention of Obama in on-screen text)
For more search examples and advanced operations, see peck -h2.
$ peck -h2
* * * Red Hen Commandline Search Widget -- Advanced * * *
Search for annotations in the NewsScape corpus (2004-03-01 to present).
Syntax (put the search phrase or regular expression inside double quotes):
peck <file name or type> <"regex search terms"> <primary tag> <output file> [clobber]
Examples (clobber to overwrite existing file -- note that peck is not case sensitive):
peck seg "time" FRM_01 ~/MyResults.csv clobber (any mention of time in a frame annotation)
peck seg "FRM_01\|TIME\|" FRM_01 ~/MyResults.csv (only the frame name -- escape the pipe symbol |)
peck seg "SRL\|TIME\|" FRM_01 ~/MyResults.csv (only the semantic role)
peck seg "\|Measure_duration" FRM_01 ~/MyResults.csv (only a known frame element
peck seg "a\|[a-zA-Z]+/JJ" POS_01 ~/MyResults.csv (the indefinite article followed by an adjective
peck seg "\/NNP\|is\/VBZ\|the\/DT\|[a-zA-Z]+\/NNP\|[a-zA-Z]+\/NNP\|of\/IN\|" POS_02 ~/MyResults.csv
The output is a comma-separated values file that can be read straight into R:
MyR <- read.csv("~/MyResults.csv", sep="|", quote=NULL)
You can also combine searches in different tags (union of searches, with duplicates removed):
peck seg "tornado" POS_01 ~/Out1 ; peck seg "likelihood" FRM_01 ~/Out2 ; sort -u ~/Out1 ~/Out2 > ~/Out3
Get the intersection of two searches (can be applied recursively):
peck seg "tornado" POS_01 ~/Out1 ; peck seg "likelihood" FRM_01 ~/Out2 ; intersect ~/Out1 ~/Out2 ~/Out3
To search a series of days, walk through a number of days from some starting point and add up the results (union):
dday 365 ; for D in {365..1} ; do peck seg "as.*if.*PP" POS_01 ~/A ; cat ~/A >> ~/MyResults ; rm ~/A ; dday + 1 ; done
If you do not clobber, peck will rename the output file (OUTFIL), which also allows you to concatenate the results and run intersect afterwards:
dday 23 ; for D in {23..2} ; do peck seg "as.*if.*PP" POS_01 ~/MyResults ; dday + 1 ; done
In conjunction with intersect, the script can be used to successively refine a construction.
The script will add the current directory name and a timestamp to avoid clobbering.
To create clips from the search results, use peck2clip.
To search within a certain type of segment only, such as Commercials, use peck-seg.
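The series-of-days loop shown in the help text above can also be turned into a simple per-day count, which is handy for plotting frequencies over time -- a sketch, with a placeholder search term and file names:
dday 30 ; for D in {30..1} ; do peck seg "tornado" POS_01 ~/A ; echo "$(basename $PWD)|$(cat ~/A 2>/dev/null | wc -l)" >> ~/tornado-per-day.csv ; rm -f ~/A ; dday + 1 ; done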
The peck-seg utility searches for text or constructions within a particular segment type, such as commercials.
$ peck-seg -h
* * * Red Hen Commandline Segment Search Widget * * *
Search for text or annotations within a segment type in the NewsScape corpus (2004-03-01 to present).
The corpus is annotated along multiple dimensions in different file types:
seg: Sentiment (SMT_01 and SMT_02), CLiPS MBSP (POS_01),
Stanford Parts of Speech (POS_02), Named Entities (NER_03), and FrameNet (FRM_01).
You can also search within the unannotated caption text directly (CC).
The parts-of-speech annotations use the Penn Treebank II tag set,
see http://www.clips.ua.ac.be/pages/mbsp-tags.
ocr: On-screen text (OCR1)
tpt: Segment boundaries (SEG), Named Entities (NER_01), some others.
tag: Selectively hand-annotated Gestures (GES), Causal reasoning stages (CAU), and others;
see How to use the online tagging interface
The script searches the requested file type in the current directory.
To search files on a particular day, first go to that directory. Navigate like this:
tvnews 2014-07-22 or tvnews 5 (five days ago); tvnews + 3 or tvnews - 698 (relative dates).
Syntax (put the search phrase or regular expression inside double quotes):
peck-seg <file type> <segment type> <"regex search terms"> <primary tag> <output file> [clobber]
Examples (clobber to overwrite existing file -- note that peck is not case sensitive):
peck-seg txt Commercial "calamari" CC ~/MyResults.csv clobber (any mention of calamari in a commercial)
If you do not clobber, peck will rename the output file (OUTFIL), which also allows you to concatenate the results and run intersect afterwards.
For example, go back to June 2007 and look for the iPhone calamari ads for the next 70 days:
dday 2007-06-01 ; for D in {1..70} ; do peck-seg txt Commercial "calamari" CC ~/peck-seg/calamari-ads.csv ; dday + 1 ; done
See also peck and intersect.
The peck-filter script takes an existing seed file -- the csv output of a peck or peck-seg search -- and adds a second set of search criteria. The output is a file of seeds that meet both criteria -- a strict subset of the original search. peck-filter can be run before or after peck-intersect, and allows you to define complex combinations of search conditions.
$ peck-filter -h
* * * Red Hen Commandline Search Filter * * *
Filter a peck search result by a second set of search criteria.
See peck -h and peck -h2 for basic instructions.
Syntax:
peck-filter <peck search result csv> <file type> <"regex search terms"> <primary tag> <output file> [clobber]
Examples (start with a peck search result and refine it):
peck-filter ~/MyResults1.csv seg "a\|[a-zA-Z]+/JJ" POS_01 ~/FilteredResults.csv
The script produces results that match both sets of criteria.
See also peck, peck-clip, peck-intersect, and peck-seg.
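For example, reusing the searches shown earlier, one could first collect the TIME frames for a day and then keep only the sentences that also contain an indefinite article followed by an adjective (a sketch; the output file names are placeholders):
peck seg "FRM_01\|TIME\|" FRM_01 ~/TIME-frame.csv
peck-filter ~/TIME-frame.csv seg "an?\|[a-zA-Z]+/JJ" POS_01 ~/TIME-a-JJ.csv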
peck-intersect
The intersect script operates on two peck search results and can be applied recursively:
$ intersect -h
* * * Red Hen Commandline Search Intersection * * *
Generate the intersection of two peck search results.
Syntax:
intersect <input file #1> <input file #2> <output file> [clobber]
Examples (clobber to overwrite existing output file):
intersect ~/MyResults1.csv ~/MyResults2.csv ~/Intersection.csv
The script also handles iterative intersections.
We can now do repeated peck searches and combine the results (per-show OR), or intersect two searches (per-show AND). Peck-filter is a new mix-and-match module that handles sentence-level AND conditionality -- it takes the timestamps from a peck result and looks for patterns only within those timestamps, which is to say, within the same sentence.
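In shorthand, the three combinations look like this (a sketch; the search terms and file names are placeholders):
peck seg "tornado" POS_01 ~/Out1.csv ; peck seg "warning" POS_01 ~/Out2.csv
sort -u ~/Out1.csv ~/Out2.csv > ~/Union.csv (per-show OR)
peck-intersect ~/Out1.csv ~/Out2.csv ~/Both.csv (per-show AND)
peck-filter ~/Out1.csv seg "warning" POS_01 ~/SameSentence.csv (sentence-level AND)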
The scripts peck, peck-seg, peck-filter, and peck-intersect are designed for automation and can be run incrementally. We could write scripts that call them repeatedly at different dates and create visualizations on the fly: the morning news in a new form.
Let's say we create two or more peck searches that run on every day of the corpus, and we use intersect to locate some complex construction. We aggregate the result from every day into a single csv file. Then we set up a crontab that runs the same pecks with intersect at 2am on incoming files. We add the output to this single csv file, pipe it to R, generate a graph, and post it online -- instant construction updates, or even construction discovery. For instance, we could run a monitor for "because NOUN" and see when the media start using it. You could get an e-mail when it's spotted.
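A hypothetical crontab entry for such a nightly run (the wrapper script and paths are placeholders, not existing Red Hen tools):
0 2 * * * $HOME/bin/run-daily-pecks.sh >> $HOME/logs/daily-pecks.log 2>&1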
We might even be able to create clickable graphs that allow people to access the underlying communicative act from the graph, as an access interface.
The possibilities for tagging and search in Red Hen are unlimited. To begin to use the command-line tools for search, it is indispensable to become familiar with the format of the results they return. A typical hit looks like this:
2015-02-05_1200_US_KNBC_KNBC_Early_Today.seg|WHY WAS AN SUV STOPPED ON THE TRACKS?|https://tvnews.sscnet.ucla.edu/edge/video,ca43190e-ad32-11e4-ac58-089e01ba0326,22|20150205120022.388|20150205120025.190|POS_02|WHY/WRB|WAS/VBD|AN/DT|SUV/NN|STOPPED/VBD|ON/RP|THE/DT|TRACKS?/NN
Note also that peck and our other command-line tools deliver character-separated value files (where the separator is a pipe |), so they can be imported directly into the statistical software package R for analysis and graphic presentation.
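To inspect the fields of any result file quickly on the command line, you can unfold one hit with tr (the file name is the placeholder from the help text):
head -1 ~/MyResults.csv | tr '|' '\n'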