Red Hen Edge Search Engine

https://tvnews.sscnet.ucla.edu/edge/


Contents

Introduction

Front Page- Basic Search

Advanced Search

Display Format

Search Boxes

Search

Export

Regular Expression Mode

Browse

Regular Expression

Search Results

Video

Text

Montage

Image Flow

Metadata

Permalink

Bookmark

Exporting

Job List


Introduction

Various sites capture TV news from around the world. Closed-captioning text in English and other languages is digitized, captured, and loaded into this search engine. The search engine can potentially also include on-screen text (harvested through optical character recognition), transcripts, and other aspects of the broadcast.

Front Page- Basic Search

A UCLA account is required to access this archive. (For researchers in the Distributed Little Red Hen Lab, one of the co-directors must approve your acquisition of a UCLA account and monitor your research.) Upon logging in, you will be taken to the main page. The basic search features can be accessed here. The search will show 10 programs/results per page.

You can also access Advanced Search and Browse from this page.

Advanced Search

The advanced search screen lets you control the display format and specify networks, series, and date ranges through menus.

Display Format

There are three different display formats available: list, table, and chart.

    • List: Gives you a list of the programs that meet your search requirements.
    • Table: Lists a month's worth of results by date, number of programs that meet the search requirement, number of hits in those programs, and total programs for that day.
    • Chart: Displays the data of search results in charts. You can choose from area chart, bar chart, column chart, line chart, stepped area chart, or table. Each chart will display one of the following: occurrence counts (cumulative/non-cumulative) or news program counts (cumulative/non-cumulative).

Search Boxes

    • with all the words/phrases: contains all words/phrases.
    • with at least one word/phrase: contains at least one of the words/phrases.
    • without the words/phrases: does not contain the words/phrases.
    • with all the words/phrases within...: contains all the words/phrases near each other (e.g., within the same segment, within 5 words, within 10 words, etc.)
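
The four search boxes can be read as simple filters over a caption segment. The following is a minimal sketch of that reading, not the search engine's actual implementation. The function names are hypothetical, and "within N words" is interpreted here as "the starting words of the phrases are at most N words apart", which is an assumption.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def find_phrase(tokens, phrase):
    """Return every index at which the phrase starts in the token list."""
    target = tokenize(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]

def near(tokens, phrases, within):
    """True if all phrases occur and their start positions are close enough.

    within=None stands for "within the same segment" (no distance limit).
    """
    if not phrases:
        return True
    hit_lists = [find_phrase(tokens, p) for p in phrases]
    if not all(hit_lists):
        return False
    if within is None:
        return True
    return any(max(c) - min(c) <= within for c in product(*hit_lists))

def segment_matches(segment, all_of=(), any_of=(), none_of=(),
                    near_all=(), within=None):
    """Apply the four search boxes to one caption segment."""
    tokens = tokenize(segment)

    def has(phrase):
        return bool(find_phrase(tokens, phrase))

    return (all(has(p) for p in all_of)
            and (not any_of or any(has(p) for p in any_of))
            and not any(has(p) for p in none_of)
            and near(tokens, near_all, within))

segment = "The senator said she would vote for the new budget bill on Friday"
print(segment_matches(segment, all_of=["senator", "budget bill"]))
print(segment_matches(segment, near_all=["senator", "budget"], within=5))
```

In this sketch, "senator" and "budget" are eight words apart, so the proximity test fails with within=5 and succeeds with within=10.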


Search

The search button will lead you to the results page.

Export

The export button leads you to the Job List page, where you can download the exported search and open it in another program on your computer. You can continue searching while the export job is running.

Searches are exported to CSV (comma-separated values) files, which can be read by spreadsheet programs such as Excel, Numbers, and OpenOffice. If prompted, select line 6 as the first row, UTF-8 as the character set, comma as the field delimiter, and double quote as the text delimiter.

In the spreadsheet application, you can convert the URLs in the CSV file into hyperlinks:

Numbers: triple-click on the link and press Control-K

Excel: see instructions for bulk conversion of URLs to hyperlinks

Click a hyperlink to call up that search result.
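
The import settings described above (rows starting at line 6, UTF-8, comma as field delimiter, double quote as text delimiter) can also be applied outside a spreadsheet. The following is a minimal Python sketch using the standard csv module. The sample data, its column layout, and the example.org URLs are invented for illustration; the actual contents of the first five lines of an export are not described here.

```python
import csv
import io

def read_export(stream, first_row_line=6):
    """Yield the data rows of an export, skipping the lines before the rows."""
    for _ in range(first_row_line - 1):
        next(stream, None)  # discard one preamble line
    yield from csv.reader(stream, delimiter=",", quotechar='"')

# Hypothetical export: five preamble lines, then two data rows.
SAMPLE = (
    "preamble line 1\n"
    "preamble line 2\n"
    "preamble line 3\n"
    "preamble line 4\n"
    "preamble line 5\n"
    'KNBC,"Today, weekend edition",https://example.org/result/1\n'
    'CNN,"Newsroom",https://example.org/result/2\n'
)

rows = list(read_export(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

For a downloaded file, the same function could be called on open(path, encoding="utf-8", newline="") in place of the in-memory sample.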

Regular Expression Mode

    • Fast: uses a faster, indexed version of the closed-caption files, with no punctuation. Fast is case-insensitive and additionally supports word placeholders (see Placeholders under Regular Expression).
    • Raw Text: uses the raw closed-caption files, including punctuation. It is case-sensitive.
    • Both modes use mostly the same syntax.
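
The practical difference between the two modes can be illustrated with a small sketch. It uses Python's re module as a stand-in for the search engine's own matcher, and the way the indexed text is normalized (punctuation removed, text lowercased) is an assumption based on the description above. The function names are hypothetical.

```python
import re

def fast_form(caption):
    """Approximate the indexed text: punctuation removed, lowercased."""
    no_punct = re.sub(r"[^\w\s]", "", caption)
    return " ".join(no_punct.lower().split())

def search_caption(pattern, caption, mode="fast"):
    """Search one caption in 'fast' (indexed) or 'raw' (as broadcast) mode."""
    if mode == "fast":
        # Case-insensitive, and punctuation is not available to match against.
        return re.search(pattern, fast_form(caption), re.IGNORECASE) is not None
    # Raw mode: case-sensitive, punctuation preserved.
    return re.search(pattern, caption) is not None

caption = "VOTE FOR OBAMA, he said."
print(search_caption("vote for obama", caption, "fast"))
print(search_caption("vote for obama", caption, "raw"))
print(search_caption("OBAMA,", caption, "raw"))
```

Under these assumptions, a lower-case pattern finds the upper-case caption only in Fast mode, while a pattern that includes the comma can only match in Raw Text mode.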

Test: opens a new tab to the test page, which allows you to test a regex pattern.

Browse

The browse function will take you to the image flow set to today's date.

Regular Expression

Regular Expressions Guide

You may test a regex pattern through the test link found in the Advanced Search page.

Basic Syntax

A regular expression pattern can contain subpatterns separated by space. Each subpattern matches a consecutive word, and the pattern as a whole matches a phrase.

A regular expression takes the form of /subpattern1 subpattern2 ... subpatternN/

Each subpattern can contain the following:

    String    Matches                                                                 Example
    .         any character (within a word)
    [a-z]     a single character from a to z
    |         any of the elements                                                     (he|she|it) matches any of the words "he", "she" or "it"
    ?         the preceding element occurring at most once                            (en)?large matches the words "large" or "enlarge"
    {m,n}     the preceding element occurring at least m times and at most n times    grea{2,4}t matches the words "greaat", "greaaat" or "greaaaat"
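
The word-by-word behavior described above can be tried out with a short sketch. It uses Python's re module, which is not the engine's own matcher but treats the constructs in the table (".", "[a-z]", "|", "?", "{m,n}") the same way for these examples. The function names are hypothetical.

```python
import re

# The examples from the table, written with the {m,n} repetition syntax.
EXAMPLES = {
    "(he|she|it)": ["he", "she", "it"],
    "(en)?large": ["large", "enlarge"],
    "grea{2,4}t": ["greaat", "greaaat", "greaaaat"],
}

def matches_word(subpattern, word):
    """A subpattern must match one whole word."""
    return re.fullmatch(subpattern, word) is not None

def matches_phrase(pattern, phrase):
    """Each space-separated subpattern must match the next consecutive word."""
    subpatterns = pattern.split()
    words = phrase.split()
    return (len(subpatterns) == len(words)
            and all(matches_word(s, w) for s, w in zip(subpatterns, words)))

for subpattern, words in EXAMPLES.items():
    print(subpattern, [matches_word(subpattern, w) for w in words])

print(matches_phrase("vote for (obama|romney)", "vote for romney"))
```

Note that "great" (a single "a") does not match grea{2,4}t, since the repetition requires at least two occurrences of the preceding element.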

The regular expression search uses the BRICS automaton package.

Regular expression syntax

Currently, please enter patterns in lower case.

Placeholders

Placeholders can be used as a whole subpattern, and in the place of a single or multiple words.

    Placeholder    Matches
    *              0 or more words
    *+             1 or more words
    *?             0 or 1 words
    *{n}           exactly n words
    *{m,}          m or more words
    *{,n}          up to n words
    *{m,n}         at least m words and up to n words

Please note that a placeholder must be used between non-placeholders. For example, the pattern /*+/ does not match anything. Currently, leading and trailing placeholders are discarded. For example, /*+ of the *+/ behaves the same as /of the/.
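
The placeholder rules can be sketched as a small word-level matcher. This is an illustrative reimplementation in Python, not the engine's code: subpatterns are checked with Python's re module, captions are split on whitespace, and all function names are hypothetical.

```python
import re

def placeholder_range(sub):
    """Return (min_words, max_words) for a placeholder, or None otherwise.

    max_words is None when there is no upper limit.
    """
    if sub == "*":
        return (0, None)
    if sub == "*+":
        return (1, None)
    if sub == "*?":
        return (0, 1)
    m = re.fullmatch(r"\*\{(\d*)(,?)(\d*)\}", sub)
    if m is None or not (m.group(1) or m.group(3)):
        return None
    low, comma, high = m.groups()
    if not comma:                      # *{n}
        return (int(low), int(low))
    return (int(low) if low else 0, int(high) if high else None)

def match_at(elements, words, i):
    """True if the elements match a run of words starting at index i."""
    if not elements:
        return True
    head, rest = elements[0], elements[1:]
    if isinstance(head, tuple):        # placeholder: try every allowed length
        low, high = head
        left = len(words) - i
        high = left if high is None else min(high, left)
        return any(match_at(rest, words, i + k) for k in range(low, high + 1))
    return (i < len(words)
            and re.fullmatch(head, words[i]) is not None
            and match_at(rest, words, i + 1))

def search(pattern, text):
    """Search text for a /sub1 sub2 .../ pattern with word placeholders."""
    subs = pattern.strip("/").split()
    elements = [placeholder_range(s) or s for s in subs]
    # Leading and trailing placeholders are discarded, as described above.
    while elements and isinstance(elements[0], tuple):
        elements.pop(0)
    while elements and isinstance(elements[-1], tuple):
        elements.pop()
    if not elements:
        return False                   # e.g. /*+/ matches nothing
    words = text.lower().split()
    return any(match_at(elements, words, i) for i in range(len(words)))

print(search("/vote *{1,2} obama/", "vote for barack obama"))
print(search("/*+/", "anything at all"))
```

With this sketch, /*+ of the *+/ and /of the/ give the same result on any text, because the leading and trailing placeholders are dropped before matching.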

More Examples

    Pattern                           Matches
    /[A-Za-z]{10}/                    words at least ten letters long
    /[0-9A-Za-z]{10}/                 words 10 alphanumeric characters long
    /[A-Za-z]{10,12} /                10 to 12 letters followed by a space
    /[A-Za-z]{10,12}(,|.)/            10-12 followed by a space, comma, or period
    /[A-Za-z-]{10,12}S/               including hyphens, and end in S
    /[A-Za-z-]{12,}(!|?|.|,|:|;)/     12+ and then space or punctuation
    /G[A-Za-z-]{10,12} /              G followed by 10-12 letters and then a space
    / G[A-Za-z-]{10,12} /             same, but G is the first letter
    / A G[A-Za-z-]{10,12} /           same, preceded by the indefinite article "A"
    / A G[A-Za-z-]{10,} /             same, but no maximum number of letters
    /IS NOW [A-Za-z-]{2,}ING /        "is now *ing"
    /* is the * of */                 "is the", followed by exactly one word, followed by "of"
    /has .*ed now/                    "has", followed by a word that ends with "ed", followed by "now"
    /vote for (obama|romney)/         "vote for", followed by either "obama" or "romney"

Search Results

Two times are listed for each program. The first is the local time of the broadcast. Although most programs are broadcast in California, some come from other countries, so the second time is given in UTC.
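
The relationship between the two listed times can be illustrated with a short sketch. The timestamp format is an assumption, and a fixed Pacific Standard Time offset (UTC-8) is used for simplicity, so the sketch ignores daylight saving time and broadcasts from other time zones.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset (Pacific Standard Time), no daylight saving.
PST = timezone(timedelta(hours=-8), "PST")

def to_utc(local_string, tz=PST):
    """Convert a local broadcast time (hypothetical format) to UTC."""
    local = datetime.strptime(local_string, "%Y-%m-%d %H:%M").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)

utc_time = to_utc("2012-11-06 18:00")
print(utc_time.isoformat())
```

An evening broadcast in California therefore carries a UTC time on the following calendar day.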

Video

Clicking on the video link or on the thumbnail will play the video in the player on the right. You can skip forward or backward by using the buttons in the video player.

Note: On a given search results page, when you click on a thumbnail to cue the video player you may notice a discrepancy between the thumbnail image and the frame that appears in the video player. There is a 10-second difference between the two. This is due to the closed captioning text, which always lags in timecode behind the actual video. This is not an error in the DCL's timecode structure. If you immediately click on the "skip ahead 10" link in the video player after loading the clip, you will see identical images.

Text

This will take you to the page with the transcription of the video.

Montage

Clicking on the montage link will lead you to thumbnails taken every 10 seconds of the show. Clicking on a thumbnail will start the video at that specific time of the show.

Metadata

Metadata is where you can find the closed captioning along with the corresponding time stamp. Clicking on the time will play the video at that specific moment in the show. You may also bookmark the video.

Permalink

The permalink takes you to a page you can use to link to the video. A bookmark made from the permalink plays the video from the beginning of the show.

Bookmark

The bookmark will bring you to a page with the video which will start at a specific timecode of the show.

To bookmark a video: Right-click (or Control-click on a Mac) the "permalink" link and select "Bookmark This Link." Type a reference for the bookmark in the "name" field, then select "OK." The reference will now appear under the Bookmarks menu. In browsers other than Firefox, the steps may differ slightly; refer to your browser's help section if necessary. This bookmark cues the video at the beginning of the clip. If you want a bookmark for a specific time in the video, use the second option.

Bookmark a video at a specific timecode: After playing the video and noting the desired timecode, click the paper icon located at the end of each caption preview. This opens a page containing only that particular video clip. In the URL field at the top of your browser, change the last set of numbers to your desired timecode. Note that you must convert the timecode into seconds, and the number must be a multiple of ten. Press Enter to load the page for that specific timecode. Then select Bookmark from the browser menu bar, select Bookmark this page, and fill in a reference in the "name" field.

Example:

Noted timecode is 15 minutes and 23 seconds into a given clip (923 total seconds).

After clicking on the paper icon, the following appears in the URL field: "http://dcl.sscnet.ucla.edu/search/video,20279,170".

Change the "170" to "920", since the timecode must be converted to seconds and be a multiple of ten. The URL would now read: http://dcl.sscnet.ucla.edu/search/video,20279,920

Select bookmark from the browser menu bar.

Select Bookmark this page.

Fill in a reference in the "name" field.
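
The conversion in the example above can be sketched in a few lines of Python. The function name is hypothetical; it rounds down to the nearest ten seconds, as in the example (923 becomes 920), and replaces the last comma-separated number in the clip URL.

```python
def bookmark_url(clip_url, minutes, seconds, step=10):
    """Return the clip URL with its last number set to the given timecode."""
    total = minutes * 60 + seconds           # 15 min 23 s -> 923 s
    rounded = total - total % step           # 923 -> 920
    base, _, _old_offset = clip_url.rpartition(",")
    return f"{base},{rounded}"

url = bookmark_url("http://dcl.sscnet.ucla.edu/search/video,20279,170", 15, 23)
print(url)
```

The printed URL matches the one in the worked example, ending in "20279,920".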

Exporting

There are two options for exporting: "export this page" or "export all pages." Both will lead you to the Job List page.

Export this page: Exports only the programs listed on that page.

Export all pages: Exports all of the programs found based on that search.

Upon completion of the export, you may download it and open the file using another program such as Excel. The export is text-only.

Job List

The job list shows the user's activity, so you can return to previous searches and download exports. To view the job list, click your name at the top right corner. For each job, it displays the type, start/end, query, status, message, and action.

Type:

    • Describes whether the job is a search or export.

Start/End:

    • Gives the time and date of the activity.

Query:

    • Clicking on the query will take you to the advanced search screen with the same entries pre-filled.

Status:

    • Finished: Job has been completed.
    • Running: Job is still in the process of completion. You can cancel the job by clicking the "cancel" link to the right.
    • Cancelled: Job has been cancelled.
    • Queued: Job has not started running yet but will run later.
    • Error: Job has been aborted because of an error.

Message:

    • Describes the progress or results of the activity.

Action:

    • Export jobs can be downloaded or deleted. Downloading an export lets you open the file on your computer using another program such as Excel.
    • Viewing a search will take you to the results page.

Missing Files

Files rejected by the Edge import script are listed in tvnews:/data/tna/edge/solr/invalid_files.txt.

We should monitor this file. There are three common failure types:

  • the import script encountered an unfamiliar header tag (solution: modify the import script)
  • the duration of the video is missing (solution: run fixDUR on the file)
  • the video is missing (solution: none, disregard)