Guardian News & Media
GNM RCS
Content print feed
Technical specification
Prepared by O3 Team Limited
Authors Nigel Robson
Creation date 04/10/2013
Document Ref. GNM_RCS_Content_Print_Feed_TS.docx
Version draft for review
1. Introduction
Purpose
The document GNM_RCS_Content_Processing_FS.docx is the functional specification that describes what business functions RCS supports in relation to processing published content.
This document is one of a set of technical specifications that provide details of how those functions are implemented in RCS.
Scope
This document focusses on the recording of print content in RCS. Separate documents deal with all other aspects of content processing in RCS, including web content processing, matching, specials, and AV products.
This document is intended as a high-level technical document outlining how the relevant business functions are implemented in terms of software modules.
Importantly, this document does not aim to provide the level of detail that would be required in a programming specification, in areas such as program structure, detailed business rules, data integrity, validation, locking considerations, data security, calls to/from other software modules, performance considerations, and so forth.
For details of program logic and coding, the reader should refer to the program files themselves.
2. Text archive
The text archive schema and the RCS schema exist in the same database.
The text archive stores, in a relational database format, a record of every item published in GNM print media, together with content from other national newspaper publishers.
The overall data structure, somewhat simplified, is as follows:
Focussing on the GNM publications only, the page data identifies both the Book in which the content appears (e.g. G1 or G2) and the editorial department (e.g. “spo” for a Sports page within G1).
The articles include both stories and images: the latter including photographs, graphics, and cartoons.
The elements are the components of the items on the pages, e.g. the by-line, standfirst, first paragraph, emboldened word, and so on. Any change of style on the printed page is identified by a separate element. Where an article runs over multiple pages, it is associated with the first page on which it appears.
Importantly the above structure is not suitable for web content, where the concept of editions does not exist. Web content is processed entirely separately.
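As an illustration only, the page, article, and element levels described above could be modelled as follows. This is a minimal sketch, not the actual schema: the class names, field names, and the "story"/"image" values are assumptions made for the example, and the levels above the page (publication and edition) are omitted because this document does not detail them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """One styled component of an article, e.g. the by-line or standfirst."""
    kind: str   # hypothetical label such as "byline" or "standfirst"
    text: str


@dataclass
class Article:
    """A story or an image (photograph, graphic, or cartoon)."""
    article_type: str                       # "story" or "image" (assumed values)
    elements: List[Element] = field(default_factory=list)


@dataclass
class Page:
    """A page, identified by its Book and editorial department."""
    book: str          # e.g. "G1" or "G2"
    department: str    # e.g. "spo" for a Sports page
    number: int
    articles: List[Article] = field(default_factory=list)


def first_page(page_numbers: List[int]) -> int:
    """An article spanning several pages is attached to the first of them.

    Assumes the first page is the lowest page number.
    """
    return min(page_numbers)


# Example: a Sports page in G1 carrying one story with a by-line element.
page = Page(book="G1", department="spo", number=40)
page.articles.append(Article("story", [Element("byline", "A Writer")]))
```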
The data stored in the Text library is extracted from the mark-up language embedded in the PDF pages prepared for the print sites. Perl scripts parse this mark-up and write it to the database via a PL/SQL API.
3. Extraction of new content
RCS extracts (i.e. copies) metadata from the Text library into RCS. RCS thereby maintains a list of published content, but not the content itself.
Deduplication
As new content is stored in the Text archive, it is flagged as being ready to be extracted by RCS. If the same story or picture is published in a subsequent page version, the previous version is marked as superseded. RCS will only ever extract a single instance of each item published on a given day in a particular publication.
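The superseding rule can be sketched as below. This is an illustrative sketch under stated assumptions, not the archive's actual logic: the `ArchivedItem` fields, the `item_key` identity, and the use of an integer page version are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ArchivedItem:
    item_key: str        # hypothetical identity of the story or picture
    publication: str
    pub_date: str        # ISO date string, e.g. "2013-10-04"
    page_version: int    # assumed to increase with each new page version
    superseded: bool = False


def mark_superseded(items: List[ArchivedItem]) -> List[ArchivedItem]:
    """Flag all but the latest page version of each item.

    Returns the items that remain ready for extraction: one per
    (item, publication, publication date).
    """
    latest: Dict[Tuple[str, str, str], ArchivedItem] = {}
    for item in items:
        key = (item.item_key, item.publication, item.pub_date)
        current = latest.get(key)
        if current is None:
            latest[key] = item
        elif item.page_version > current.page_version:
            current.superseded = True   # an earlier version is replaced
            latest[key] = item
        else:
            item.superseded = True      # an older or repeated version arrived
    return list(latest.values())


# Example: the same story on two page versions, plus a different publication.
a = ArchivedItem("story-1", "Guardian", "2013-10-04", 1)
b = ArchivedItem("story-1", "Guardian", "2013-10-04", 2)
c = ArchivedItem("story-1", "Observer", "2013-10-04", 1)
ready = mark_superseded([a, b, c])
```

Here `a` ends up superseded by `b`, while `c` is kept because it belongs to a different publication.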
Database jobs
RCS has a number of database jobs (calling mtrl_extract.extract_dates_from_archive) that run in the background, looking for content to extract from the Text library (TLIB) into the RCS database. These jobs can run for a specified date range, or for a particular format of content.
Manual extraction of content
There is also a menu option in RCS that the RCS administrator can use to force a date range to be extracted (again): Content → Extract print content
This menu option opens an Oracle Form named rcs_extact_010_pc.fmb.
Manual entry of content
Very rarely it is necessary for new content to be created manually – for example where the process that loads a page into RCS has combined two articles into one or multiple images into one. If the printed items need to be processed separately then RCS needs to have separate items of content.
The entry of new content is done in the Matching history screen. This screen is the subject of a separate document in this documentation set.
Past and future content
The extraction process can retrieve past content and also content due to be published in the future, provided the pages have already been prepared and finalised, e.g. sections of the next day’s paper that have been completed.
This extraction process only copies forthcoming image data into RCS: text is ignored until the library (Research & Information department) have processed the content, which they always do on the day of publication or soon thereafter.
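The rule for images versus text can be expressed as a simple predicate. This is a sketch only: the `Candidate` fields and the `library_processed` flag are hypothetical names standing in for however the Text library actually records these states.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    content_type: str         # "image" or "text" (assumed values)
    page_finalised: bool      # the page has been prepared and finalised
    library_processed: bool   # hypothetical flag set once the library has processed the item


def should_extract(candidate: Candidate) -> bool:
    """Decide whether an item may be copied into RCS."""
    if not candidate.page_finalised:
        return False                    # unfinished pages are never extracted
    if candidate.content_type == "image":
        return True                     # images are copied, including forthcoming ones
    return candidate.library_processed  # text waits for the library team
```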
Uniqueness
Each item of content is represented by a single record in the MATERIAL table in RCS, and is assigned a unique ID from the sequence MTRL_ID.
The Text Library unique (root) element ID is also stored against this record, and this ensures only one copy of the content is held within RCS. NB for manually entered content there is no Element ID and so to maintain uniqueness the negative of the record ID is populated in the ELEM_ID column.
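The ELEM_ID convention for manually entered content can be illustrated as follows. This sketch assumes that Text Library element IDs are always positive, so that a negated record ID can never collide with one; the counter is a stand-in for the MTRL_ID sequence, and the "ID" key is a hypothetical name for the record ID column.

```python
from itertools import count
from typing import Dict, Optional

_mtrl_id_sequence = count(1)  # stand-in for the MTRL_ID database sequence


def new_material(elem_id: Optional[int]) -> Dict[str, int]:
    """Build a MATERIAL-like record.

    Extracted content keeps its Text Library element ID. Manually entered
    content has none, so the negative of the record ID is stored instead.
    """
    record_id = next(_mtrl_id_sequence)
    stored_elem_id = elem_id if elem_id is not None else -record_id
    return {"ID": record_id, "ELEM_ID": stored_elem_id}


extracted = new_material(98765)   # item copied from the Text Library
manual = new_material(None)       # item keyed in by hand
```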
If two instances of the same item are extracted from the Text Library, each with a different element ID, then RCS will perform checks to prevent duplication. For example, if two pictures have the same department, page, name, and position, then the images are assumed to be the same.
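The picture example above amounts to comparing a composite key. The following is a minimal sketch of that idea, assuming hypothetical field names; the real checks in RCS may compare more attributes than these four.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Picture(NamedTuple):
    elem_id: int
    department: str
    page: int
    name: str
    position: str


def drop_duplicates(pictures: Iterable[Picture]) -> List[Picture]:
    """Keep the first picture seen for each (department, page, name, position)."""
    seen: Set[Tuple[str, int, str, str]] = set()
    unique: List[Picture] = []
    for picture in pictures:
        key = (picture.department, picture.page, picture.name, picture.position)
        if key not in seen:
            seen.add(key)
            unique.append(picture)
    return unique


# Two element IDs describing the same picture, plus a genuinely different one.
pictures = [
    Picture(101, "spo", 40, "goal.jpg", "top"),
    Picture(202, "spo", 40, "goal.jpg", "top"),
    Picture(303, "spo", 41, "goal.jpg", "top"),
]
kept = drop_duplicates(pictures)
```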
Image sizes
If more than one instance of a picture has appeared at different sizes in a day’s paper, RCS will identify the largest published size, as this is the size that should be paid for if the fee is based on space rates.
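A sketch of choosing the largest published size is given below. It assumes that "largest" means the greatest printed area and that sizes are held as width and height in millimetres; both are assumptions, as this document does not say how size is measured.

```python
from typing import List, NamedTuple


class Usage(NamedTuple):
    picture_ref: str    # hypothetical reference identifying the picture
    width_mm: float
    height_mm: float


def largest_usage(usages: List[Usage]) -> Usage:
    """Return the usage with the greatest printed area."""
    return max(usages, key=lambda usage: usage.width_mm * usage.height_mm)


usages = [
    Usage("pic-1", 50, 40),
    Usage("pic-1", 120, 80),
    Usage("pic-1", 60, 60),
]
payable = largest_usage(usages)
```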
Identifying contributors
Text content often appears without a valid contributor name, and supplying it is one of the jobs the library team fulfils. To help speed this up, a process exists in the Text library that tries to identify the contributor name from the individual elements that make up the story.
Automated processing
As soon as an item of content arrives in RCS, and whenever key data changes on unprocessed content, a rights profile is assessed and an attempt is made to process the item automatically, based on rules held within the system. Both of these processes are described in separate documentation.
4. Sibling content
When the same item of content is published in more than one publication it is important that RCS is able to associate the two. These instances are referred to as siblings.
Linking the two items ensures that only one payment is made. It also means that when one item has been processed, all sibling items can be treated in the same way and removed from the unprocessed queue.
Text bundleids
Text items have a unique identifier, known as a bundleid, which is assigned to the item during the production process. Wherever the item is published the bundleid should stay with it, and so an association can be made when multiple instances of the same item are published.
Picture PicDar URNs
Pictures should be stored in the PicDar picture library prior to being put on print or web pages. Once in PicDar each picture is given a unique reference. Provided the reference stays with the picture wherever it is published it is possible for RCS to identify multiple uses of the same image e.g. appearing in print, being used as a thumbnail on the website, and also being used as an in-article picture on the website.
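Sibling linking for both text and pictures comes down to grouping items by a reference that travels with the content. The sketch below assumes a single hypothetical `shared_ref` field holding the bundleid for text or the PicDar reference for pictures; how RCS actually stores these is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Material(NamedTuple):
    mtrl_id: int
    publication: str
    shared_ref: Optional[str]   # bundleid (text) or PicDar reference (pictures)


def group_siblings(items: Iterable[Material]) -> Dict[str, List[int]]:
    """Map each shared reference to the IDs of the items that carry it.

    Items without a reference cannot be linked, and references used by
    only one item have no siblings, so both are left out.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for item in items:
        if item.shared_ref:
            groups[item.shared_ref].append(item.mtrl_id)
    return {ref: ids for ref, ids in groups.items() if len(ids) > 1}


items = [
    Material(1, "Guardian", "bundle-77"),
    Material(2, "Observer", "bundle-77"),
    Material(3, "Guardian", "bundle-88"),
    Material(4, "Guardian", None),
]
siblings = group_siblings(items)
```

Once one member of a group has been processed, the remaining IDs in the same group are the ones that could be treated in the same way.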
End of Document