GNM_RCS_Matching_TS

Guardian News & Media

GNM RCS

Matching

Technical specification

Prepared by O3 Team Limited

Authors Nigel Robson

Creation date 11/10/2013

Document Ref. GNM_RCS_Matching_TS.docx

Version draft for review

1. Introduction
  1. Purpose

The document GNM_RCS_Content_Processing_FS.docx is the functional specification that describes what business functions RCS supports in relation to processing published content.

This document is one of a set of technical specifications that provide details of how those functions are implemented in RCS.

1. Scope

This document focusses on the processing that is specific to Matching content in RCS. Separate documents deal with all other aspects of content processing in RCS, including print & web content processing, specials, and AV products.

This document is intended as a high-level technical document outlining how the relevant business functions are implemented in terms of software modules.

Importantly, this document does not aim to provide the level of detail that would be required in a programming specification in areas such as program structure, detailed business rules, data integrity, validation, locking considerations, data security, calls to/from other software modules, performance considerations, and so forth.

For details of program logic and coding, the reader should refer to the program files themselves.

1. Matching overview

The term “matching”, as used in RCS and by the Rights department, means processing content in any of the following three ways:

1. Matching content to a supplier agreement (commission or contract);
2. Marking content as being staff produced content i.e. GNM copyright; and
3. Disregarding content as being not relevant in terms of rights or payment processing.

“Unmatched” content is really unprocessed content.

Many hundreds of items of content are published every day in both print and web format, and all of this has to be processed. Much of this content is published in more than one place i.e. print content appears on the web, and vice versa; whilst some is original and only published in one place.

RCS processes as much of this content as it can using rule-based automated processes. These processes can disregard content, staff it, or match it to subscription contracts.

Where an item of content is clearly the same as an item already published in another GNM publication, RCS replicates the processing performed on the original item. The only exception is where the same picture is re-published at an unrelated time – this may require a separate payment for an additional usage.

All content that remains unprocessed is kept in a queue to be reprocessed in batch and/or to be manually processed by Desk administrators in either of the two matching screens. In these screens administrative users see lists of content they are responsible for: they then need to make the appropriate decisions about the content: matching it, disregarding it, or staffing it.

1. Automated matching

RCS attempts to match content in two ways:

- As soon as it first arrives in RCS; and
- In batch jobs that process large amounts of unmatched content.

Content is processed immediately upon arrival in RCS so that it can be removed straight away from the queue of content presented to desk administrators, and so that an accurate rights profile is determined as soon as possible.

The subsequent batch processes remain important as they may make matches that were previously not possible e.g. if a contract needed to be renewed, or approved, or a matching rule on a contract was wrong, or a disregard rule needed to be set up.

The automated matching process consists of many processes, and these are often changed, and sometimes reordered as the nature of the content changes. The list below gives some idea of the processes that are currently applied to each item of content as it is processed. However the software itself should be examined to see the current process at any time.

The packaged procedure that processes each item, whether called from a batch job or a trigger on the MATERIAL table, is called content_processing.match_item.
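The idea of an ordered, reorderable series of checks can be pictured with a minimal Python sketch. This is not the content_processing.match_item code: the two rules, the dictionary keys and the outcome strings are hypothetical, chosen only to show how the first decisive check ends processing for an item.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for one row of the MATERIAL table.
Item = Dict[str, object]
# A check returns an outcome ("staffed", "disregarded", "matched", "skip")
# or None when it has nothing to say about the item.
Check = Callable[[Item], Optional[str]]


def staff_liveblog(item: Item) -> Optional[str]:
    # Illustrative rule only: staff LiveBlogs not written by a freelance.
    if item.get("type") == "liveblog" and not item.get("freelance"):
        return "staffed"
    return None


def skip_co_authored(item: Item) -> Optional[str]:
    # Several contributor tags suggest co-authorship: leave for manual work.
    if len(item.get("contributor_tags", [])) > 1:
        return "skip"
    return None


# The order of this list is the order of processing, so reordering the
# process means reordering the list.
CHECKS: List[Check] = [staff_liveblog, skip_co_authored]


def match_item(item: Item) -> str:
    """Run each check in turn and stop at the first decisive outcome."""
    for check in CHECKS:
        outcome = check(item)
        if outcome is not None:
            return "unmatched" if outcome == "skip" else outcome
    return "unmatched"
```

An item that no check can decide stays "unmatched", which corresponds to it remaining in the queue for batch or manual processing.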

NB Automated matching to commissions is not attempted. This is because it is too prone to error, for example:

- the contributor may have more than one unfulfilled commission;
- the contributor’s commission may not have been created on the system prior to publication;
- the relevant commission may have been matched already, but have been created for multiple items of content.

The processing as it currently stands is as follows:

1. Phase 1: Check for some obvious reasons to staff content
  1. Staff 'GC News' (Kable staff content)
  2. Staff birthday stories since 1st April 2013
  3. Staff LiveBlogs if the contributor name is not that of a freelance
2. Phase 2: Check for reasons to not try to match this item of content
  1. Check if content has multiple contributor tags (suggesting co-authorship)
  2. Check if a sibling image has multiple contributor tags
  3. Check if a sibling story has multiple contributor tags
3. Phase 3: Test against housekeeping rules
  1. Picture URL rules
  2. Picture Copyright rules
  3. Picture Source rules
  4. Picture Headline/Caption rules
  5. Cartoon description rules
  6. Story byline rules
4. Phase 4: For Valid PicDar images see if it has been processed before

These checks look for a valid PicDar reference in the MATERIAL table columns PICTURE_PICDAR_URN and IPTC_PICTURE_SUPPLIER_REF:

1. if previously disregarded, then disregard again
2. if previously matched then replicate the match (or match to the contract for the relevant date)
  1. Exact match on Picture PicDar URN
  2. Exact match on IPTC Picture Supplier Ref to previously matched Picture PicDar URN
  3. Exact match on IPTC Picture Supplier Ref to previously matched IPTC Picture Supplier Ref
  4. Finally if the data starts with what looks like a valid PicDar URN try that

5. Phase 5: Try to match web Sudoku and Kakuro puzzles

If there is just one relevant contract, and no unfulfilled commissions, then a match can be made.
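The Phase 5 condition can be expressed as a small Python sketch. The function name and arguments are hypothetical; in RCS the contracts and commissions would come from the database rather than from lists.

```python
def puzzle_contract_to_match(relevant_contracts, unfulfilled_commissions):
    """Return the contract to match to, or None when the choice is not clear.

    A match is only made when there is exactly one relevant contract and
    there are no unfulfilled commissions.
    """
    if len(relevant_contracts) == 1 and not unfulfilled_commissions:
        return relevant_contracts[0]
    return None
```

Returning None leaves the puzzle unmatched so that it stays in the queue for a desk administrator.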

6. Phase 6: Process embedded byline in website content

This step attempts to deal with Content Network items prior to the tag and contributor tests that follow.

7. Phase 7: Match (staff or contract) using web contributor tags

The data in the contributor tags are processed in this order of priority:

1. Supplier ID
2. R2 contributor ID
3. Name

8. Phase 8: Match (staff or contract) using other contributor data

This step uses data not already processed i.e. everything that could be a contributor name, but ignoring the website contributor tags (which exist for text only) as they have already been tested in a previous step.

Process available data in the following order of priority (this applies mainly for pictures):

1. Copyright
2. Source
3. Provider
4. Contributor
5. Photographer
6. Embedded URL found within Byline field
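The priority ordering used in Phase 8 can be illustrated with a short Python sketch. The dictionary keys and the lookup table of recognised names are hypothetical; only the order of the fields is taken from the list above.

```python
# Fields in the order of priority listed above (key names are assumptions).
FIELD_PRIORITY = [
    "copyright",
    "source",
    "provider",
    "contributor",
    "photographer",
    "byline_url",
]


def first_recognised(item, known_names):
    """Return (field, outcome) for the highest-priority recognised value.

    `known_names` maps a name to an outcome such as "staff" or a contract
    reference. Returns None when no field holds a recognised name.
    """
    for field in FIELD_PRIORITY:
        value = item.get(field)
        if value and value in known_names:
            return field, known_names[value]
    return None
```

For example, if both Source and Photographer hold recognised names, the Source value decides the outcome because it comes earlier in the priority order.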

1. Content matching screen

The matching screen is a complex screen that makes it possible to associate one or more items of content with one or more commissions or contracts, or the content can be staffed or disregarded.

The Content matching screen, called rcs_comm_040_pc.fmb, is accessed from this menu option: Content → Content matching

The screen is slower than most to open, but it will open with many blocks of data already queried.

The publication and departments are shown at the top and can be scrolled.

1. Unmatched published content

The unmatched content is shown on the left-hand side of the screen. Only one format category is shown at a time e.g. Text, or Pictures etc.

The data fields shown on the screen vary dependent on both the content format and the publication.

e.g.1 A picture will have an Area field whilst text has a word count

e.g.2 Print content has a book and page number whilst web content has a URL

The content data is held in the MATERIAL table and is flagged as being unmatched/unprocessed if the MATCHED_YN column is ‘N’ and the RTDC_ID column (ID of the reason to disregard content) is null.
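As a minimal sketch, the unmatched test can be written as a Python predicate over a row represented as a dictionary. The column names are those given above; representing a row as a dictionary, with None standing for a database null, is an assumption made for illustration.

```python
def is_unmatched(row):
    """True when the content has been neither matched/staffed nor disregarded."""
    return row["MATCHED_YN"] == "N" and row["RTDC_ID"] is None
```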

A series of function based indexes exist on this table whose columns provide various values for the unmatched content e.g. the format, publication, department, and publication date, which make queries faster in the screens and elsewhere.

1. Viewing content

The “View” button will show the user the content they are processing.

For print content it will show a PDF of either the printed page, or the item on the page – the user can choose. This feature is not working at the time of writing and has been referred to ESD.

For web content the relevant page on the website is launched.

1. PicDar

The “PicDar” button should invoke an API whereby the salient details of a picture are shown in a browser window.

This feature is not working at the time of writing and has been referred to ESD.

1. Diary

Pressing the “Diary” button opens the Picture Diary screen. This button is only available when looking at picture content.

1. Convert

The “Convert” button allows a user to convert a picture into an illustration or vice versa. It is only available when image data has been queried on screen.

1. Commissions

The screenshot above shows how the matching screen looks when matching content to commissions – the commissions tab on the right hand side of the screen is in focus.

1. Restricting the list of commissions

The user can restrict the list of commissions displayed by:

1. Choosing the first letter of the contributor’s name in the alphabetic radio group on the right-hand side; or
2. Choosing to show either Fulfilled but unmatched commissions, or Unfulfilled commissions, using the radio group at the bottom of the tab.

1. Matching on screen

When the user presses the “Match” button the highlighted content is matched to the highlighted commission.

On the database this writes a record to the COMMISSION_MATERIAL table, sets the MATCHED_YN column on the MATERIAL table to ‘Y’, and sets FULFILLED_YN to ‘Y’ on the COMMISSIONS record if it is currently ‘N’. The content and the commission are then cleared from the matching screen and the transaction is automatically saved.
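The three database changes can be sketched in Python using an in-memory SQLite database as a stand-in for Oracle. The table names and the MATCHED_YN and FULFILLED_YN columns come from the text above; the ID columns and the reduced table definitions are assumptions made so that the example runs on its own.

```python
import sqlite3


def match(conn, material_id, commission_id):
    """Record a match between one item of content and one commission."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO commission_material (material_id, commission_id) "
            "VALUES (?, ?)",
            (material_id, commission_id),
        )
        conn.execute(
            "UPDATE material SET matched_yn = 'Y' WHERE id = ?",
            (material_id,),
        )
        # Only fulfil the commission if it is not already fulfilled.
        conn.execute(
            "UPDATE commissions SET fulfilled_yn = 'Y' "
            "WHERE id = ? AND fulfilled_yn = 'N'",
            (commission_id,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE material (id INTEGER PRIMARY KEY, matched_yn TEXT);
CREATE TABLE commissions (id INTEGER PRIMARY KEY, fulfilled_yn TEXT);
CREATE TABLE commission_material (material_id INTEGER, commission_id INTEGER);
INSERT INTO material VALUES (1, 'N');
INSERT INTO commissions VALUES (10, 'N');
""")
match(conn, 1, 10)
```

Wrapping the three statements in a single transaction mirrors the requirement that the match is saved as one unit.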

1. Fulfilling & spiking commissions

The user also has the option to manually fulfil the commission or to spike it. Fulfilling a commission indicates that the content has been received but not yet published. Spiking a commission indicates the submitted content is not going to be published at all. Both processes trigger payment.

1. Find a commission and match to it

If the commission that needs to be matched to is not visible on screen the user can use the Find facility. This takes the user through a series of Lists of Values allowing them to identify the commission that the content should be matched to.

1. Create a commission and match to it

If the commission for the published content has not yet been created the user can initiate the process of creating a new commission using data relating to the published content. This facility will navigate the user to the Commissioning platform, passing through values that can be used to pre-populate the commission record. The Commissioning platform is described in a separate document.

1. Contracts

The Contracts tab lists subscription contracts i.e. contracts where the annual fee is fixed, but not those contracts where either lineage or space rate payments apply. The user can re-query the list to show current contracts or choose to see only those in force at the time the content was published.

If the contract that needs to be matched to is not listed then the user can press the “Find…” a contract button and then identify the contract by navigating through a series of Lists of Values.

Users need to be careful around the start and end of contracts – content may have been submitted before a contract ended, but published after it ended, in which case it still needs to be matched to that contract and not a renewal of the contract.
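The point about contract boundaries can be illustrated with a Python sketch that selects the contract by submission date rather than publication date. The contract references, dates and dictionary keys are invented for the example, and RCS itself leaves this decision to the user.

```python
from datetime import date


def contract_for(contracts, submitted):
    """Return the reference of the contract in force on the submission date."""
    for contract in contracts:
        if contract["start"] <= submitted <= contract["end"]:
            return contract["ref"]
    return None


# Hypothetical original contract and its renewal.
contracts = [
    {"ref": "ORIGINAL", "start": date(2012, 4, 1), "end": date(2013, 3, 31)},
    {"ref": "RENEWAL", "start": date(2013, 4, 1), "end": date(2014, 3, 31)},
]

submitted = date(2013, 3, 28)   # before the original contract ended
published = date(2013, 4, 3)    # after the renewal started
chosen = contract_for(contracts, submitted)
```

Here `chosen` is "ORIGINAL" even though the publication date falls within the renewal.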

1. Lineage

Some contracts for text are agreed on a pay-as-you-go basis whereby the contributor is paid a fee calculated on the basis of a rate per thousand words. Again these contracts are listed in order of contributor name, as shown below:

If the user presses the “Match” button whilst a lineage contract is highlighted they will be presented with the window below, in which they must complete the details of the payment including the Chart of Accounts details. If necessary they can adjust how many words are being paid for (as the delivered word count and published word count may differ significantly).

Instead of pressing “Match” the user can press the “Create…” part-lineage button to make a one-off payment to the contributor, and this will not be linked to the highlighted content. If they do this they see the same window in which they must complete the payment details: but this time they must specify the words, the description of the payment, and the Chart of Accounts details.

In either of these situations pressing “Confirm” confirms the payment and saves it to the database.
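A lineage fee based on a rate per thousand words can be sketched in Python as follows. The function name, the example rate and the rounding to two decimal places (half up) are assumptions; the actual rounding rule used by RCS is not stated in this document.

```python
from decimal import Decimal, ROUND_HALF_UP


def lineage_fee(words, rate_per_thousand):
    """Fee = (words / 1000) * rate, rounded to two decimal places (assumed)."""
    fee = Decimal(words) / Decimal(1000) * Decimal(rate_per_thousand)
    return fee.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 1,250 words at a hypothetical rate of 300.00 per thousand words.
fee = lineage_fee(1250, "300.00")
```

Using Decimal rather than floating point avoids binary rounding artefacts in monetary amounts.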

1. Agreed (space) rates

GNM has contracts with many picture libraries and archives whereby a payment is made for each picture use, according to the size of the published image.

The Agreed rates tab lists these supplier contracts by contributor name. The user can re-query the list to show current contracts or those in force at the time the content was published.

Pressing “Match” will create a payment with the cost automatically calculated based on the price bands agreed against the contract.
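A price-band lookup of this kind can be sketched in Python. The band limits, prices and units below are entirely hypothetical, as is the choice to raise an error for an image larger than the top band; real bands are agreed per contract and held in RCS.

```python
from bisect import bisect_left

# Hypothetical bands: (maximum area, price). Must be sorted by area.
BANDS = [(50, 40), (200, 75), (600, 120)]


def space_rate(area):
    """Return the price of the smallest band whose limit covers the area."""
    limits = [limit for limit, _price in BANDS]
    index = bisect_left(limits, area)
    if index == len(BANDS):
        raise ValueError("area exceeds the largest agreed band")
    return BANDS[index][1]
```

With these example bands an area of exactly 50 falls in the first band, and an area of 51 falls in the second.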

By pressing the “Create…” pix payment button, shown above, the user can enter a one-off payment to the highlighted contributor in the following window:

1. Disregarding

Content can be disregarded for many reasons. Each disregard reason has its own set of rights that will be applied to the content and no fee will be payable.

1. Staffing

Content produced by full-time staff, or part-time staff during their contracted hours, belongs to GNM. This content is marked as “Staff”. Internally this involves matching the content to a hidden record in the COMMISSIONS table that has a reference “G”.

Usually content is staffed when the contributor name is recognisable as a staffer. If the content has been co-authored it may need to be both staffed and matched to a freelance arrangement (see the later section on multi-matching).

Content that has a by-line/credit that is the name of a part-time staffer may need to be manually processed as the automated jobs cannot determine whether the content was produced during their contracted hours or not.

1. By-line pictures

The user may realise a picture is a by-line picture. These images do not follow the normal process, as freelance photographers often take these pictures but GNM retains the copyright in them.

By pressing the “By-line…” button the picture is marked as a staff picture and it is also added to a list of known by-line pictures. This ensures that the next time the image is used RCS can automatically mark it as staff without any user intervention.
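The behaviour of the list of known by-line pictures can be sketched in Python. Holding the list in an in-memory set and identifying pictures by a single reference string are assumptions made for illustration; RCS holds this data in the database.

```python
# Hypothetical in-memory stand-in for the stored list of by-line pictures.
known_byline_pictures = set()


def mark_as_byline(picture_ref):
    """User action: staff the picture and remember it for future uses."""
    known_byline_pictures.add(picture_ref)
    return "staffed"


def auto_process(picture_ref):
    """Automated check: staff a picture already known to be a by-line picture."""
    if picture_ref in known_byline_pictures:
        return "staffed"
    return None
```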

1. Multi-matching

Often an article may be written by more than one person – co-authored. In these cases the content needs to remain on screen after each match so that subsequent matches can be made. An item may be matched to any combination of commissions, contracts, and staffing.

A commission may also be created for multiple items of content, in which case it needs to remain on screen after first being matched in order to match it to other items of content.

To support both of these requirements to “multi-match” the screen has a multi-match mode. The user can press the button in the centre of the screen with a green traffic light icon and labelled “Start multi-match” and this puts the screen into the new mode:

To avoid errors and to ensure the user is reminded the screen is in multi-match mode some of the screen’s visual characteristics change:

- A message appears in bold red in the top left of the screen: “Multi-match mode”;
- The current record highlighting for the content, commissions, contracts, agreed rates, and lineage changes from the standard green to yellow;
- The disregard reasons are removed as content cannot be both matched and disregarded;
- The button with the green traffic light icon changes to a red traffic light icon with the label “Stop multi-match”; and
- For text content a box appears that allows the user to say how many words are being attributed to a given match.

In multi-match mode when the user presses the “Match” button the processing described earlier in this document still takes place. The only difference is that the data is not cleared from the screen, thus enabling additional matches to be made.

1. Complex screen processing

A few techniques are used in this screen that are more complex than usual, and worthy of further note:

1. Fast access to queued data

To make it possible to open the screens relatively quickly, extra indexes have been added to support the lists of unprocessed content, unfulfilled commissions, relevant contracts etc. (for the context of the queried editorial department and format, and for the current user’s permissions). In most cases these are function-based indexes that:

- Are constructed so as to only have index entries for the relevant data; and
- Index the list data in the order it is to be displayed on screen, reducing sort times.
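The effect of an index that only holds entries for the relevant rows can be illustrated with SQLite from Python. Note the substitution: SQLite has no Oracle-style function-based indexes, so a SQLite partial index (an index with a WHERE clause) is used here as an analogous technique. The index name, the ID column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE material (
    id INTEGER PRIMARY KEY,
    publication_date TEXT,
    matched_yn TEXT,
    rtdc_id INTEGER
);
-- Only unmatched rows get an index entry, keyed in display order.
CREATE INDEX material_unmatched_idx
    ON material (publication_date)
    WHERE matched_yn = 'N' AND rtdc_id IS NULL;
INSERT INTO material VALUES (1, '2013-10-02', 'N', NULL);
INSERT INTO material VALUES (2, '2013-10-01', 'N', NULL);
INSERT INTO material VALUES (3, '2013-09-30', 'Y', NULL);
INSERT INTO material VALUES (4, '2013-09-29', 'N', 7);
""")

# The queue query repeats the index condition so the index is eligible.
queue = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM material "
        "WHERE matched_yn = 'N' AND rtdc_id IS NULL "
        "ORDER BY publication_date"
    )
]
```

Because matched and disregarded rows have no entry in the index, it stays small however large the MATERIAL table grows.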

1. Views as base tables

The view LINEAGE_CONTRIBUTOR_DEPTS shows lineage contributors and their contracts and lineage rates. Using a view makes it possible to join the CONTENT_PROVIDERS table to the AGREEMENTS table and also the AGREEMENT_DEPARTMENT_MATCHING table in a single query rather than having AGREEMENTS as the base table with sub-query and POST-QUERY access to the other two tables (which would be less efficient).

1. Non-base table inserts

When an item of content is matched to a COMMISSIONS or AGREEMENTS (Contract) record a new record is created in the COMMISSION_MATERIAL table, the AGREEMENT_MATERIAL table, or the LINEAGE table. These latter tables do not exist as base table blocks in the Oracle Form, so to ensure the records can be written to the database within the integrity of an Oracle Forms commit unit the inserts are performed on the PRE-UPDATE trigger on the MTRL block (which is based on the MATERIAL table).

The matched record in the MTRL block will always be updated during a match, as the flag MATCHED_YN will be set to ‘Y’, and consequently the PRE-UPDATE trigger will always fire. In multi-match mode even if the value is already set to ‘Y’ it gets overwritten, just to ensure an update is processed, thereby ensuring the PRE-UPDATE trigger will still fire.

1. Content matching by cost centre screen

The Content matching by cost centre screen is very similar to the Content matching screen. The only significant difference is that instead of the context at the top of the screen being the publication and department, in this screen the context is the cost centre paying for the content.

The detail of the matching process described in the previous section applies to this screen. The screenshot below is provided to highlight the differences in the screen layouts.

This screen has a number of features which are not available in the main content matching screen:

- The user can show content and contributor agreements across all cost centres, rather than one at a time;
- The user can see web content or print content only;
- The user can query a specific tone, series, blog or section;
- When scrolling to a new item of content it disappears from the screen if already matched by another user, or as a sibling of the current user’s matches;
- In multi-match mode (where matches are not removed from the screen) matched content is highlighted;
- In the commissions tab the user can request to see all commissions for a contributor, by clicking on the “one contributor” radio button, or by double-clicking on the contributor name field. This should make the “Find a commission” button obsolete as it was so much slower;
- At the top of the screen the user will only be shown the format they process.

1. Matching crosswords

This screen has also been further developed to behave differently in different modes:

If opened from the menu option Content → Match crosswords, the screen opens without the cost centres zone and shows only crosswords:

The user can further restrict what is shown by web or print publications.

1. Matching book extracts

If opened from the menu option Content → Match book extracts, the screen opens without the cost centres zone and shows only book extracts (published on the website):

1. Priority matching review

Mainstream publishes GNM content in accordance with rights information published by RCS. In the case of certain priority content (currently interviews, book extracts and the long reads) the rights information is only published once the content processing in RCS (i.e. matching or disregarding) has been approved.

A separate queue of this content is maintained for approval by the Rights department. This queue is accessed from the menu option: Content → Priority matching to review

The Rights department user approves the processing by checking the “Cleared?” checkbox on the right-hand side of the screen. This then triggers the generation and publication of rights tags for the web systems to consume, thereby potentially adding the content to the CAPI feed.
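The approval gate can be summarised in a short Python sketch. The function name, the type labels and the assumption that non-priority content needs no approval once processed are illustrative only and are not taken from the RCS code.

```python
# Priority content types named in this section (labels are assumptions).
PRIORITY_TYPES = {"interview", "book extract", "long read"}


def may_publish_rights(content_type, processed, cleared):
    """Decide whether rights information may be published for an item.

    Priority content additionally needs the "Cleared?" approval from the
    Rights department; other content is assumed to need processing only.
    """
    if not processed:
        return False
    if content_type in PRIORITY_TYPES:
        return cleared
    return True
```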

End of Document
