Guardian News & Media
GNM RCS
Content web feed
Technical specification
Prepared by O3 Team Limited
Authors Nigel Robson
Creation date 04/10/2013
Document Ref. GNM_RCS_Content_Web_Feed_TS.docx
Version draft for review
Introduction
Purpose
The document GNM_RCS_Content_Processing_FS.docx is the functional specification that describes what business functions RCS supports in relation to processing published content.
This document is one of a set of technical specifications that provide details of how those functions are implemented in RCS.
Scope
This document focusses on the recording of web content in RCS. Separate documents deal with all other aspects of content processing in RCS, including print content processing, matching, specials, and AV products.
This document is intended as a high-level technical document outlining how the relevant business functions are implemented in terms of software modules.
Importantly, this document does not aim to provide the level of detail that would be required in a programming specification in areas such as program structure, detailed business rules, data integrity, validation, locking considerations, data security, calls to/from other software modules, performance considerations, and so forth.
For details of program logic and coding, the reader should refer to the program files themselves.
Website
GNM’s website, www.theguardian.com, publishes virtually everything that appears in print in The Guardian and The Observer, and also many other articles and images, as well as many blogs, galleries, polls, interactives, competitions, and audio-visual products.
All of the published web content should be rights managed by the Rights Department, and much of this is done through RCS.
A list of web-published content (but not the content itself) is stored in the TLIB schema, which is consistent with where a list of print content is stored. RCS extracts the data it requires from these lists in the TLIB schema.
Feed 1 (June 2009 – 2013)
The TLIB user processes an XML feed published by the website that identifies all the key components of each published item. This feed is currently found at this URL:
http://cms.guprod.gnl/tools/external/rcs/list?from=201310070000&to=201310071800
The URL accepts a from and a to date, and returns the content whose publication date and time is within the time period requested. The maximum period that can be requested is 1 day.
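As an illustration only, the request URL could be built from a from and a to timestamp as sketched below in Python (the production code is PL/SQL). The timestamp layout yyyymmddhhmi is inferred from the example URL above; the function name build_feed_url and the client-side check of the one-day limit are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

# Base address taken from the example URL above.
FEED_URL = "http://cms.guprod.gnl/tools/external/rcs/list"
# Timestamp layout inferred from the example (201310070000): yyyymmddhhmi.
STAMP = "%Y%m%d%H%M"
# The feed accepts a request period of at most one day.
MAX_PERIOD = timedelta(days=1)


def build_feed_url(start: datetime, end: datetime) -> str:
    """Return the Feed 1 URL for the period start..end (hypothetical helper)."""
    if end <= start:
        raise ValueError("'to' must be later than 'from'")
    if end - start > MAX_PERIOD:
        raise ValueError("the feed accepts a period of at most 1 day")
    return f"{FEED_URL}?from={start.strftime(STAMP)}&to={end.strftime(STAMP)}"


url = build_feed_url(datetime(2013, 10, 7, 0, 0), datetime(2013, 10, 7, 18, 0))
print(url)
```

Rejecting an over-long period before the request is made is a design choice of the sketch; how the website itself responds to such a request is not described here.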
The software that gets the XML and parses it is this packaged procedure:
website_content_interface.get_latest_content
It uses standard Oracle built-in packages to get the XML, parse it, and extract what is needed.
An automatic process (database job) in TLIB requests content in chronological order of the date and time at which it was published. It reads the publication date and time of the last item of content received from the website and then requests content from that point until the current time. Usually this will only be a few minutes, as TLIB tries to stay fully up to date, but if more than a day needs to be caught up then multiple chronologically sequential requests are made. The software has been written to ensure these extract jobs can never run in parallel.
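The catch-up behaviour can be pictured with the following minimal Python sketch, which splits the period between the last item received and the current time into sequential requests of at most one day each. The function name and the sample dates are hypothetical, and the mechanism that prevents parallel runs is not shown.

```python
from datetime import datetime, timedelta

MAX_PERIOD = timedelta(days=1)  # longest period the feed will accept


def catch_up_windows(last_published: datetime, now: datetime):
    """Yield (from, to) pairs covering last_published..now in time order."""
    start = last_published
    while start < now:
        end = min(start + MAX_PERIOD, now)
        yield start, end
        start = end  # the next request begins where this one stopped


# Hypothetical example: TLIB is a little over two days behind.
windows = list(catch_up_windows(datetime(2013, 10, 5, 9, 30),
                                datetime(2013, 10, 7, 18, 0)))
for w_from, w_to in windows:
    print(w_from, "->", w_to)
```

When TLIB is only a few minutes behind, the loop produces a single short request.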
For each item, RCS records the key data that it requires to manage that item.
XML format
Below are examples of the format of the XML for an article and an image:
Article example
<rcsContentsAndPictures>
<rcsContent>
<contentType>Article</contentType>
<section>UK news</section>
<title>Scottish independence: armed forces will be under-funded, says MoD</title>
<url>http://www.theguardian.com/uk-news/2013/oct/07/scottish-independence-armed-forces-funding</url>
<description>Report states 2.5bn proposed annual budget for Scottish Defence Force is too small and takes no account of startup costs</description>
<publicationName>The Guardian</publicationName>
<byline>Severin Carrell, Scotland correspondent</byline>
<firstPublishedOn>07/10/2013 00:01:01</firstPublishedOn>
<keywords>
<string>Scotland</string>
<string>Scottish National party (SNP)</string>
<string>Scottish independence</string>
<string>Military</string>
<string>Scottish politics</string>
<string>Politics</string>
<string>UK news</string>
</keywords>
<contributors>
<contributor>
<name>Severin Carrell</name>
<r2contribid>19388</r2contribid>
</contributor>
</contributors>
<wordCount>760</wordCount>
<pageNumber>6</pageNumber>
<cmsId>419188103</cmsId>
<storyBundleId>8161116</storyBundleId>
</rcsContent>
Image example
<rcsContentsAndPictures>
<rcsContent>
<rcsPicture>
<pictureContext>In Article Picture</pictureContext>
<caption>Gwalior fort. Photo: Alex Bellos / Gwalior fort</caption>
<url>http://static.guim.co.uk/sys-images/Guardian/Pix/pictures/2013/10/7/1381129339514/rsz_img_2486.jpg</url>
<source>Picture</source>
<height>256</height>
<width>460</width>
<parentUrl>http://www.theguardian.com/science/alexs-adventures-in-numberland/2013/oct/07/mathematics1</parentUrl>
<imageId>419223928</imageId>
<pictureId>419260601</pictureId>
</rcsPicture>
XML processing
The XML is processed entirely in PL/SQL using Oracle built-in packages. In particular these standard Oracle packages are used:
dbms_lock to take a session lock
utl_http to get the XML, in chunks, from the URL
dbms_lob to hold the concatenated chunks of XML in CLOB datatype
dbms_xmlparser to parse the XML in the local CLOB
dbms_xmldom to process the parsed XML
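The sequence of steps above (get the XML in chunks, concatenate, parse, then walk the parsed document) can be sketched in Python as follows. This is an illustration and not the PL/SQL implementation: the HTTP fetch is replaced by a list of strings, the element names are taken from the article example above, the closing tags are added so that the sample is well-formed, and the day/month/year reading of firstPublishedOn is inferred from the example URL.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the article example, supplied in chunks as a fetch would.
SAMPLE_CHUNKS = [
    "<rcsContentsAndPictures><rcsContent>",
    "<contentType>Article</contentType><section>UK news</section>",
    "<url>http://www.theguardian.com/uk-news/2013/oct/07/"
    "scottish-independence-armed-forces-funding</url>",
    "<firstPublishedOn>07/10/2013 00:01:01</firstPublishedOn>",
    "<keywords><string>Scotland</string><string>Military</string></keywords>",
    "<contributors><contributor><name>Severin Carrell</name>"
    "<r2contribid>19388</r2contribid></contributor></contributors>",
    "<wordCount>760</wordCount><cmsId>419188103</cmsId>",
    "</rcsContent></rcsContentsAndPictures>",
]


def parse_feed(chunks):
    """Join the chunks, parse the document, return one dict per rcsContent."""
    document = "".join(chunks)        # comparable to building the CLOB
    root = ET.fromstring(document)    # comparable to the parser step
    items = []
    for node in root.iter("rcsContent"):   # comparable to walking the DOM
        items.append({
            "content_type": node.findtext("contentType"),
            "section": node.findtext("section"),
            "url": node.findtext("url"),
            # Day/month/year order assumed from the 7 October 2013 example.
            "first_published": datetime.strptime(
                node.findtext("firstPublishedOn"), "%d/%m/%Y %H:%M:%S"),
            "keywords": [k.text for k in node.findall("keywords/string")],
            "contributors": [
                (c.findtext("name"), c.findtext("r2contribid"))
                for c in node.findall("contributors/contributor")
            ],
            "word_count": int(node.findtext("wordCount")),
            "cms_id": node.findtext("cmsId"),
        })
    return items


items = parse_feed(SAMPLE_CHUNKS)
print(items[0]["url"], items[0]["keywords"])
```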
Parse errors
In the past, RCS has occasionally been unable to parse the XML received. This was due to errors within the generated XML that Oracle’s internal XML parser could not handle.
If this situation occurs, the time period concerned needs to be extracted manually in smaller chunks of hours or just minutes, avoiding only the minutes in which the errors occur. This should mean that only a very few items, published within a small timeframe, are omitted. The RCS administrator can test the URL for that timeframe in a browser, and when the offending item of content is found it can be reported to the web team.
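The narrowing-down described above is a manual procedure. Purely to illustrate the idea, the Python sketch below halves a failing period repeatedly until the remaining failing period is a minute or less. The halving strategy, the function names, and the stand-in loader are all assumptions of the sketch and are not part of RCS.

```python
from datetime import datetime, timedelta


def isolate_bad_periods(load, start, end, smallest=timedelta(minutes=1)):
    """Try to load start..end; on a parse failure split the period in half.

    `load(start, end)` is a hypothetical callable that fetches and parses
    the feed for one period and raises ValueError when the XML cannot be
    parsed. Returns the list of smallest periods that still fail.
    """
    try:
        load(start, end)
        return []
    except ValueError:
        if end - start <= smallest:
            return [(start, end)]      # cannot split further: report it
        middle = start + (end - start) / 2
        return (isolate_bad_periods(load, start, middle, smallest)
                + isolate_bad_periods(load, middle, end, smallest))


BAD_MOMENT = datetime(2013, 10, 7, 10, 17, 30)   # hypothetical bad item
loaded = []


def fake_load(start, end):
    """Stand-in for the real extract: fails if the bad item is in range."""
    if start <= BAD_MOMENT < end:
        raise ValueError("XML could not be parsed")
    loaded.append((start, end))


bad = isolate_bad_periods(fake_load, datetime(2013, 10, 7, 10, 0),
                          datetime(2013, 10, 7, 11, 0))
print(bad)
```

Every period that parses successfully is loaded as the search proceeds, so only the small period returned in bad remains to be investigated and reported.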
Data load errors
Very occasionally an error is encountered with the data after the XML has been parsed. This may be a field having a value longer than previously expected. To ensure RCS saves as many items of content as possible, the extract process saves changes to the database after each item is processed, and then skips any errors, moving on to the next item.
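The save-after-each-item approach can be illustrated with the Python sketch below, which uses an in-memory SQLite database in place of Oracle. The table definition is a hypothetical, much simplified stand-in, and the CHECK constraint merely simulates a column that is too short for an unexpectedly long value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the CHECK stands in for a column that
# is too short for an unexpectedly long value.
conn.execute(
    "CREATE TABLE website_content ("
    " url TEXT PRIMARY KEY,"
    " title TEXT CHECK (length(title) <= 40))"
)


def load_items(conn, items):
    """Insert items one by one, committing each and skipping failures."""
    skipped = []
    for item in items:
        try:
            conn.execute(
                "INSERT INTO website_content (url, title) VALUES (?, ?)",
                (item["url"], item["title"]),
            )
            conn.commit()          # save this item before the next one
        except sqlite3.Error as exc:
            conn.rollback()        # discard only the failed item
            skipped.append((item["url"], str(exc)))
    return skipped


items = [
    {"url": "http://example.invalid/a", "title": "Short title"},
    {"url": "http://example.invalid/b", "title": "x" * 100},   # too long
    {"url": "http://example.invalid/c", "title": "Another short title"},
]
skipped = load_items(conn, items)
print(len(skipped), "item(s) skipped")
```

Because each item is committed on its own, the failure of the second item does not prevent the first and third from being saved.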
Manual load
For content that has been skipped and needs to be reloaded later, there is a facility in RCS to request the re-extraction from the website of either the date on which it was published or just a single URL. This is done via this menu option:
Content → Extract web content. This opens the Oracle Form called rcs_extract_010_pc.fmb
This extract gets all content for that day, or just the one URL, but when loading it into the TLIB schema it checks for items that already exist and updates them rather than duplicating them. This duplicate check uses the URL of the content, as this uniquely identifies it.
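A minimal Python sketch of the update-rather-than-duplicate rule is given below, again using an in-memory SQLite database in place of Oracle. The table and column definitions are simplified assumptions; only the use of the URL as the duplicate check is taken from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the WEBSITE_CONTENT table.
conn.execute(
    "CREATE TABLE website_content ("
    " wcon_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " url TEXT UNIQUE NOT NULL,"
    " title TEXT)"
)


def save_item(conn, url, title):
    """Update the row for this URL if it exists, otherwise insert it."""
    row = conn.execute(
        "SELECT wcon_id FROM website_content WHERE url = ?", (url,)
    ).fetchone()
    if row:
        conn.execute(
            "UPDATE website_content SET title = ? WHERE wcon_id = ?",
            (title, row[0]),
        )
        action = "updated"
    else:
        conn.execute(
            "INSERT INTO website_content (url, title) VALUES (?, ?)",
            (url, title),
        )
        action = "inserted"
    conn.commit()
    return action


# Loading the same URL twice leaves a single row with the latest title.
first = save_item(conn, "http://example.invalid/story", "First version")
second = save_item(conn, "http://example.invalid/story", "Corrected version")
print(first, second)
```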
Database tables loaded
This extract process loads the data into the following TLIB tables:
WEBSITE_CONTENT
WEBSITE_CONTENT_KEYWORDS
WEBSITE_CONTENT_CONTRIBUTORS
WEBSITE_CONTENT_TAGS
Each entry in the WEBSITE_CONTENT table represents an item of content and is allocated a unique ID from the WCON_ID sequence.
If new sections of the website appear these are recorded in the WEBSITE_SECTIONS table and need to be mapped to a Cost Centre by the RCS Administrator using the Website section mapping screen, rcs_wsec_010_pc.fmb, which is accessed from the menu option Housekeeping → Department setup → Website section mapping.
Keywords, details of contributors, and any other tags that are supplied are also stored against the content.
Feed 2 (2013/14 onwards)
A replacement data feed, using the API, is currently being developed. It includes a richer set of data, but it is not in production yet as it omits a means of identifying images in Octopus (picture IDs), which is needed to get IPTC header data and, most importantly, the PicDar URN for the images.
(Getting the PicDar URN is crucial to enable RCS to link multiple uses of the same image e.g. a print image linked to both a web thumbnail and web in-article picture, as it is vital RCS only process and pay for each image once.)
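To illustrate why the URN matters, the Python sketch below groups several uses of images by a shared identifier so that each image is handled once. The URN values, the field names, and the function name are placeholders invented for the example and do not reflect real PicDar data or RCS structures.

```python
from collections import defaultdict

# Hypothetical records: the URN values and field names are placeholders.
image_uses = [
    {"urn": "URN-A", "use": "print image"},
    {"urn": "URN-A", "use": "web thumbnail"},
    {"urn": "URN-A", "use": "web in-article picture"},
    {"urn": "URN-B", "use": "web in-article picture"},
]


def group_by_urn(uses):
    """Return {urn: [use, ...]} so that each image is processed once."""
    grouped = defaultdict(list)
    for record in uses:
        grouped[record["urn"]].append(record["use"])
    return dict(grouped)


grouped = group_by_urn(image_uses)
for urn, uses in grouped.items():
    print(urn, "->", len(uses), "use(s), processed once")
```

Without a common identifier such as the URN, the three uses of the first image would appear to be three separate images.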
There may be a transition period whereby this new content feed starts to be used for all content except images, whilst Feed 1 continues to be used for the images until such time as the API can supply all the data required.
The URL for this feed is:
The software that gets the data and parses it is a package procedure:
web_api_content_feed.get_content
XML format
Below are examples of the format of the XML for an article and an image. Many more formats are processed, but these examples are provided to demonstrate the extensive XML that needs to be parsed, compared with the far simpler parsing in Feed 1.
Article example
<response status="ok" user-tier="internal" total="459" current-page="1" pages="46" start-index="1" page-size="10" order-by="newest">
<results>
<content web-url="http://www.theguardian.com/music/2013/jun/01/lou-reed-liver-transplant" section-id="music" web-title="Lou Reed recovering after liver transplant" api-url="http://content.guardianapis.com/music/2013/jun/01/lou-reed-liver-transplant" section-name="Music" web-publication-date="2013-05-31T23:57:04Z" id="music/2013/jun/01/lou-reed-liver-transplant">
<fields>
<field name="newspaper-page-number">9</field>
<field name="trail-text"><p>Musician, 71, underwent life-saving surgery last month in Cleveland, says wife Laurie Anderson</p></field>
<field name="headline">Lou Reed recovering after liver transplant</field>
<field name="body"><p>Lou Reed, the US songwriter, poet and vocalist with the Velvet Underground, had a liver transplant last month, according to his wife, the musician and performance artist Laurie Anderson.</p><p>"It's as serious as it gets. He was dying. You don't get it for fun," said Anderson, who added that her husband was now on the road to recovery following the life-saving surgery.</p><p>Reed, 71, cancelled a number of concerts in April and had surgery in Cleveland rather than in his native New York due to what Anderson described as the "dysfunctional" hospitals in his home town. She <a href="http://www.thetimes.co.uk/tto/arts/music/article3778630.ece" title="">said in an interview with the Times</a>: "I don't think he'll ever totally recover from this, but he'll certainly be back to doing [things] in a few months. He's already working and doing t'ai chi. I'm very happy. It's a new life for him."</p><p>The couple, above, have been together for more than 20 years but got married in 2008 following a spur-of-the-moment decision while talking on the phone.</p><p>Anderson, whose 1981 single, O Superman, reached number two in the British charts, spoke of her awe for the operation which saved Reed's life.</p><p>"You send out two planes – one for the donor, one for the recipient – at the same time. You bring the donor in live, you take him off life support. It's a technological feat.</p><p>"I was completely awestruck. I find certain things about technology truly, deeply inspiring."</p><p>Reed surprised fans in New York in March when he appeared at a playback of his seminal album Transformer.</p><p>Best known as guitarist, vocalist, and principal songwriter of the Velvet Underground, Reed has also had a successful solo career spanning a number of decades, producing hits such as Walk on the Wild Side in 1972. 
Recent collaborations have included a 2011 album with the rock group Metallica.</p><p>Anderson is due to perform at the Barbican in London this month with the string ensemble, Kronos Quartet.</p><!-- Guardian Watermark: internal-code/content/409907216|2013-10-07T15:13:13Z|dc9c8cf25dcb3abf2b2183b31582e7e9bd8e2a7c --></field>
<field name="show-in-related-content">true</field>
<field name="last-modified">2013-06-01T08:58:37Z</field>
<field name="has-story-package">true</field>
<field name="score">1.0</field>
<field name="secure-thumbnail">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044544632/Lou-Reed-and-Laurie-Ander-003.jpg</field>
<field name="standfirst">Musician, 71, underwent life-saving surgery last month in Cleveland, says wife Laurie Anderson</field>
<field name="short-url">http://gu.com/p/3g99z</field>
<field name="thumbnail">http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044544632/Lou-Reed-and-Laurie-Ander-003.jpg</field>
<field name="wordcount">314</field>
<field name="commentable">false</field>
<field name="internal-content-code">409907216</field>
<field name="allow-ugc">false</field>
<field name="internal-octopus-code">7670831</field>
<field name="is-premoderated">false</field>
<field name="byline">Ben Quinn</field>
<field name="publication">The Guardian</field>
<field name="newspaper-edition-date">2013-06-01</field>
<field name="internal-page-code">1916079</field>
<field name="production-office">UK</field>
<field name="should-hide-adverts">false</field>
<field name="live-blogging-now">false</field>
<field name="comment-close-date">2013-06-03T23:57:04Z</field>
</fields>
<tags>
<tag web-url="http://www.theguardian.com/music/lou-reed" type="keyword" section-id="music" web-title="Lou Reed" api-url="http://content.guardianapis.com/music/lou-reed" section-name="Music" id="music/lou-reed">
<references>
<reference id="musicbrainz/9d1ebcfe-4c15-4d18-95d3-d919898638a1" type="musicbrainz" />
</references>
</tag>
<tag web-url="http://www.theguardian.com/music/music" type="keyword" section-id="music" web-title="Music" api-url="http://content.guardianapis.com/music/music" section-name="Music" id="music/music">
<references />
</tag>
<tag web-url="http://www.theguardian.com/world/usa" type="keyword" section-id="world" web-title="United States" api-url="http://content.guardianapis.com/world/usa" section-name="World news" id="world/usa">
<references />
</tag>
<tag web-url="http://www.theguardian.com/world/world" type="keyword" section-id="world" web-title="World news" api-url="http://content.guardianapis.com/world/world" section-name="World news" id="world/world">
<references />
</tag>
<tag web-url="http://www.theguardian.com/tone/news" type="tone" web-title="News" api-url="http://content.guardianapis.com/tone/news" id="tone/news">
<references />
</tag>
<tag web-url="http://www.theguardian.com/profile/benquinn" type="contributor" web-title="Ben Quinn" api-url="http://content.guardianapis.com/profile/benquinn" id="profile/benquinn" bio="&lt;p&gt;Ben Quinn is a news reporter for the Guardian&lt;/p&gt;" r2-contributor-id="25906">
<references />
</tag>
<tag web-url="http://www.theguardian.com/theguardian/mainsection/uknews" type="newspaper-book-section" section-id="uk-news" web-title="UK news" api-url="http://content.guardianapis.com/theguardian/mainsection/uknews" section-name="UK news" id="theguardian/mainsection/uknews">
<references />
</tag>
<tag web-url="http://www.theguardian.com/theguardian/mainsection" type="newspaper-book" section-id="news" web-title="Main section" api-url="http://content.guardianapis.com/theguardian/mainsection" section-name="News" id="theguardian/mainsection">
<references />
</tag>
<tag web-url="http://www.theguardian.com/articles" type="type" web-title="Article" api-url="http://content.guardianapis.com/articles" id="type/article">
<references />
</tag>
<tag web-url="http://www.theguardian.com/theguardian/all" type="publication" section-id="theguardian" web-title="The Guardian" api-url="http://content.guardianapis.com/theguardian/all" section-name="From the Guardian" id="publication/theguardian">
<references />
</tag>
</tags>
<factboxes />
<media-assets>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044541643/Lou-Reed-and-Laurie-Ander-001.jpg" index="1" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044541643/Lou-Reed-and-Laurie-Ander-001.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">54</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">54</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044543548/Lou-Reed-and-Laurie-Ander-002.jpg" index="2" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044543548/Lou-Reed-and-Laurie-Ander-002.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">130</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">140</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044544632/Lou-Reed-and-Laurie-Ander-003.jpg" index="3" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044544632/Lou-Reed-and-Laurie-Ander-003.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">84</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">140</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044545797/Lou-Reed-and-Laurie-Ander-004.jpg" index="4" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044545797/Lou-Reed-and-Laurie-Ander-004.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">132</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">220</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044547056/Lou-Reed-and-Laurie-Ander-005.jpg" index="5" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044547056/Lou-Reed-and-Laurie-Ander-005.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">168</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">280</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044548279/Lou-Reed-and-Laurie-Ander-006.jpg" index="6" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044548279/Lou-Reed-and-Laurie-Ander-006.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">180</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">300</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044549631/Lou-Reed-and-Laurie-Ander-007.jpg" index="7" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044549631/Lou-Reed-and-Laurie-Ander-007.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">228</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">380</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044550896/Lou-Reed-and-Laurie-Ander-008.jpg" index="8" rel="alt-size" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044550896/Lou-Reed-and-Laurie-Ander-008.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">276</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">460</field>
</fields>
</asset>
<asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044550896/Lou-Reed-and-Laurie-Ander-008.jpg" index="1" rel="body" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044550896/Lou-Reed-and-Laurie-Ander-008.jpg</field>
<field name="source">Billy Farrell Agency / Rex Featu</field>
<field name="alt-text">Lou Reed and Laurie Anderson</field>
<field name="height">276</field>
<field name="credit">Billy Farrell Agency / Rex Featu</field>
<field name="caption">Lou Reed and Laurie Anderson pictured at a dinner in New York in 2011. Photograph: Billy Farrell Agency / Rex Featu</field>
<field name="width">460</field>
</fields>
</asset>
</media-assets>
<snippets />
<references>
<reference id="musicbrainz/9d1ebcfe-4c15-4d18-95d3-d919898638a1" type="musicbrainz" />
</references>
</content>
Clearly this is far more complex than Feed 1: the XML describing the article is itself much more complicated, and it also includes a lot of information about other objects related to the article.
Image example
<content web-url="http://www.theguardian.com/music/cartoon/2013/jun/01/ronnie-wood-caricature" section-id="music" web-title="Ronnie Wood by Nicola Jennings" api-url="http://content.guardianapis.com/music/cartoon/2013/jun/01/ronnie-wood-caricature" section-name="Music" web-publication-date="2013-05-31T23:33:00Z" id="music/cartoon/2013/jun/01/ronnie-wood-caricature">
<fields>
<field name="trail-text"><p>Rock star</p></field>
<field name="headline">Ronnie Wood</field>
<field name="show-in-related-content">true</field>
<field name="last-modified">2013-05-31T23:33:01Z</field>
<field name="has-story-package">false</field>
<field name="score">1.0</field>
<field name="secure-thumbnail">https://static-secure.guim.co.uk/sys-images/Guardian/Pix/cartoons/2013/5/30/1369934401992/Ronnie-Wood-by-Nicola-Jen-002.jpg</field>
<field name="short-url">http://gu.com/p/3g8eh</field>
<field name="thumbnail">http://static.guim.co.uk/sys-images/Guardian/Pix/cartoons/2013/5/30/1369934401992/Ronnie-Wood-by-Nicola-Jen-002.jpg</field>
<field name="commentable">false</field>
<field name="internal-content-code">409815809</field>
<field name="allow-ugc">false</field>
<field name="is-premoderated">false</field>
<field name="byline">Nicola Jennings</field>
<field name="publication">theguardian.com</field>
<field name="internal-page-code">1915478</field>
<field name="production-office">UK</field>
<field name="should-hide-adverts">false</field>
<field name="live-blogging-now">false</field>
<field name="comment-close-date">2013-06-03T23:33:01Z</field>
</fields>
<tags>
<tag web-url="http://www.theguardian.com/music/ronnie-wood" type="keyword" section-id="music" web-title="Ronnie Wood" api-url="http://content.guardianapis.com/music/ronnie-wood" section-name="Music" id="music/ronnie-wood">
<references>
<reference id="musicbrainz/92ed8183-8f22-42b2-af4e-d44137610fa0" type="musicbrainz" />
</references>
</tag>
<tag web-url="http://www.theguardian.com/uk/series/nicola-jennings-caricatures" type="series" section-id="uk-news" web-title="Nicola Jennings's caricatures" api-url="http://content.guardianapis.com/uk/series/nicola-jennings-caricatures" section-name="UK news" id="uk/series/nicola-jennings-caricatures">
<references />
</tag>
<tag web-url="http://www.theguardian.com/music/therollingstones" type="keyword" section-id="music" web-title="The Rolling Stones" api-url="http://content.guardianapis.com/music/therollingstones" section-name="Music" id="music/therollingstones">
<references>
<reference id="musicbrainz/b071f9fa-14b0-4217-8e97-eb41da73f598" type="musicbrainz" />
</references>
</tag>
<tag web-url="http://www.theguardian.com/music/popandrock" type="keyword" section-id="music" web-title="Pop and rock" api-url="http://content.guardianapis.com/music/popandrock" section-name="Music" id="music/popandrock">
<references>
<reference id="musicbrainzgenre/adult-contemporary" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/art-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/ballad" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/classic-pop-and-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/classic-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/garage-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/goth-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/gothic-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/grunge" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/hard-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/other-pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop-and-chart" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop-rap" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/post-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/progressive-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock-and-roll" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock-pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock-roll" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rockabilly" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/schlager" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/soft-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/southern-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/synth-pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/synthpop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/top-40" type="musicbrainzgenre" />
</references>
</tag>
<tag web-url="http://www.theguardian.com/profile/nicola-jennings" type="contributor" web-title="Nicola Jennings" api-url="http://content.guardianapis.com/profile/nicola-jennings" id="profile/nicola-jennings" bio="&lt;p&gt;Nicola Jennings originally trained as a theatre designer and started work designing for opera. She began caricaturing for the London Daily News in 1987, went on to work for the Daily Mirror and the Observer, and can now be seen regularly in the Guardian. She has also produced animated cartoons for Channel 4's A Week in Politics and drawn live on BBC2's Midnight Hour &lt;/p&gt;" byline-image-url="http://static.guim.co.uk/sys-images/Guardian/Pix/pictures/2008/11/17/nicola_jennings_140x140.jpg" r2-contributor-id="28915">
<references />
</tag>
- <tag web-url="http://www.theguardian.com/cartoons/archive" type="type" web-title="Cartoon" api-url="http://content.guardianapis.com/cartoons/archive" id="type/cartoon">
<references />
</tag>
</tags>
<factboxes />
<media-assets>
<asset file="http://static.guim.co.uk/sys-images/Guardian/Pix/cartoons/2013/5/30/1369934399775/Ronnie-Wood-by-Nicola-Jen-001.jpg" index="1" rel="body" type="picture">
<fields>
<field name="secure-file">https://static-secure.guim.co.uk/sys-images/Guardian/Pix/cartoons/2013/5/30/1369934399775/Ronnie-Wood-by-Nicola-Jen-001.jpg</field>
<field name="source">Guardian</field>
<field name="photographer">Nicola Jennings</field>
<field name="alt-text">Ronnie Wood by Nicola Jennings</field>
<field name="height">296</field>
<field name="credit">Nicola Jennings/Guardian</field>
<field name="caption">Ronnie Wood</field>
<field name="width">220</field>
</fields>
</asset>
</media-assets>
<snippets />
<references>
<reference id="musicbrainzgenre/goth-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/grunge" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/post-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/synth-pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/adult-contemporary" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/synthpop" type="musicbrainzgenre" />
<reference id="musicbrainz/92ed8183-8f22-42b2-af4e-d44137610fa0" type="musicbrainz" />
<reference id="musicbrainzgenre/rock-pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/classic-pop-and-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/ballad" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/soft-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/progressive-rock" type="musicbrainzgenre" />
<reference id="musicbrainz/b071f9fa-14b0-4217-8e97-eb41da73f598" type="musicbrainz" />
<reference id="musicbrainzgenre/hard-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/schlager" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock-roll" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rockabilly" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/rock-and-roll" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/garage-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/gothic-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop-and-chart" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/southern-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/art-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/classic-rock" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/other-pop" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/pop-rap" type="musicbrainzgenre" />
<reference id="musicbrainzgenre/top-40" type="musicbrainzgenre" />
</references>
</content>
XML processing
The XML is processed entirely in PL/SQL using Oracle built-in packages. In particular these standard Oracle packages are used:
dbms_lock to take a session lock
utl_http to get the XML, in chunks, from the URL
dbms_lob to hold the concatenated chunks of XML in CLOB datatype
dbms_xmlparser to parse the XML in the local CLOB
dbms_xmldom to process the parsed XML
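To indicate the kind of processing Feed 2 requires, the Python sketch below parses a heavily trimmed copy of the article example: attributes of the content element, the name/value fields, the tags, and the media assets with their own fields. It is an illustration and not the PL/SQL implementation; the function names and the shape of the returned dictionary are assumptions.

```python
import xml.etree.ElementTree as ET

# Heavily trimmed copy of the article example above.
SAMPLE = """
<content web-url="http://www.theguardian.com/music/2013/jun/01/lou-reed-liver-transplant"
         section-id="music" web-publication-date="2013-05-31T23:57:04Z"
         id="music/2013/jun/01/lou-reed-liver-transplant">
  <fields>
    <field name="headline">Lou Reed recovering after liver transplant</field>
    <field name="wordcount">314</field>
    <field name="internal-content-code">409907216</field>
  </fields>
  <tags>
    <tag type="keyword" web-title="Lou Reed" id="music/lou-reed"/>
    <tag type="contributor" web-title="Ben Quinn" id="profile/benquinn"
         r2-contributor-id="25906"/>
  </tags>
  <media-assets>
    <asset file="http://static.guim.co.uk/sys-images/Guardian/About/General/2013/6/1/1370044550896/Lou-Reed-and-Laurie-Ander-008.jpg"
           index="1" rel="body" type="picture">
      <fields>
        <field name="width">460</field>
        <field name="height">276</field>
      </fields>
    </asset>
  </media-assets>
</content>
"""


def name_value_pairs(parent):
    """Turn <fields><field name="x">y</field></fields> into {"x": "y"}."""
    return {f.get("name"): (f.text or "")
            for f in parent.findall("fields/field")}


def parse_content(xml_text):
    """Return the attributes, fields, tags and assets of one content item."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.get("id"),
        "web_url": node.get("web-url"),
        "published": node.get("web-publication-date"),
        "fields": name_value_pairs(node),
        "tags": [(t.get("type"), t.get("id"), t.get("r2-contributor-id"))
                 for t in node.findall("tags/tag")],
        "assets": [dict(file=a.get("file"), rel=a.get("rel"),
                        fields=name_value_pairs(a))
                   for a in node.findall("media-assets/asset")],
    }


item = parse_content(SAMPLE)
print(item["id"], item["fields"]["wordcount"], len(item["assets"]))
```

Unlike Feed 1, most values arrive as generic name/value pairs, so the parser has to handle fields, tags, and assets as repeating structures rather than as fixed elements.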
Error processing
As with Feed 1, this process is designed to minimize the amount of content that fails to load because of errors in the feed.
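This document does not describe the error-handling mechanism itself. One common way to meet the stated aim is to process each item independently, so that a fault in one item does not stop the others from loading. The following Python sketch illustrates that idea only. The function name, the per-item fragments, and the shape of the failure list are assumptions, not a description of the PL/SQL implementation.

```python
import xml.etree.ElementTree as ET


def load_items(item_fragments):
    """Parse each item on its own; record failures instead of aborting the run."""
    loaded, failed = [], []
    for position, fragment in enumerate(item_fragments):
        try:
            element = ET.fromstring(fragment)
            loaded.append(element.get("id"))
        except ET.ParseError as exc:
            # Keep going: one bad item should not block the rest of the feed.
            failed.append((position, str(exc)))
    return loaded, failed


fragments = [
    '<content id="a/1" />',
    '<content id="a/2">',  # malformed: the element is never closed
    '<content id="a/3" />',
]
loaded, failed = load_items(fragments)
```

In this sketch the failures are collected with their position in the feed so that they could be reported or retried later.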
Database tables loaded
This extract process loads the data into the following TLIB tables, which are necessarily different to the tables used by Feed 1.
WEB_API_CONTENT
WEB_API_CONTENT_ASSETS
WEB_API_CONTENT_ASSET_FIELDS
WEB_API_CONTENT_ELEMS
WEB_API_CONTENT_ELEM_ASSETS
WEB_API_CONTENT_FIELDS
WEB_API_CONTENT_TAGS
WEB_API_FIELDS
Each entry in the WEB_API_CONTENT table represents an item of content and is allocated a unique ID from the WCON_ID sequence. NB The same sequence is used as for Feed 1, so each item of web content is uniquely identifiable regardless of which feed it came from.
.Extraction of new content
RCS extracts (i.e. copies) metadata from the list of web content in the Text Library (TLIB) into RCS. RCS thereby maintains a list of published content, but not the content itself.
Database jobs
RCS has a number of database jobs (mtrl_extract_website_content.extract_content) that run in the background, looking for content to extract from TLIB into the RCS schema. Each job runs for a specified date range.
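As a rough illustration of a date-ranged extraction pass, the Python sketch below selects rows from a list standing in for TLIB. The row layout, the "published" field, the function name, and the rule that already-extracted items are skipped are all assumptions. The actual parameters and logic of the PL/SQL procedure are not described in this document.

```python
from datetime import date


def select_new_content(tlib_rows, extracted_wcon_ids, date_from, date_to):
    """Return rows published within [date_from, date_to] that are not yet in RCS."""
    selected = []
    for row in tlib_rows:
        in_range = date_from <= row["published"] <= date_to
        if in_range and row["wcon_id"] not in extracted_wcon_ids:
            selected.append(row)
    return selected


# Hypothetical TLIB rows; only the fields needed for the sketch are shown.
tlib_rows = [
    {"wcon_id": 101, "published": date(2013, 10, 1)},
    {"wcon_id": 102, "published": date(2013, 10, 2)},
    {"wcon_id": 103, "published": date(2013, 10, 9)},
]
new_rows = select_new_content(
    tlib_rows, {101}, date(2013, 10, 1), date(2013, 10, 3)
)
```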
Manual entry of content
Very rarely it is necessary for new content to be created manually. The entry of new content is done in the Matching history screen. This screen is the subject of a separate document in this documentation set.
Past and future content
Content extraction can only retrieve content that has already been published.
However, this may change in the future, as it has been suggested that RCS could receive complete but as yet unpublished content so that the rights management process can start earlier.
Uniqueness
Each item of content is represented by a single record in the MATERIAL table in RCS, and is assigned a unique ID from the sequence MTRL_ID.
The IDs from the WEB_CONTENT and WEB_API_CONTENT tables are also stored against this record in column WCON_ID, which ensures that only one copy of the content is held within RCS. NB Manually entered content has no WCON_ID, so to maintain uniqueness the negative of the record ID is stored in the WCON_ID column.
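The uniqueness rule can be illustrated with a small in-memory model. This Python sketch is not the RCS implementation. The class and method names are hypothetical, sequence values are assumed to be positive (so a negated record ID can never collide with a feed-assigned WCON_ID), and returning the existing record for a repeated WCON_ID is an assumption about how duplicates are handled.

```python
class MaterialStore:
    """In-memory stand-in for the MATERIAL table (illustration only)."""

    def __init__(self):
        self._next_mtrl_id = 1   # stands in for the MTRL_ID sequence
        self.wcon_by_mtrl = {}   # MTRL_ID -> WCON_ID
        self._mtrl_by_wcon = {}  # WCON_ID -> MTRL_ID, used to enforce uniqueness

    def _allocate(self):
        mtrl_id = self._next_mtrl_id
        self._next_mtrl_id += 1
        return mtrl_id

    def _store(self, mtrl_id, wcon_id):
        self.wcon_by_mtrl[mtrl_id] = wcon_id
        self._mtrl_by_wcon[wcon_id] = mtrl_id

    def add_from_feed(self, wcon_id):
        """Add feed content; a WCON_ID already held is not stored a second time."""
        if wcon_id in self._mtrl_by_wcon:
            return self._mtrl_by_wcon[wcon_id]
        mtrl_id = self._allocate()
        self._store(mtrl_id, wcon_id)
        return mtrl_id

    def add_manual(self):
        """Add manually entered content, using the negated record ID as WCON_ID."""
        mtrl_id = self._allocate()
        self._store(mtrl_id, -mtrl_id)
        return mtrl_id


store = MaterialStore()
first = store.add_from_feed(5001)
again = store.add_from_feed(5001)  # same web content arriving a second time
manual = store.add_manual()
```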
Identifying contributors
Stories and images will be received with by-lines and photographer names, but the most reliable source for identifying the contributor is the contributor tag(s). In some cases the tag includes the RCS supplier ID, but in other cases it includes only the R2ContribID. Where possible the RCS Administrator attempts to match up R2ContribIDs with RCS supplier IDs.
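The lookup order described above can be sketched as follows: prefer the RCS supplier ID when the tag carries one, otherwise fall back to a mapping from R2ContribID to supplier ID. The tag field names, the mapping, and the sample values in this Python sketch are hypothetical. The sketch also assumes that an unmatched contributor is simply left unresolved.

```python
def resolve_supplier_id(tag, r2_to_supplier):
    """Return the RCS supplier ID for a contributor tag, or None if unmatched."""
    supplier_id = tag.get("rcs_supplier_id")
    if supplier_id is not None:
        return supplier_id
    # Fall back to the matches between R2ContribIDs and supplier IDs.
    return r2_to_supplier.get(tag.get("r2_contrib_id"))


# Hypothetical matches maintained by the RCS Administrator.
r2_to_supplier = {"r2-100": 9001}

tags = [
    {"rcs_supplier_id": 9002, "r2_contrib_id": "r2-200"},  # supplier ID on the tag
    {"r2_contrib_id": "r2-100"},                           # matched via the mapping
    {"r2_contrib_id": "r2-300"},                           # not yet matched
]
resolved = [resolve_supplier_id(tag, r2_to_supplier) for tag in tags]
```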
Automated processing
As soon as an item of content arrives in RCS, and whenever key data changes on unprocessed content, a rights profile is assessed and an attempt is made to process the item automatically based on rules held within the system. Both of these processes are described in separate documentation.
.Sibling content
When the same item of content is published in more than one publication it is important that RCS is able to associate the two. These instances are referred to as siblings.
Linking the two items ensures that only one payment is made, and it also means that when one item has been processed, all sibling items can be treated in the same way and removed from the unprocessed queue.
Text bundleids
Text items have a unique identifier, known as a bundleid, which is assigned to the item during the production process. Wherever the item is published the bundleid should stay with it, and so an association can be made when multiple instances of the same item are published.
Picture PicDar URNs
Pictures should be stored in the PicDar picture library prior to being put on print or web pages. Once in PicDar each picture is given a unique reference. Provided the reference stays with the picture wherever it is published it is possible for RCS to identify multiple uses of the same image e.g. appearing in print, being used as a thumbnail on the website, and also being used as an in-article picture on the website.
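Sibling detection for both text and pictures amounts to grouping items by a shared identifier: the bundleid for text, the PicDar reference for pictures. The Python sketch below shows that grouping under assumed field names ("kind", "ref", "mtrl_id") and made-up identifiers. Items with no identifier are left ungrouped, since no association can be made for them.

```python
from collections import defaultdict


def group_siblings(items):
    """Group item IDs by (kind, shared reference); keep groups of two or more."""
    groups = defaultdict(list)
    for item in items:
        if item["ref"] is None:
            continue  # the identifier did not stay with the item
        groups[(item["kind"], item["ref"])].append(item["mtrl_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical items: "ref" is a bundleid for text, a PicDar reference for pictures.
items = [
    {"mtrl_id": 1, "kind": "text", "ref": "bundle-A"},
    {"mtrl_id": 2, "kind": "text", "ref": "bundle-A"},
    {"mtrl_id": 3, "kind": "picture", "ref": "urn-1"},
    {"mtrl_id": 4, "kind": "picture", "ref": None},
    {"mtrl_id": 5, "kind": "picture", "ref": "urn-1"},
    {"mtrl_id": 6, "kind": "text", "ref": "bundle-B"},
]
siblings = group_siblings(items)
```

The kind is included in the grouping key so that a text bundleid and a picture reference that happen to share a value are never treated as siblings.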
End of Document