XML

HTML

Link:

http://www.w3schools.com/html/html_intro.asp

* HTML stands for Hyper Text Markup Language

* An HTML file is a text file containing small markup tags

* The markup tags tell the Web browser how to display the page

* An HTML file must have an htm or html file extension

* An HTML file can be created using a simple text editor

from http://www.w3schools.com/html/html_intro.asp

Here's an example :

<html>

<head>

<title>Web Page</title>

</head>

<body>

<strong>Hello</strong>, this is a web page!

<hr/>

<font color=red>This is some red text</font>

</body>

</html>

XML

Links:

http://www.w3schools.com/xml/xml_whatis.asp

* XML stands for EXtensible Markup Language

* XML is a markup language much like HTML

* XML was designed to describe data

* XML tags are not predefined. You must define your own tags

* XML uses a Document Type Definition (DTD) or an XML Schema to describe the data

* XML with a DTD or XML Schema is designed to be self-descriptive

* XML is a W3C Recommendation

from http://www.w3schools.com/xml/xml_whatis.asp

If you use an RSS reader to access blogs or other newsfeeds, you use XML. Really Simple Syndication produces XML-based feeds summarizing frequently updated content.

Here's a real example from nytimes.com:

<?xml version="1.0" encoding="UTF-8"?>

<rss version="2.0">

<channel>

<title>NYT > Sunday Book Review</title>

<link>http://www.nytimes.com/pages/books/review/index.html?partner=rssnyt</link>

<description></description>

<language>en-us</language>

<copyright>Copyright 2007 The New York Times Company</copyright>

<lastBuildDate>Fri, 24 Aug 2007 20:05:02 GMT</lastBuildDate>

<image>

<title>NYT > Sunday Book Review</title>

<url>http://graphics.nytimes.com/images/section/NytSectionHeader.gif</url>

<link>http://www.nytimes.com/pages/books/review/index.html</link>

</image>

<item>

<title>On the Road Again</title>

<link>http://www.nytimes.com/2007/08/19/books/review/Sante2-t-1.html?ex=1345176000&amp;en=b8402a9d3d6e4457&amp;ei=5088&amp;partner=rssnyt&amp;emc=rss</link>

<description>The novel that &#8220;On the Road&#8221; became was inarguably the book that young people needed in 1957, but the sparse and unassuming scroll is the living version for our time.</description>

<author>LUC SANTE</author>

<guid isPermaLink="false">http://www.nytimes.com/2007/08/19/books/review/Sante2-t-1.html</guid>

<pubDate>Sun, 19 Aug 2007 02:56:43 GMT</pubDate>

</item>

</channel>

</rss>

Elements

Elements are surrounded by tags. Tags come in pairs. The open tag identifies the beginning of the element and the closing tag, denoted by the / before the tag name, identifies the end of the element. Tag names are genreally fairly intuitive descriptions of the data that will be contained in the element. For example, as you might expect, the text between the author tags is the name of an author. In some cases, you may find empty elements that look as follows: <description/>.

Elements may contain text, other elements, and attributes (discussed below). In the example above, the rss element contains one element, channel. The channel element contains elements title, link, description, language, copyright, lastBuildData, image, and item.

Elements form a tree structure or hierarchy. We'll talk about trees toward the end of the semester, but following is some relevant tree terminology:

    • root - The root of a tree is outtermost element, in this case rss.
    • child - The children of an element are the elements it contains. The element channel is a child of rss. The element copyright is a child of channel. The element author is a child of item.
    • sibling - The siblings of an element are the elements that share its parent. The element image is a sibling of item.

Attributes

Attributes are name, value pairs that provide some information about the characteristics of an element. In the example above, the element rss has an attribute version. The version attribute has a value of 2.0. An element may have multiple attributes.

Parsing

There are two models for parsing XML: DOM and SAX.

DOM - Document Object Model

A DOM parser reads an XML document, for example from a file, and builds a tree in memory. The programmer can then access and manipulate the information stored in the document by traversing the tree structure. Essentially, the job of the parser is to identify where elements start and end, and build objects to represent each element.

SAX - Simple API for XML

A SAX parser reads an XML document and generates events when elements are found. The user defines the actions be taken as different types of elements are found.

Namespaces

If you take a look at the NPR Story of the Day, you'll notice that the XML looks a bit different.

<?xml version="1.0" encoding="utf-8"?>

<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/include/xsl/podcast.xsl"?>

<rss version="2.0" xmlns:npr="http://www.npr.org/rss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/">

<channel>

<title>NPR: Story of the Day</title>

<link>http://www.npr.org/?ft=2&amp;f=1090</link>

<description>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</description>

<copyright>Copyright 2007 NPR - For Personal Use Only</copyright>

<generator>NPR/RSS Generator 2.0</generator>

<lastBuildDate>Thu, 30 Aug 2007 01:06:17 EDT</lastBuildDate>

<language>en-us</language>

<itunes:summary>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:summary>

<itunes:subtitle>Editors&apos; Pick. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:subtitle>

<itunes:author>National Public Radio</itunes:author>

<itunes:keywords>story,of,the,day,NPR,National Public Radio,Story of the Day,Morning Edition,All Things Considered,Fresh Air</itunes:keywords>

<image>

<url>http://media.npr.org/images/podcasts/thumbnail/npr_sotd_image_75.jpg</url>

<title>Story of the Day</title>

<link>http://www.npr.org/?ft=2&amp;f=1090</link>

</image>

<itunes:category text="Arts"/>

<itunes:category text="Society &amp; Culture"/>

<itunes:owner>

<itunes:email/>

<itunes:name/>

</itunes:owner>

<itunes:image href="http://media.npr.org/images/podcasts/primary/npr_sotd_image_300.jpg"/>

<item>

<title>New Orleans Suffers Crisis in Mental Health Care</title>

<description>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</description>

<pubDate>Thu, 30 Aug 2007 01:06:08 EDT</pubDate>

<link>http://www.npr.org/templates/story/story.php?storyId=14031894&amp;ft=2&amp;f=1090</link>

<guid>http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3</guid>

<itunes:summary>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</itunes:summary>

<itunes:duration>0:13:27</itunes:duration>

<itunes:keywords>NPR,National Public Radio,New Orleans Suffers Crisis in Mental Health Care,</itunes:keywords>

<enclosure url="http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3" length="6456767" type="audio/mpeg"/>

</item>

</channel>

</rss>

Among other things, you see a set of tags that have the prefix itunes. As you might imagine, the elements with tags beginning with itunes provide information that can be used by the iTunes program when it processes the feed. A standard RSS reader can process this same feed, but may ignore any elements with tags in the itunes namespace.

The web page: http://www.feedforall.com/directory-namespace.htm lists some other common namespaces. Notice that the same tag suffix may appear in multiple namespaces. For example, two name spaces may support a summary tag. However, using the namespace prefix enables the developer to distinguish between say itunes:summary and summary in another namespace.

XML and Java

XMLTester.java - a very simple example

Java provides both DOM and SAX parsers in the javax.xml.parsers package. The DOM parser produces a Document object, where Document is in the org.w3c.dom package. The Document represents the entire XML tree, which is comprised of Node objects. The Node class provides an API to traverse the tree. Node has several subclasses, the most notable of which are Text and Element. All components in the tree are Nodes, but some are Elements and some are Text, and there are a few other subclasses as well. Below are a few of the most relevant APIs. For a full listing, see the Java API.

javax.xml.parsers

DocumentBuilderFactory - Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents.

    • DocumentBuilderFactory newInstance() - Obtain a new instance of a DocumentBuilderFactory.
    • DocumentBuilder newDocumentBuilder() - Creates a new instance of a DocumentBuilder using the currently configured parameters.

DocumentBuilder - Defines the API to obtain DOM Document instances from an XML document. Using this class, an application programmer can obtain a Document from XML.

    • Document parse(File f) - Parse the content of the given file as an XMLdocument and return a new DOM Document object.
    • abstract Document parse(InputSource is) - Parse the content of the given input source as an XMLdocument and return a new DOM Document object.
    • Document parse(InputStream is) - Parse the content of the given InputStream as an XML document and return anew DOM Document object.
    • Document parse(InputStream is, String systemId) - Parse the content of the given InputStream as an XML document and return anew DOM Document object.
    • Document parse(String uri) - Parse the content of thegiven URI as an XML document and return a new DOM Document object.

org.w3c.dom

Node

    • NodeList getChildNodes() - A NodeList that contains all children of this node.
    • Node getFirstChild() - The first child of this node.
    • Node getLastChild() - The last child of this node.
    • Node getNextSibling() - The node immediately following this node.
    • String getNodeName() - The name of this node, depending on its type; see the table above.
    • String getNodeValue() - The value of this node, depending on its type; see the table above.

Document

    • NodeList getElementsByTagName(String tagname) - Returns a NodeList of all the Elements in document order with a given tag name and are contained in the document.

Element

    • String getAttribute(String name) - Retrieves an attribute value by name.
    • String getTagName() - The name of the element.

NodeList

    • int getLength() - The number of nodes in the list.
    • Node item(int index) - Returns the indexth item in the collection.