XML stands for Extensible Markup Language.XML was designed to carry data not to display data.XML was designed to transport and store data, with focus on what data is.
An XML Parser is a parser that is designed to read XML and create a way for programs to use XML. There are different types, and each has its advantages.
The different types of parsers are:- a) SAX b) DOM c) PULL
SAX PARSER
SAX stands for Simple API for XML.An event-based API (such as SAX) uses calls to report parsing events to the application.SAX is a lexical, event-driven interface in which a document is read serially.SAX is a push parser, since it pushes events out to the calling application.
DOM PARSER
DOM stands for Document Object Model. It differs from SAX in that it builds the entire XML document representation in memory and then hands the calling program the whole chunk of memory.
PULL PARSER
Pull parsers are useful in streaming applications.It and wait for the application to come calling.Pull parsing treats the document as a series of items which are read in sequence using the Iterator design pattern.It creates an iterator that sequentially visits the various elements, attributes, and data in an XML document.
Examples of pull parsers include StAX in the Java programming language.
Problem with Parsing
Here we will discuss about the DOM Parser and commonly faced problem.DOM parser has Pl/Sql interface.
Sample Code for XML Parsing using DOM Parser
FUNCTION PARSED_DOCUMENT ( p_directory IN VARCHAR2,
p_input_file_name IN VARCHAR2
)
RETURN xmldom.DOMDocument IS
l_parser xmlparser.parser;
l_retval xmldom.DOMDocument;
Begin
EXECUTE IMMEDIATE('CREATE OR REPLACE DIRECTORY TESTXML_IN AS '''||p_directory||'''');
l_parser := xmlparser.newParser;
xmlparser.parse (l_parser, p_directory || '/' || p_input_file_name);
l_retval := xmlparser.getDocument (l_parser);
dbms_xmlparser.freeParser (l_parser);
RETURN l_retval;
END PARSED_DOCUMENT;
Problem With above DOM Parsing
The above code will work fine for small file. If the file/line size is more than 32K, the above code will give Nasty error.The error message will read like
ORA-31001: Invalid resource handle or path name
"/ora_shared/R12/PROD/apps/apps_st/comn/temp/Applicants_20110316030008.xml"
ORA-06512: at "SYS.XDBURITYPE", line 11
ORA-06512: at "XDB.DBMS_XSLPROCESSOR", line 142
ORA-29280: invalid directory path
ORA-29280: invalid directory path
ORA-29284: file read error
ORA-06512: at line 9
Reason for Error
This is due to the limitation of DBMS_XMLPARSER .In 10g onwards xmlparser is replaced with DBMS_XMLPARSER. DBMS_XMLparser usages the UTL_FILE which has limitation of 32k.This is due to the bug 5204876.This is bug is not fixed still.Bug fix dependednt on some another internal bug.
Probable Solution
There are different ways to solve/bypass the above error. One of the possible solution is
--> Read the XMLfile into a CLOB variable first using DBMS_LOB.LOADCLOBFROMFILE and then use DBMS_XMLPARSER.PARSECLOB
As per the solution, instead of directly parse the XML document, First store the document in a CLOB variable now parse the Clob variable.
Sample Code for XML Parsing
FUNCTION PARSED_DOCUMENT ( p_directory IN VARCHAR2,
p_input_file_name IN VARCHAR2
)
RETURN xmldom.DOMDocument IS
l_parser xmlparser.parser;
l_retval xmldom.DOMDocument;
l_xml_data CLOB;
l_bfile_locator BFILE;
l_dest_offset INTEGER:=1;
l_src_offset INTEGER:=1;
l_Char_set_id NUMBER := NLS_CHARSET_ID('UTF8');
l_lang_context INTEGER := dbms_lob.default_lang_ctx;
l_warning INTEGER;
BEGIN
EXECUTE IMMEDIATE('CREATE OR REPLACE DIRECTORY EXPXML_IN AS '''||p_directory||'''');
/*First Load XML file into a CLOB variable.
*/
/*Start of reading XML file and storing it in CLOB variable
*/
l_bfile_locator:=bfilename('EXPXML_IN',p_input_file_name);
dbms_lob.createtemporary(l_xml_data, cache=>FALSE);
dbms_lob.OPEN(l_bfile_locator, dbms_lob.lob_readonly);
dbms_lob.loadclobfromfile(l_xml_data,
l_bfile_locator,
dbms_lob.getlength(l_bfile_locator),
l_dest_offset,
l_src_offset,
l_Char_set_id,
l_lang_context,
l_warning
);
dbms_lob.CLOSE(l_bfile_locator);
/*End of reading XML file and storing it in CLOB variable*/
/*Now Parse the XML stored in CLOB variable*/
l_parser := xmlparser.newParser;
xmlparser.parseclob(l_parser,l_xml_data);
l_retval := xmlparser.getDocument (l_parser);
dbms_xmlparser.freeParser (l_parser);
RETURN l_retval;
END PARSED_DOCUMENT;
Note:-
1) Rule of thumb for choosing between DOM and SAX:
DOM:
* DOM is very good when you need some sort of random access.
* DOM consumes more memory.
* DOM is also good when you are trying to transformations of some sort.
* DOM is also good when you want to have tree iteration and want to walk through the entire document tree.
* See if you can use more attributes over elements in your XML (to reduce the pipe size).
SAX:
* SAX is good when data comes in a streaming manner (using some input stream).
2) Pull parsers are useful in streaming applications, which are areas where either the data is too large to fit in memory, or the data is being assembled just in time for the next stage to use it. It is designed to be used with large data sources, and unlike SAX which returns every event, the pull parser can choose to skip events that it is not interested in.
3) DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed.
4) The DOM interface is most useful for structural manipulations of the XML tree, such as reordering elements, adding or deleting elements and attributes, renaming elements, and so on. For example, for the immediately preceding XML document, the DOM creates an in-memory tree structure
5) SAX is useful for applications that do not need to manipulate the XML tree, such as search operations, among others. The preceding XML document becomes a series of linear events.It does not build in-memory tree representations of the XML documents.
References
1) https://metalink.oracle.com
2) http://en.wikipedia.org/wiki/XML#Programming_interfaces
3) http://download.oracle.com/docs/cd/B14099_19/web.1012/b14033/adx03paj.htm#i1013320