I used the built-in pulldom package (xml.dom.pulldom). Its big advantage is that it is event-driven: it streams through the document and only expands an element into a DOM subtree when you ask for it. That is very useful when you just need to scan a file and don't need to keep the whole document's context in memory.
In this example I tried minidom (xml.dom.minidom) first, but memory usage climbed to 90% and it was really slow. Then I tried the third-party package lxml, a Pythonic binding for the libxml2 and libxslt libraries. It works fine, but when it builds the full tree it still has to keep the whole document in memory; I think it is very useful when you need the context information of the XML file. In my case, though, I only need to scan all the citizen elements and expand each one when I encounter it, so pulldom is very effective here.
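Code fragment:
from xml.dom import pulldom

# doc is the event stream returned by pulldom.parse() for the input file
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.localName == "citizen":
        citizen = Citizen()
        # Pull in the children of this <citizen> element only
        doc.expandNode(node)
        parseCitizen(citizen, node)
        citizens.append(citizen)
        lengthC = len(citizens)
        # Write out the latest batch every 100 citizens
        if lengthC % 100 == 0:
            outputCitizen(citizens, lengthC - 100, lengthC)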
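The fragment above depends on helpers that are not shown here (Citizen, parseCitizen, outputCitizen). As a rough, self-contained sketch of the same idea, the snippet below scans a small in-memory document, expands each citizen element, and hands the results out in batches. The sample XML, the name and age fields, and the helper names child_text, iter_citizens and batches are all made up for illustration; the real file's schema may look different.

```python
from xml.dom import pulldom

# Hypothetical sample data; the real file's schema is not shown in this post.
SAMPLE = """<citizens>
  <citizen><name>Ann</name><age>31</age></citizen>
  <citizen><name>Bob</name><age>45</age></citizen>
  <citizen><name>Cid</name><age>27</age></citizen>
</citizens>"""


def child_text(node, tag):
    """Return the text of the first <tag> descendant, or '' if there is none."""
    found = node.getElementsByTagName(tag)
    if not found:
        return ""
    return "".join(
        child.data
        for child in found[0].childNodes
        if child.nodeType == child.TEXT_NODE
    )


def iter_citizens(xml_text):
    """Yield one dict per <citizen>, expanding only that element."""
    doc = pulldom.parseString(xml_text)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.localName == "citizen":
            doc.expandNode(node)  # node now holds the full <citizen> subtree
            yield {
                "name": child_text(node, "name"),
                "age": int(child_text(node, "age")),
            }


def batches(items, size):
    """Group items into lists of at most `size`, including the last partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Batch size 2 here just to keep the sample small; the fragment above uses 100.
result = list(batches(iter_citizens(SAMPLE), 2))
```

Unlike the fragment above, this sketch also emits the final partial batch and does not keep already-written citizens in a growing list, which may or may not matter for your use case.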
The events pulldom supports:
START_ELEMENT, END_ELEMENT, COMMENT, START_DOCUMENT, END_DOCUMENT, CHARACTERS, PROCESSING_INSTRUCTION, IGNORABLE_WHITESPACE
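To get a feel for these events, here is a small sketch that walks a tiny document without expanding anything and simply tallies which events show up. The sample string and the variable names are only for illustration.

```python
from collections import Counter
from xml.dom import pulldom

# Hypothetical sample document, just for illustration.
SAMPLE = "<root><item>one</item><item>two</item></root>"

counts = Counter()  # event name -> number of times it was seen
texts = []          # text collected from CHARACTERS events

for event, node in pulldom.parseString(SAMPLE):
    counts[event] += 1
    if event == pulldom.CHARACTERS:
        texts.append(node.data)

# Print the tally, one event type per line.
for name, seen in sorted(counts.items()):
    print(name, seen)
```

Since nothing is expanded here, each element produces its own START_ELEMENT and END_ELEMENT pair, and the text between tags arrives as CHARACTERS events.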