Simple and efficient XML parsing using JAXB 2.0
My new default workflow for parsing an XML
document type takes about 5 minutes to go from seeing the new document to
generating a parser that can efficiently parse it with JAXB 2.0 at its
base.
Update 2: Performance Numbers (see below)
Update: 1.5 results (see below)

One of the things I often need
to do is to parse an XML document that doesn't have a DTD or schema but has a
well-defined format, and load the information I find into a database. Not only
do I need to parse it, but I also have to be careful about how I do it, because
the XML document could be arbitrarily large. These two
requirements have often been at odds with one another when picking an
XML parsing technology. On the one hand, you would like the information to be
easy to understand, which implies a typed interface to
the XML or at least a DOM interface. On the other hand, you need to parse using
some sort of event or streaming system so you don't use arbitrary amounts of
memory, which implies SAX or StAX.

The best of both worlds exists
within JAXB 2.0. Not only can you get typed objects from your XML documents but
you can also drop down to StAX when you need to stream data from the document.
The biggest limitation is that JAXB does need some sort of schema in order to
generate the typed objects. Using a great tool called Trang, we
can actually generate one from example documents that exhibit all the various
possibilities that may be in the schema.

I decided to do the demo of
this in Java 1.6 because 1) it includes JAXB 2.0 in the core classes now and 2)
it just came out for the Mac and I wanted to try it within IntelliJ. One of the
first things I ran into when moving to Java 1.6 is that Trang is currently
incompatible with it. I haven't looked into the cause; I just used a 1.5 VM to
execute it instead. We start with an XML file called foundbugs.xml that I
generated by running FindBugs 0.9.7 against itself. Then we execute Trang
against it as follows:

    java -jar trang.jar -I xml -O xsd foundbugs.xml findbugs.xsd

That will produce a schema
from the XML document. Normally you would like to have a bunch of examples, but
if you don't, you should look through the generated schema to make sure it isn't
too restrictive. Generally Trang does a good job of abstracting it and
picks a nice balance between restriction and generality. Now we need to generate
our object model from that schema so we can get typed objects when we parse the
document. To do this we use the JAXB 2.0 XJC tool:

    xjc -d src -p com.sampullara.findbugs.om findbugs.xsd

That will generate the
objects in the specified package under the given source root. Now let's write a
couple of tests to see how this works. First we'll parse without worrying about
the possible size of the document, simply creating a tree that represents the
entire thing:

    public void testParseEntireDocument() throws JAXBException {
        // Build a context that knows about the generated root class
        JAXBContext ctx = JAXBContext.newInstance(new Class[] {BugCollection.class});
        Unmarshaller um = ctx.createUnmarshaller();
        // Unmarshal the whole file into one in-memory object tree
        BugCollection bugs = (BugCollection) um.unmarshal(new File("src-data/foundbugs.xml"));
        assertEquals(180, bugs.getBugInstance().size());
    }

That's all there is to it for the simple
case where we don't think that the documents will be big enough to cause memory
problems. Let's say, though, that we want to parse BugInstances separately, since
there could be a nearly unlimited number of bugs, but still want to look at each
instance as a typed object. This will require a few more steps than the
previous one, but it's still pretty simple and straightforward. I'll pull it out
in sections to make sure that we cover it sufficiently:

    public void testParseEfficiently() throws IOException, XMLStreamException, JAXBException {
        // Parse the data, keeping only the start elements
        XMLInputFactory xmlif = XMLInputFactory.newInstance();
        FileReader fr = new FileReader("src-data/foundbugs.xml");
        XMLEventReader xmler = xmlif.createXMLEventReader(fr);
        EventFilter filter = new EventFilter() {
            public boolean accept(XMLEvent event) {
                return event.isStartElement();
            }
        };
        XMLEventReader xmlfer = xmlif.createFilteredReader(xmler, filter);

In this initialization step we
start parsing the document using an XMLEventReader rather than starting
with a JAXB Unmarshaller. This will allow us to stream the document rather than
parse it all in one big chunk. We apply an additional filter so that we
only see StartElement events and don't waste time looking at events we
aren't interested in. This part of the code is mostly boilerplate, so you
could abstract it out into something like createStartElementEventReader(Reader)
if you find yourself doing this a lot. Next we want to skip the outer
BugCollection element since it encompasses the entire document. If we were to
just parse normally at this point, as above, we would have the same sort of
limitations as before. So let's skip it:

        // Jump to the first element in the document, the enclosing BugCollection
        StartElement e = (StartElement) xmlfer.nextEvent();
        assertEquals("BugCollection", e.getName().getLocalPart());

Since we are testing, we verify that
we did in fact jump to the correct start element. Now the fun part. Let's
assume that we actually want to look at all the child elements in all their
typed glory though we are only going to do something interesting with the
BugInstances in this example. In order to parse an element using JAXB we need
to have the XMLEventReader sitting on the StartElement event of the element we
want to parse. Instead of using nextEvent() we are going to use peek() on the
filtered reader. This will advance us to the start element but not consume it.
JAXB will then parse
that entire element, stopping when it finally consumes the end element. For
this example we'll simply count the number of BugInstances and confirm that we
find the right number:

        // Parse into typed objects
        JAXBContext ctx = JAXBContext.newInstance("com.sampullara.findbugs.om");
        Unmarshaller um = ctx.createUnmarshaller();
        int bugs = 0;
        // peek() on the filtered reader stops at the next start element without consuming it
        while (xmlfer.peek() != null) {
            // Unmarshal from the unfiltered reader so JAXB sees the whole element
            Object o = um.unmarshal(xmler);
            if (o instanceof BugInstance) {
                BugInstance bi = (BugInstance) o;
                // process the bug instance
                bugs++;
            }
        }
        assertEquals(180, bugs);
        fr.close();
    }

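The boilerplate at the top of the test could be pulled out along the lines suggested earlier. The following is only a sketch: the class name StartElementReaders and the countStartElements method are made up for illustration, and it uses StAX alone (no JAXB) on a tiny inline document rather than foundbugs.xml. Note that the test above also needs the unfiltered reader to hand to um.unmarshal(), so a helper used for that purpose would have to expose both readers, not just the filtered one.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class StartElementReaders {

    // Hypothetical helper, named after the suggestion in the text.
    // Returns a reader that only reports StartElement events.
    static XMLEventReader createStartElementEventReader(Reader in) throws XMLStreamException {
        XMLInputFactory xmlif = XMLInputFactory.newInstance();
        XMLEventReader xmler = xmlif.createXMLEventReader(in);
        EventFilter filter = new EventFilter() {
            public boolean accept(XMLEvent event) {
                return event.isStartElement();
            }
        };
        return xmlif.createFilteredReader(xmler, filter);
    }

    // Counts start elements with the given local name, streaming through the document.
    static int countStartElements(String xml, String localName) throws XMLStreamException {
        XMLEventReader xmlfer = createStartElementEventReader(new StringReader(xml));
        int count = 0;
        while (xmlfer.peek() != null) {
            // peek() did not consume the event, so nextEvent() returns that same start element
            StartElement e = (StartElement) xmlfer.nextEvent();
            if (localName.equals(e.getName().getLocalPart())) {
                count++;
            }
        }
        xmlfer.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<BugCollection><BugInstance type=\"A\"/><Errors/>"
                + "<BugInstance type=\"B\"/></BugCollection>";
        System.out.println(countStartElements(xml, "BugInstance")); // prints 2
    }
}
```
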
There you have it. We were able to
drop down into StAX in order to avoid pulling the entire document into memory
yet we were able to look at the interesting parts of the document using typed
Java objects rather than SAX or StAX events. You might wonder how much more
efficient this is memory-wise. To test that, I'm going to add some code
to the beginning and the end of the test methods:

        System.gc(); System.gc();
        long memstart = Runtime.getRuntime().freeMemory();
        // Parsing code ...
        System.gc(); System.gc();
        long memend = Runtime.getRuntime().freeMemory();
        System.out.println("Memory used: " + (memstart - memend));

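If you want to reuse this measurement, it could be wrapped in a small helper. This is a rough sketch under assumptions: the MemoryProbe class and its method names are invented here, it computes used memory as totalMemory() minus freeMemory() so that heap growth between the two readings is accounted for, and it keeps the parsed result reachable until after the second reading. Like the snippet above, it only gives a ballpark figure, since System.gc() is just a hint to the VM.

```java
import java.util.function.Supplier;

class MemoryProbe {

    // Holds the result of the measured work so it cannot be collected early.
    private static volatile Object retained;

    // Approximate number of bytes currently in use on the heap.
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Runs the work and returns the approximate growth in used memory
    // while the work's result is still reachable.
    static long measureRetained(Supplier<?> work) {
        System.gc(); System.gc();
        long before = usedMemory();
        retained = work.get();
        System.gc(); System.gc();
        long after = usedMemory();
        retained = null; // let the result be collected after measuring
        return after - before;
    }

    public static void main(String[] args) {
        // Stand-in workload; in the tests above this would be the parsing code.
        long used = measureRetained(() -> new byte[1 << 20]);
        System.out.println("Memory used: " + used);
    }
}
```
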
I'll also move the BugInstance we
parse in the second case to a reference outside the loop so we have one in the
heap when we finally check. This will just be an approximate number but it
should give us a ballpark efficiency. Here are the results of running the two
tests:

1. testParseEntireDocument() => Memory used: 1041592
2. testParseEfficiently() => Memory used: 30792

Quite a huge difference between
the two for this document. This document's actual size is 268487 bytes. Reading
it into a string will double that at least and it looks like the XML object
overhead in memory is about double that again. Not terrible, but something to
keep in mind when you are parsing documents. What I have found is that this
workflow for dealing with XML documents that I need to parse is easy to use,
reliable and very optimizable. What else could you ask for?

IntelliJ project with the example code: ParsingExample.jar

Update:
For completeness, I also ran it under 1.5 though I had to change the way I was
calculating memory because 1.5 had different behavior than 1.6. Instead of using
free memory I used current used memory, calculated as:
Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory(). With
this change and the inclusion of the jars listed below, we get the results:

1. testParseEntireDocument() => Memory used: 1143720
2. testParseEfficiently() => Memory used: 41312

I had to add the following jars to my project to use JAXB 2.0 and StAX under
1.5:

activation.jar
jaxb-impl.jar
jaxb1-impl.jar
stax-1.1.2-dev.jar
jaxb-api.jar
jaxb-xjc.jar
jsr173_1.0_api.jar

Update 2: I also went back and did some performance numbers. Environment: JDK 1.6, Mac
OS X 10.4.6, MacBook Pro 2.16 GHz, running in IntelliJ.

100 iterations apiece, doing as little work as possible to count the number of
BugInstances for the various strategies:

JAXB parse of entire document: 59 ms/parse
JAXB + StAX for memory efficiency: 81 ms/parse
StAX only: 46 ms/parse
SAX only: 10 ms/parse
DOM only: 60 ms/parse

As you can see it
is more expensive to use the method described here for reducing memory footprint. Also,
StAX is simply slower than SAX in this test, probably because of the number of
allocations being done. I tried to use an alternate implementation of StAX as
well (Woodstox) but encountered a bug
that I couldn't easily work around. Someone also suggested I use the
XMLStreamReader instead of the XMLEventReader but I found that filtering does
not work as it should with that style.
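
The harness behind these numbers isn't shown in the post, so here is a guess at what the "SAX only" strategy plus a timing loop might look like. Everything in it is an assumption for illustration: the SaxBugCounter class, its method names, and the use of System.nanoTime() for timing. It parses an in-memory byte array so that file I/O doesn't dominate, and the figures it prints will not match the ones above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxBugCounter {

    // Counts <BugInstance> start tags using SAX callbacks only.
    static int countBugInstances(byte[] xml) throws Exception {
        final int[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so match on qName
                if ("BugInstance".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    // Average wall-clock milliseconds per parse over the given number of iterations.
    static double averageMillis(byte[] xml, int iterations) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            countBugInstances(xml);
        }
        return (System.nanoTime() - start) / 1e6 / iterations;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<BugCollection><BugInstance/><BugInstance/></BugCollection>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("SAX only: " + averageMillis(xml, 100) + " ms/parse");
    }
}
```

The same averageMillis loop could wrap the JAXB, StAX and DOM variants to compare them on equal terms.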
Posted: Sun - April 30, 2006 at 08:38 AM