Simple and efficient XML parsing using JAXB 2.0

My new default workflow for parsing a new XML document type takes about five minutes: from first seeing the document to having a parser that can efficiently handle it, with JAXB 2.0 at its base.

Update 2: Performance Numbers (see below)
Update: 1.5 results (see below)

One of the things I often need to do is parse an XML document that doesn't have a DTD or schema but has a well-defined format, and load the information I find into a database. Not only do I need to parse it, I also have to be careful when I do, because the document could be arbitrarily large. These two requirements have often been at odds when picking an XML parsing technology. On the one hand, you would like it to be easy to understand the information, which implies a typed interface to the XML, or at least a DOM interface. On the other hand, you need to parse with some sort of event or streaming system so you don't use arbitrary amounts of memory, which implies SAX or StAX.

The best of both worlds exists within JAXB 2.0. Not only can you get typed objects from your XML documents, you can also drop down to StAX when you need to stream data from the document. The biggest limitation is that JAXB needs some sort of schema in order to generate the typed objects. Using a great tool called trang, we can generate one from example documents that exhibit all the possibilities the schema should allow.

I decided to do the demo in Java 1.6 because 1) it now includes JAXB 2.0 in the core classes and 2) it just came out for the Mac and I wanted to try it within IntelliJ. One of the first things I ran into when moving to Java 1.6 is that trang is currently incompatible with it; I haven't looked into the cause, I just used a 1.5 VM to execute it instead. We start with an XML file called foundbugs.xml that I generated by running FindBugs 0.9.7 against itself. Then we execute trang against it as follows:

java -jar trang.jar -I xml -O xsd foundbugs.xml findbugs.xsd

That will produce a schema from the XML document. Normally you would want a bunch of example documents; if you have only one, look through the generated schema to make sure it isn't too restrictive. Generally trang does a good job of abstracting and picks a nice balance between restriction and genericism. Now we need to generate our object model from that schema so we can get typed objects when we parse the document. To do this we use the JAXB 2.0 XJC tool:

xjc -d src -p com.sampullara.findbugs.om findbugs.xsd

That will generate the objects in the specified package under the given source root. Now let's write a couple of tests to see how this works. First we'll parse without worrying about the possible size of the document, simply creating a tree that represents the entire thing:

public void testParseEntireDocument() throws JAXBException {
    JAXBContext ctx = JAXBContext.newInstance(new Class[] {BugCollection.class});
    Unmarshaller um = ctx.createUnmarshaller();
    BugCollection bugs = (BugCollection) um.unmarshal(new File("src-data/foundbugs.xml"));
    assertEquals(180, bugs.getBugInstance().size());
}
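For reference, the root class that XJC generates looks roughly like the sketch below. The names are inferred from the document and the usual JAXB conventions; the real generated code carries more fields and annotations than shown here:

import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.annotation.*;

@XmlRootElement(name = "BugCollection")
@XmlAccessorType(XmlAccessType.FIELD)
public class BugCollection {
    // XJC exposes a repeated child element as a live list with no setter
    @XmlElement(name = "BugInstance")
    protected List<BugInstance> bugInstance;

    public List<BugInstance> getBugInstance() {
        if (bugInstance == null) {
            bugInstance = new ArrayList<BugInstance>();
        }
        return bugInstance;
    }
}

That live-list accessor is why the test above can call bugs.getBugInstance().size() directly.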

That's all there is to it for the simple case, where we don't think the documents will be big enough to cause memory problems. Let's say, though, that we want to parse BugInstances separately, since there could be a nearly unlimited number of bugs, but we still want to look at each instance as a typed object. This requires a few more steps than the previous case, but it's still pretty simple and straightforward. I'll pull it out in sections to make sure we cover it sufficiently:

public void testParseEfficiently() throws FileNotFoundException, XMLStreamException, JAXBException {
    // Parse the data, filtering so we only see start elements
    XMLInputFactory xmlif = XMLInputFactory.newInstance();
    FileReader fr = new FileReader("src-data/foundbugs.xml");
    XMLEventReader xmler = xmlif.createXMLEventReader(fr);
    EventFilter filter = new EventFilter() {
        public boolean accept(XMLEvent event) {
            return event.isStartElement();
        }
    };
    XMLEventReader xmlfer = xmlif.createFilteredReader(xmler, filter);

In this initialization step we start parsing the document with an XMLEventReader rather than a JAXB Unmarshaller. This lets us stream the document rather than parse it in one big chunk. We apply a filter so we only see StartElement events and don't waste time looking at events we aren't interested in. This part of the code is mostly boilerplate, so if you find yourself doing this a lot you could abstract it out into something like createStartElementEventReader(Reader), sketched below.
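A minimal sketch of such a helper (the method name is my own, not part of any library; imports from javax.xml.stream as in the test above):

public static XMLEventReader createStartElementEventReader(Reader reader) throws XMLStreamException {
    // Bundle the factory, event reader, and start-element filter together
    XMLInputFactory xmlif = XMLInputFactory.newInstance();
    XMLEventReader xmler = xmlif.createXMLEventReader(reader);
    return xmlif.createFilteredReader(xmler, new EventFilter() {
        public boolean accept(XMLEvent event) {
            return event.isStartElement();
        }
    });
}

One caveat: the technique below needs both the filtered and the unfiltered reader, so in practice the helper would have to hand back both.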

Next we want to skip the outer BugCollection element, since it encompasses the entire document; if we just parsed normally at this point we would have the same limitations as before. So let's skip it:

    // Jump to the first element in the document, the enclosing BugCollection
    StartElement e = (StartElement) xmlfer.nextEvent();
    assertEquals("BugCollection", e.getName().getLocalPart());

Since we are testing, we verify that we did in fact jump to the correct start element. Now the fun part. Let's assume we want to look at all the child elements in all their typed glory, though we are only going to do something interesting with the BugInstances in this example. In order to parse an element with JAXB, the XMLEventReader needs to be sitting on the StartElement event of the element you want to parse. Instead of nextEvent() we use peek(): this advances us to the start element without consuming it. JAXB then parses the entire element, stopping when it finally consumes the end element. For this example we'll simply count the BugInstances and confirm we find the right number:

    // Parse into typed objects
    JAXBContext ctx = JAXBContext.newInstance("com.sampullara.findbugs.om");
    Unmarshaller um = ctx.createUnmarshaller();
    int bugs = 0;
    while (xmlfer.peek() != null) {
        Object o = um.unmarshal(xmler);
        if (o instanceof BugInstance) {
            BugInstance bi = (BugInstance) o;
            // process the bug instance
            bugs++;
        }
    }
    assertEquals(180, bugs);
    fr.close();
}
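As an aside, if you only care about BugInstance elements you can check the element name on the peeked event and ask JAXB for a specific declared type; unmarshal(XMLEventReader, Class) returns a JAXBElement wrapper. A sketch of that variant, using the same readers as in the test above:

    // Unmarshal only BugInstance elements, selected by local name
    while (xmlfer.peek() != null) {
        StartElement se = (StartElement) xmlfer.peek();
        if ("BugInstance".equals(se.getName().getLocalPart())) {
            JAXBElement<BugInstance> element = um.unmarshal(xmler, BugInstance.class);
            BugInstance bi = element.getValue();
            // process the bug instance
            bugs++;
        } else {
            xmlfer.nextEvent(); // consume this start element and keep scanning
        }
    }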

There you have it. We were able to drop down into StAX to avoid pulling the entire document into memory, yet we could look at the interesting parts of the document as typed Java objects rather than SAX or StAX events. You might wonder how much more memory-efficient this is. To test that, I'm going to add some code to the beginning and end of the test methods:

System.gc(); System.gc();
long memstart = Runtime.getRuntime().freeMemory();
// Parsing code ...
System.gc(); System.gc();
long memend = Runtime.getRuntime().freeMemory();
System.out.println("Memory used: " + (memstart - memend));

I'll also keep a reference to the last BugInstance we parse in the second case outside the loop, so one is still in the heap when we finally check. This will only be an approximate number, but it should give us a ballpark. Here are the results of running the two tests:

1. testParseEntireDocument() => Memory used: 1041592
2. testParseEfficiently() => Memory used: 30792

Quite a huge difference between the two for this document. The document's actual size is 268,487 bytes. Reading it into a String at least doubles that, since Java strings store two bytes per character, and it looks like the in-memory XML object overhead roughly doubles that again. Not terrible, but something to keep in mind when you are parsing documents. What I have found is that this workflow for XML documents I need to parse is easy to use, reliable, and very optimizable. What else could you ask for?

IntelliJ project with the example code: ParsingExample.jar

Update: For completeness, I also ran it under 1.5, though I had to change the way I was calculating memory because 1.5 behaved differently than 1.6. Instead of free memory I used current used memory, calculated as Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory(). With this change (and the jars listed below) we get these results:

1. testParseEntireDocument() => Memory used: 1143720
2. testParseEfficiently() => Memory used: 41312

I had to add the following jars to my project to use JAXB 2.0 and StAX under 1.5:

activation.jar, jaxb-api.jar, jaxb-impl.jar, jaxb-xjc.jar, jaxb1-impl.jar, jsr173_1.0_api.jar, stax-1.1.2-dev.jar

Update 2: I also went back and gathered some performance numbers. Environment: JDK 1.6, Mac OS X 10.4.6, 2.16 GHz MacBook Pro, running in IntelliJ:

100 iterations apiece, doing as little work as possible to count the number of BugInstances with each strategy:

JAXB parse of entire document: 59 ms/parse
JAXB + StAX for memory efficiency: 81 ms/parse
StAX only: 46 ms/parse
SAX only: 10 ms/parse
DOM only: 60 ms/parse
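For concreteness, a timing loop along these lines produces the per-parse numbers, where parseOnce() is a hypothetical stand-in for a single parse with one of the strategies:

    // parseOnce() is a placeholder for one parse with a given strategy
    long start = System.currentTimeMillis();
    for (int i = 0; i < 100; i++) {
        parseOnce();
    }
    System.out.println((System.currentTimeMillis() - start) / 100 + " ms/parse");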

As you can see, the method described here for reducing memory footprint is more expensive in time. Also, StAX is strictly slower than SAX here, probably because of the number of event objects being allocated. I tried an alternate implementation of StAX as well (Woodstox) but encountered a bug that I couldn't easily work around. Someone also suggested using XMLStreamReader instead of XMLEventReader, but I found that filtering does not work as it should with that style.
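For reference, the SAX-only strategy amounts to counting start tags in a handler. A minimal sketch, assuming the default (non-namespace-aware) parser so the element name arrives as qName:

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BugInstanceCounter extends DefaultHandler {
    private int bugs = 0;

    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Count BugInstance start tags; nothing is built beyond the SAX events
        if ("BugInstance".equals(qName)) {
            bugs++;
        }
    }

    public static void main(String[] args) throws Exception {
        BugInstanceCounter handler = new BugInstanceCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(new File("src-data/foundbugs.xml"), handler);
        System.out.println(handler.bugs + " BugInstances");
    }
}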