Simple and efficient XML parsing using JAXB 2.0

My new default workflow for parsing an XML document type takes about 5 minutes to go from seeing the new document to generating a parser that can efficiently parse it with JAXB 2.0 at its base.
Update 2: Performance Numbers (see below)
Update: 1.5 results (see below)

One of the things I often need to do is parse an XML document that doesn’t have a DTD or schema but has a well-defined format, and load the information I find into a database. Not only do I need to parse it, I also have to be careful about how I do it, because the document could be arbitrarily large. These two requirements have often been at odds with one another when picking an XML parsing technology. On the one hand, you would like the information to be easy to understand, which implies a typed interface to the XML or at least a DOM interface. On the other hand, you need to parse using some sort of event or streaming system so you don’t use arbitrary amounts of memory, which implies SAX or StAX.

The best of both worlds exists within JAXB 2.0. Not only can you get typed objects from your XML documents, but you can also drop down to StAX when you need to stream data from the document. The biggest limitation is that JAXB needs some sort of schema in order to generate the typed objects. Using a great tool called Trang, we can generate that schema from example documents that exhibit all the various possibilities the format allows.

I decided to do the demo of this in Java 1.6 because 1) it includes JAXB 2.0 in the core classes now and 2) it just came out for the Mac and I wanted to try it within IntelliJ. One of the first things I ran into when moving to Java 1.6 is that Trang is currently incompatible with it. I haven’t looked into the cause; I just used a 1.5 VM to execute it instead. We start with an XML file called foundbugs.xml that I generated by running FindBugs 0.9.7 against itself. Then we execute Trang against it as follows:

java -jar trang.jar -I xml -O xsd foundbugs.xml findbugs.xsd

That will produce a schema from the XML document. Normally you would like to have a bunch of example documents, but if you don’t, you should look through the generated schema to make sure it isn’t too restrictive. Generally Trang does a good job of abstracting and picks a nice balance between restriction and generality. Now we need to generate our object model from that schema so we can get typed objects when we parse the document. To do this we use the JAXB 2.0 XJC tool:

xjc -d src -p com.sampullara.findbugs.om findbugs.xsd

That will generate the objects in the specified package under the given source root. Now let’s write a couple of tests to see how this works. First we’ll parse without worrying about the possible size of the document, simply creating a tree that represents the entire thing:

    public void testParseEntireDocument() throws JAXBException {
         JAXBContext ctx = JAXBContext.newInstance(new Class[] {BugCollection.class});
         Unmarshaller um = ctx.createUnmarshaller();
         BugCollection bugs = (BugCollection) um.unmarshal(new File("src-data/foundbugs.xml"));
         assertEquals(180, bugs.getBugInstance().size());
    }

That’s all there is to it for the simple case where we don’t think the documents will be big enough to cause memory problems. Let’s say, though, that we want to parse BugInstances separately, since there could be a nearly unlimited number of them, but still want to look at each instance as a typed object. This requires a few more steps than the previous test, but it’s still pretty simple and straightforward. I’ll break it into sections to make sure that we cover it sufficiently:

    public void testParseEfficiently() throws FileNotFoundException, XMLStreamException, JAXBException {
         // Parse the data, filtering so that only the start elements are seen
         XMLInputFactory xmlif = XMLInputFactory.newInstance();
         FileReader fr = new FileReader("src-data/foundbugs.xml");
         XMLEventReader xmler = xmlif.createXMLEventReader(fr);
         EventFilter filter = new EventFilter() {
             public boolean accept(XMLEvent event) {
                 return event.isStartElement();
             }
         };
         XMLEventReader xmlfer = xmlif.createFilteredReader(xmler, filter); 

In this initialization step we start parsing the document using an XMLEventReader rather than starting with a JAXB Unmarshaller. This allows us to stream the document rather than parse it all in one big chunk. We apply an additional filter so that we only see StartElement events and don’t waste time looking at events we aren’t interested in. This part of the code is mostly boilerplate, so you could abstract it out into something like createStartElementEventReader(Reader) if you find yourself doing this a lot. Next we want to skip the outer BugCollection element since it encompasses the entire document. If we were to just parse normally at this point, as above, we would have the same limitations as before. So let’s skip it:

        // Jump to the first element in the document, the enclosing BugCollection
        StartElement e = (StartElement) xmlfer.nextEvent();
        assertEquals("BugCollection", e.getName().getLocalPart()); 

Since we are testing, we verify that we did in fact jump to the correct start element. Now the fun part. Let’s assume that we actually want to look at all the child elements in all their typed glory, though we are only going to do something interesting with the BugInstances in this example. In order to parse an element using JAXB, the XMLEventReader needs to be sitting on the StartElement event of the element we want to parse. So instead of using nextEvent() we use peek() on the filtered reader, which advances to the next start element without consuming it. JAXB then parses that entire element, stopping when it consumes the matching end element. Note that the code below hands the unfiltered xmler to the unmarshaller, not the filtered reader. For this example we’ll simply count the number of BugInstances and confirm that we find the right number:

        // Parse into typed objects
        JAXBContext ctx = JAXBContext.newInstance("com.sampullara.findbugs.om");
        Unmarshaller um = ctx.createUnmarshaller();
        int bugs = 0;
        while (xmlfer.peek() != null) {
             Object o = um.unmarshal(xmler);
             if (o instanceof BugInstance) {
                 BugInstance bi = (BugInstance) o;
                 // process the bug instance
                 bugs++;
             }
        }
        assertEquals(180, bugs);
        fr.close();
     } 
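
The reader setup from the first step is the part that tends to get copied from parser to parser, so it may be worth pulling into a helper along the lines of the createStartElementEventReader idea mentioned earlier. The following is only a minimal sketch using the standard StAX API. The class name StartElementReaders and the startElementNames demo method are made up for illustration and are not part of the example project, and unlike the one-argument signature suggested above, this version takes the unfiltered reader as a parameter so the caller keeps a handle on it for JAXB.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper; the names here are illustrative only.
final class StartElementReaders {

    private StartElementReaders() {}

    /** Wraps an event reader so that only StartElement events are visible. */
    static XMLEventReader createStartElementEventReader(XMLInputFactory xmlif,
                                                        XMLEventReader xmler)
            throws XMLStreamException {
        EventFilter startElementsOnly = new EventFilter() {
            @Override
            public boolean accept(XMLEvent event) {
                return event.isStartElement();
            }
        };
        return xmlif.createFilteredReader(xmler, startElementsOnly);
    }

    /** Demo: returns the local name of every element, in document order. */
    static List<String> startElementNames(String xml) {
        try {
            XMLInputFactory xmlif = XMLInputFactory.newInstance();
            XMLEventReader xmler = xmlif.createXMLEventReader(new StringReader(xml));
            XMLEventReader xmlfer = createStartElementEventReader(xmlif, xmler);
            List<String> names = new ArrayList<>();
            // Same peek() idiom as the test: null means there are no more start elements.
            while (xmlfer.peek() != null) {
                names.add(xmlfer.nextEvent().asStartElement().getName().getLocalPart());
            }
            xmlfer.close();
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startElementNames(
                "<BugCollection><BugInstance/><BugInstance/></BugCollection>"));
    }
}
```

In the streaming test, the two factory calls and the anonymous filter would collapse into a createXMLEventReader call followed by one call to this helper.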

There you have it. We were able to drop down into StAX in order to avoid pulling the entire document into memory, yet we were still able to look at the interesting parts of the document using typed Java objects rather than SAX or StAX events. You might wonder how much more efficient this is memory-wise. In order to test that I’m going to add some code to the beginning and the end of the test methods:

         System.gc(); System.gc();
         long memstart = Runtime.getRuntime().freeMemory();
         // Parsing code ...
         System.gc(); System.gc();
         long memend = Runtime.getRuntime().freeMemory();
         System.out.println("Memory used: " + (memstart - memend)); 

I’ll also move the BugInstance we parse in the second case to a reference outside the loop so we have one in the heap when we finally check. This will just be an approximate number but it should give us a ballpark efficiency. Here are the results of running the two tests:

1. testParseEntireDocument() => Memory used: 1041592
2. testParseEfficiently() => Memory used: 30792

Quite a huge difference between the two for this document. This document’s actual size is 268487 bytes. Reading it into a string will at least double that, and it looks like the XML object overhead in memory is about double that again. Not terrible, but something to keep in mind when you are parsing documents. What I have found is that this workflow for dealing with XML documents that I need to parse is easy to use, reliable and very optimizable. What else could you ask for?

IntelliJ project with the example code: ParsingExample.jar

Update: For completeness, I also ran it under 1.5, though I had to change the way I calculated memory because 1.5 behaved differently from 1.6. Instead of using free memory I used current used memory, calculated as: Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory(). With this change and the inclusion of the jars listed below, we get the results:

1. testParseEntireDocument() => Memory used: 1143720
2. testParseEfficiently() => Memory used: 41312
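
The used-memory calculation from this update can be wrapped in a small helper so that both tests measure the same way. This is a rough sketch and not the code from the example project: the class MemoryProbe and its methods are hypothetical names, Reference.reachabilityFence assumes Java 9 or later, and System.gc() is only a hint to the VM, so the numbers remain approximate.

```java
import java.lang.ref.Reference;
import java.util.function.Supplier;

// Hypothetical helper; names are illustrative and not from the example project.
final class MemoryProbe {

    private MemoryProbe() {}

    /** Approximate bytes in use on the heap, after asking the VM to collect garbage. */
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Approximate growth in used memory while the result of the work is still reachable. */
    static long retainedBytes(Supplier<?> work) {
        long before = usedMemory();
        Object result = work.get();
        long after = usedMemory();
        // Keep the result alive until the second reading has been taken.
        Reference.reachabilityFence(result);
        return after - before;
    }

    public static void main(String[] args) {
        long bytes = retainedBytes(() -> new byte[8 * 1024 * 1024]);
        System.out.println("Memory used: " + bytes);
    }
}
```

Returning the parsed tree (or the last BugInstance) from the supplier plays the same role as moving the reference outside the loop: something has to stay on the heap when the second reading is taken.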

I had to add the following jars to my project to use JAXB 2.0 and StAX under 1.5:

activation.jar jaxb-impl.jar jaxb1-impl.jar stax-1.1.2-dev.jar
jaxb-api.jar jaxb-xjc.jar jsr173_1.0_api.jar

Update 2: I also went back and gathered some performance numbers. Environment: JDK 1.6, Mac OS X 10.4.6, MacBook Pro 2.16 GHz, running in IntelliJ:

100 iterations apiece, doing as little work as possible to count the number of BugInstances for the various strategies:

JAXB parse of entire document: 59 ms/parse
JAXB + StAX for memory efficiency: 81 ms/parse
StAX only: 46 ms/parse
SAX only: 10 ms/parse
DOM only: 60 ms/parse

As you can see, it is more expensive to use the method described here for reducing memory footprint. Also, StAX is just strictly slower than SAX, probably because of the number of allocations being done. I tried to use an alternate implementation of StAX as well (Woodstox) but encountered a bug that I couldn’t easily work around. Someone also suggested I use the XMLStreamReader instead of the XMLEventReader, but I found that filtering does not work as it should with that style.
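
For anyone who wants to reproduce this kind of comparison, here is a rough sketch of what the "SAX only" and "StAX only" counting strategies and a simple timing loop could look like. It is not the benchmark code behind the numbers above: the class CountBenchmark and its methods are hypothetical, it parses a tiny in-memory document rather than foundbugs.xml, and it does no JIT warm-up, so its output should not be compared with the table.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical benchmark sketch; names are illustrative only.
final class CountBenchmark {

    private CountBenchmark() {}

    /** Counts elements with the given name using a SAX handler. */
    static int countWithSax(String xml, final String elementName) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes attributes) {
                            // The default factory is not namespace-aware, so compare qName.
                            if (elementName.equals(qName)) {
                                count[0]++;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return count[0];
    }

    /** Counts elements with the given name using an unfiltered StAX event reader. */
    static int countWithStax(String xml, String elementName) {
        int count = 0;
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement() && elementName.equals(
                        event.asStartElement().getName().getLocalPart())) {
                    count++;
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("StAX parse failed", e);
        }
        return count;
    }

    /** Runs the task the given number of times and returns the mean wall-clock ms per run. */
    static double averageMillis(int iterations, Runnable task) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) {
        final String doc =
                "<BugCollection><BugInstance/><BugInstance/><Errors/></BugCollection>";
        System.out.println("SAX only: "
                + averageMillis(100, () -> countWithSax(doc, "BugInstance")) + " ms/parse");
        System.out.println("StAX only: "
                + averageMillis(100, () -> countWithStax(doc, "BugInstance")) + " ms/parse");
    }
}
```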

  • MILOUD

    HOW TO SERIALIZE AND DESERIALIZE WITH JAXB

  • Neel Sukhadia

    Are the performance measurements for the entire document or for a single element's parse? Thanks.

  • http://www.javarants.com spullara

    Parsing the entire document.

  • http://www.dijkstra-ict.com/ Lolke Dijkstra

    This is very interesting and it probably works for many cases but I was wondering would it work if you need both the encompassing element and its children. For example:

    <batch>
    <info>some generic info</info>
    <..>

    <!-- sequence of many transactions -->
    <tx>
    <src>…
    <addr>…
    </addr>
    </src>
    <dst>…</dst>
    <..>
    </tx>
    <!-- many more tx… -->
    </batch>

    So now I would like to process the batch having an object of type Batch:

    class Batch {
    public String getInfo()…
    public List<TxType> getTxs()…
    }

    and the Tx:

    class Tx {
    SrcType getScr()…
    DstType getDst()…

    }

    Obviously, using the sparse (memory efficient) approach I would expect the getTxs() to just return an empty List<TxType>, whereas in the straightforward approach this list would contain the actual Tx children.

    So, the thing is, when you use JAXB to unmarshal the parent, is there a way to instruct it not to store the transactions in the List and to use JAXB to process these separately?

    Thanks,
    Lolke

  • http://www.javarants.com spullara

    Hi Lolke,

    There is no way for JAXB to not parse the children of an element that you unmarshal. For your case I would use StAX to go all the way down to the Batch object, then use JAXB to parse each individual transaction.

    Sam

  • http://www.dijkstra-ict.com/ Lolke Dijkstra

    Hi Sam,

    Thanks for your swift reply!

    Actually I've developed an alternative approach based on SAX and MDE. More specifically, what I do, I use a code generator to generate the code from the schema.

    The approach uses a common framework and the generator extends these framework classes. In principle for each complextype a JavaBean class is generated, but whether or not this class should be stored with its parent is configurable. For more information please see my website: http://www.dijkstra-ict.com

    I also wrote an article that explains the approach. You can find it on the site.
    Let me know if you're interested to see how it compares to the alternative.

    Kind regards,
    Lolke Dijkstra

  • Joe

    Thanks for your great effort.
    When I tried to test the code you wrote, it threw the following exception while unmarshalling:

    java.lang.IllegalStateException: reader must be on a START_ELEMENT event, not a 4 event
    at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:368)
    at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:354)
    at com.fawry.test.TestJAXBUnmarshaller.testParseEfficiently(TestJAXBUnmarshaller.java:72)
    at com.fawry.test.TestJAXBUnmarshaller.main(TestJAXBUnmarshaller.java:31)

    I hope you have an answer to my question.

  • http://www.harward.us/~nharward/ Nathaniel Harward

    I realize this post is rather old, but in the event anyone (including Sam) is still reading it — this post got me thinking and I created something (free, GPL) somewhere in between the examples here and the LDK+ library mentioned by Lolke. It's in its infancy, but perhaps someone would find it useful.

  • http://www.harward.us/~nharward/ Nathaniel Harward

    And here's the link which I conveniently forgot :)

    http://code.google.com/p/ndh-commons/

  • Sak

    Try
    // Parse into typed objects
    JAXBContext ctx = JAXBContext.newInstance("com.sampullara.findbugs.om");
    Unmarshaller um = ctx.createUnmarshaller();
    int bugs = 0;
    while (xmlfer.peek() != null) {
        Object o = um.unmarshal(xmlfer);
        if (o instanceof BugInstance) {
            BugInstance bi = (BugInstance) o;
            // process the bug instance
            bugs++;
        }
    }
    assertEquals(180, bugs);
    fr.close();
    }

  • Gord

    Exactly what I needed. Thank you so much.

  • Lemmedaskeren

    Thank you so much! Much better and simpler than the Oracle/SUN tutorials.

  • Srinivas C

    Hi Sam, thanks a lot for this blog. It helped me complete a POC. Just wanted to share an update with the readers of the blog: I was able to incorporate Woodstox 4.1 with JAXB 2.2 into your example and the results were pretty amazing. A 350 MB file was parsed and unmarshalled in just 18 secs.

  • Prateek

    thanks..really helped me.

  • Lolke Dijkstra

    The URL is outdated. It should be dijkstra-ict.nl or dijkstra-ict.eu now.
    Alternatively: xml2j.net or xml2java.net

  • https://www.google.com/accounts/o8/id?id=AItOawmIRh1QP6GZLvotW4Bg2C-vBYG2UTZsd5Y juandiego.moreira

    Great post! Thank you soooooooo much.

  • David Webber

    We have just made available tooling to help create JAXB bindings for complex XSD schema. 

    See http://www.cameditor.org/#JAXB_Bindings for a how-to guide.

  • Jose

    excellent! it works!