XML Unmarshalling in Java: JAXB vs StAX vs Woodstox

What is XML Unmarshalling?

Marshalling is the process of transforming the in-memory representation of an object into a data format suitable for storage or transmission. It is typically used when data must be moved between different parts of a computer program or from one program to another. Marshalling is similar to serialization and is used to communicate with remote objects by passing them a serialized object. It simplifies complex communication, since custom or complex objects can be exchanged instead of primitives. The reverse of marshalling is called unmarshalling (or demarshalling, similar to deserialization).

The question here is how to deal with large amounts of XML data in a resource-friendly way. The main problem is processing large XML files in chunks while at the same time providing upstream/downstream systems with some data to process.

The main advantage of using JAXB is the quick time-to-market: if one possesses an XML schema, there are tools out there to generate the corresponding Java domain model classes automatically (Eclipse Indigo, Maven JAXB plugins in various flavours, Ant tasks, to name a few). The JAXB API then offers a Marshaller and an Unmarshaller to write/read XML data, mapping it to the Java domain model. JAXB keeps the whole object representation of the XML document in memory, so the obvious question was: “How would our infrastructure cope with large XML files (e.g. in my case with more than 100,000 elements) if we were to use JAXB?”. I could simply have produced a large XML file, written a client for it and measured the memory consumption.

As one probably knows, there are mainly two approaches to processing XML data in Java: DOM and SAX. With DOM, the XML document is represented in memory as a tree; DOM is useful if one needs cherry-pick access to the tree nodes or if one needs to write brief XML documents. On the other side of the spectrum there is SAX, an event-driven technology, where the whole document is parsed one XML element at a time, and for each significant XML event (the start of the document, the start or end of an element, etc.) a callback such as startDocument, startElement or endElement is “pushed” to a Java client, which then deals with it. Since SAX does not bring the whole document into memory but applies a cursor-like approach to XML processing, it does not consume huge amounts of memory. The drawback with SAX is that it processes the whole document from start to finish; this is not necessarily what one wants for large XML documents. In my scenario, for instance, I’d like to be able to pass XML elements to downstream systems as they become available, but at the same time maybe I’d like to pass only 100 elements at a time, implementing some sort of pagination solution. DOM seems too demanding from a memory-consumption point of view, whereas SAX seems too coarse-grained for my needs.
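To make the push model concrete, here is a minimal sketch of a SAX client using only the parser that ships with the JDK. The class names (SaxPushExample, NameHandler), the readNames helper and the sample <people> document are my own illustrations, not part of any benchmark. The point is that the parser drives the loop and the handler merely reacts to callbacks.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxPushExample {

    /** Hypothetical handler: collects the text of every <name> element. */
    static final class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder text; // non-null only while inside a <name> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("name".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName) && text != null) {
                names.add(text.toString());
                text = null;
            }
        }
    }

    /** Parses the whole document; the parser pushes events to the handler. */
    static List<String> readNames(String xml) {
        NameHandler handler = new NameHandler();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.names;
    }

    public static void main(String[] args) {
        String xml = "<people><person><name>Ada</name></person>"
                   + "<person><name>Linus</name></person></people>";
        System.out.println(readNames(xml));
    }
}
```

Note that parse() only returns once the whole document has been consumed: the client has no natural place to say “stop after 100 elements and hand them over”, which is the coarse-grained behaviour described above.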

This is where StAX comes in: a Java technology which offers a middle ground, with the capability to pull XML elements (as opposed to having them pushed, as with SAX) while being RAM-friendly. Having said that, StAX would probably be the compromise; however, if you wanted to keep the easy programming model offered by JAXB, you would really need a combination of the two. Woodstox is yet another, faster XML parser, itself an implementation of the StAX API. After looking here and there, this is what I ended up finding on a benchmarking website which evaluates JAXB, StAX and Woodstox.
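As an aside, the pull-based, paginated reading described above might look like the sketch below. It uses the StAX implementation bundled with the JDK; the class name StaxBatchExample, the readInBatches helper, the <name> element and the downstream callback are all assumptions made for illustration. Here the client, not the parser, decides when to ask for the next event, so it can hand over a batch and then carry on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxBatchExample {

    /**
     * Pulls <name> elements one at a time and passes them downstream in
     * batches of at most batchSize. Returns the total number of elements read.
     */
    static int readInBatches(String xml, int batchSize, Consumer<List<String>> downstream) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Sensible default when the input is not fully trusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        int total = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            List<String> batch = new ArrayList<>(batchSize);
            while (reader.hasNext()) {
                // Nothing further is parsed until the client asks for the next event.
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "name".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    batch.add(reader.getElementText());
                    total++;
                    if (batch.size() == batchSize) {
                        downstream.accept(batch);
                        batch = new ArrayList<>(batchSize);
                    }
                }
            }
            if (!batch.isEmpty()) {
                downstream.accept(batch); // final, possibly smaller, batch
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return total;
    }

    public static void main(String[] args) {
        String xml = "<people><name>Ada</name><name>Linus</name><name>Grace</name></people>";
        int total = readInBatches(xml, 2, batch -> System.out.println("batch: " + batch));
        System.out.println("total: " + total);
    }
}
```

Only one batch is held in memory at a time. To keep the JAXB programming model, the same loop could presumably pass the reader to JAXB's Unmarshaller.unmarshal(XMLStreamReader, Class) at each start element instead of calling getElementText(); that variant needs the JAXB library on the classpath, so it is not shown here.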

Conclusions:

The results across all three environments, despite some differences, tell the same story:

If you are looking for performance (e.g. XML unmarshalling speed), choose JAXB.

If you are looking for low memory usage (and are ready to sacrifice some speed), then use StAX.

Sources: Google, Wikipedia, java.dzone.com