Introduction

Bagri: The Native XML DB

In the modern world, where data plays a crucial part in business, XML is everywhere. It is widely used across IT for human-readable data interchange. Nearly every industry has its own set of XML schemas (XSD) that define the structure of the XML documents “traveling” between parties (Finance: FpML, FIXML; Software Modeling: XMI, SysML; Oil & Gas: WITSML, PIDX; Publishing: DITA, DocBook; News: RSS, Atom; Music: XSPO; Health Care: HL7; Logistics & Transportation: GS1, SPEC2000; and so on).

Even according to a rather dated Gartner report, the number of XML documents produced exceeded thousands of billions a long time ago, and it is still growing every day. All of them have to be processed and, in most cases, stored for subsequent analysis.

How is this task solved today? In most cases, incoming XML documents are parsed with a standard technology such as DOM, SAX, StAX, or JAXB; the processing application then extracts the relevant data fields from the XML body and stores them in an RDBMS. Subsequent data analysis is performed via SQL queries. When the system must produce output in XML format, it does so “on the fly” from data selected from the DB. This is a well-known way to work with XML, but it has a number of drawbacks:

  • It is hard to describe a relatively complex XML structure in an entity-relationship model, and even harder to retrieve the data from the underlying relational tables efficiently
  • The format of incoming XML documents evolves. New elements and groups appear, so old DB schemas have to be significantly reworked in order to store them, yet analysis still has to cover both “old” and “new” data
  • Data extraction requirements evolve too. Extracting a few additional XML fields starting with the next project release looks like a simple requirement, but there is no simple way to obtain the values of these fields for documents that have already been processed
  • Parsing and validating incoming XML documents are time- and memory-consuming operations
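
The conventional pipeline described above can be sketched roughly as follows. This is a minimal illustration only: it uses Python's standard library (xml.etree.ElementTree and an in-memory SQLite database) in place of the Java parsers and the production RDBMS mentioned in the text, and the trade document, its element names, and the table layout are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical incoming document; element names are illustrative only.
TRADE_XML = """<trade id="T-1">
  <party>ACME</party>
  <amount currency="USD">1500.00</amount>
  <settlementDate>2015-03-01</settlementDate>
</trade>"""


def extract_fields(xml_text):
    """Parse one document and keep only the fields the table knows about."""
    root = ET.fromstring(xml_text)
    amount = root.find("amount")
    return (
        root.get("id"),
        root.findtext("party"),
        float(amount.text),
        amount.get("currency"),
    )


def store(conn, row):
    """Insert the extracted fields as one relational row."""
    conn.execute(
        "INSERT INTO trade (id, party, amount, currency) VALUES (?, ?, ?, ?)",
        row,
    )


# Assumed relational schema: one flat table for the extracted fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade (id TEXT PRIMARY KEY, party TEXT, amount REAL, currency TEXT)"
)
store(conn, extract_fields(TRADE_XML))

# Subsequent analysis runs as plain SQL over the extracted columns.
total = conn.execute(
    "SELECT SUM(amount) FROM trade WHERE currency = 'USD'"
).fetchone()[0]
print(total)
```

Note that settlementDate is parsed but never stored, because the table has no column for it. Recovering it later would mean re-processing the original documents, which is exactly the third drawback listed above.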

 

To address these and other drawbacks, the IT community came up with the XML database. Such systems receive data in XML form and then store and process it transparently as XML documents. They provide a set of common languages for efficient XML processing: XPath, XSLT, and XQuery.
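
To give a feel for querying documents that are kept whole, here is a small sketch under stated assumptions. It is not the API of any XML database: a Python list stands in for the document store, the documents and the select helper are hypothetical, and xml.etree.ElementTree supports only a limited subset of XPath (no XQuery or XSLT).

```python
import xml.etree.ElementTree as ET

# Hypothetical stored documents, kept whole instead of split into tables.
# The second one carries an element the first one does not have.
STORED_DOCS = [
    '<trade id="T-1"><party>ACME</party>'
    '<amount currency="USD">1500.00</amount></trade>',
    '<trade id="T-2"><party>Globex</party>'
    '<amount currency="EUR">900.00</amount>'
    '<settlementDate>2015-03-01</settlementDate></trade>',
]


def select(docs, path):
    """Evaluate a (limited) XPath expression against every stored document."""
    results = []
    for text in docs:
        root = ET.fromstring(text)
        results.extend(node.text for node in root.findall(path))
    return results


# Query by attribute value, with no relational schema involved.
usd_amounts = select(STORED_DOCS, "./amount[@currency='USD']")

# A field that only newer documents carry is queryable without any migration.
settlement_dates = select(STORED_DOCS, "./settlementDate")

print(usd_amounts, settlement_dates)
```

Because the documents are stored as received, a new element such as settlementDate becomes queryable as soon as it appears, and older documents simply return no match for it.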

There are two flavors of XML databases: (post-)relational XML DBs (extensions to existing RDBMSs from Oracle, IBM DB2, MS SQL, etc.) and native XML DBs. Interest in the second type has been increasing in recent years. The best-known products in this area are Sedna, BaseX, eXist, Zorba, and MarkLogic.