First published on Tuesday, 29 March 2011

Warning: This blog post is more than 11 years old – read and use with care.

Generating XML schemas from XML data

Some time ago I published a tool on GitHub which allows you to generate (or learn) XML schemas from XML data. You provide the tool with a set of XML files and get a nice, human-readable XML schema (XSD, DTD, …) from that.

I developed that tool while writing my diploma thesis on "Algorithmic learning of XML Schema definitions from XML data". The thesis itself contains the theoretical background for schema learning, which is quite interesting (to me, at least). One of the biggest issues existing tools (like trang) struggle with is that learning human-readable regular expressions for the child patterns is not trivial. Some interesting new algorithms have recently become available which make it possible to learn sane regular expressions; they are used in the diploma thesis and by my tool.

What do "regular expressions" have to do with XML schemas, you ask? Regular expressions do not only work on bytes (or UTF-8 characters in PCRE), but also on other things, like XML elements. In DTDs, for example, you specify which elements may occur in another element using regular expressions:

<!ELEMENT dl (dt|dd)+>

Here the regular expression (dt|dd)+ describes the elements which may occur directly in dl. A regular expression like (dt, dd*)+ would, for example, mean that dl contains one or more dt elements, each followed by any number of dd elements.
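To illustrate the idea that a content model is just a regular expression over element names, here is a minimal Python sketch. It is not part of the schema learner; the helper names content_model_to_regex and matches are made up for this example, and it only handles element-only content models (no #PCDATA, EMPTY or ANY). Each element name becomes a regex group that matches the name followed by a space, and the DTD sequence operator "," becomes plain concatenation.

```python
import re

# Element names as they appear in a DTD content model (simplified).
NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_to_regex(model):
    """Translate an element-only DTD content model into a Python regex.

    The regex works on a string of child element names, each followed
    by a single space, e.g. "dt dd dd ".
    """
    compact = "".join(model.split())  # drop all whitespace
    # Every element name becomes a group matching "name ".
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # "," means sequence in a DTD; in a regex that is just concatenation.
    pattern = pattern.replace(",", "")
    return re.compile(pattern)


def matches(model, children):
    """Check whether a list of child element names satisfies the model."""
    subject = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(subject) is not None


print(matches("(dt|dd)+", ["dd", "dt", "dd"]))          # True
print(matches("(dt, dd*)+", ["dt", "dd", "dd", "dt"]))  # True
print(matches("(dt, dd*)+", ["dd", "dt"]))              # False
```

The last call fails because (dt, dd*)+ requires every group of dd elements to be introduced by a dt element.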

Using the tool

You can get the "XML schema learner" from my GitHub account: just clone it, and you can run the tests or use the learn command to infer XML schemas from XML data.

So let's see how we can generate a DTD from an example XML file:

    <shop>
        <sale>
            <item id="23">
                <name>Some stuff</name>
                <price currency="EUR">23.42</price>
            </item>
            <item id="42">
                <name>Some other stuff</name>
                <price currency="EUR">42.23</price>
            </item>
        </sale>
        <stock>
            <item id="23">
                <amount>456</amount>
            </item>
            <item id="42">
                <amount>123</amount>
            </item>
        </stock>
    </shop>

To get a DTD for that we can use the learn command provided by the tool, which has several command line options you can learn about by typing learn --help. By default the tool returns a DTD for a set of provided XML files (which can be just one). So, for the XML above, you will get the following:

    $ ./learn examples/multitype.xml
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT price (#PCDATA)>
    <!ELEMENT item ( ( amount | ( name, price ) ) )>
    <!ELEMENT sale ( item* )>
    <!ELEMENT amount (#PCDATA)>
    <!ELEMENT stock ( item* )>
    <!ELEMENT shop ( ( sale, stock ) )>
    <!ATTLIST price currency CDATA #REQUIRED>
    <!ATTLIST item id CDATA #REQUIRED>

This is a human-readable DTD for the XML above. For such trivial cases every available tool out there will provide you with good results. But be assured that this tool "always" manages to produce human-readable regular expressions for the child patterns.
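To get a feeling for what a learner has to work with: in a DTD the content model of an element depends only on its name, so the raw material is the set of child sequences observed for each element name. The following Python sketch collects exactly that for the shop example. It is an illustration only, not the tool's implementation, and child_sequences is a name I made up here; turning the collected sequences into a readable regular expression is the hard part the learner solves.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SHOP_XML = """
<shop>
  <sale>
    <item id="23"><name>Some stuff</name><price currency="EUR">23.42</price></item>
    <item id="42"><name>Some other stuff</name><price currency="EUR">42.23</price></item>
  </sale>
  <stock>
    <item id="23"><amount>456</amount></item>
    <item id="42"><amount>123</amount></item>
  </stock>
</shop>
"""


def child_sequences(xml_text):
    """Map each element name to the set of child name sequences seen for it."""
    root = ET.fromstring(xml_text)
    samples = defaultdict(set)
    for element in root.iter():  # the root itself and all descendants
        samples[element.tag].add(tuple(child.tag for child in element))
    return dict(samples)


for name, sequences in sorted(child_sequences(SHOP_XML).items()):
    print(name, sorted(sequences))
```

For item this yields the two sequences (amount) and (name, price), which is why the learned DTD has to allow both alternatives for every item element.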

From DTDs to XML Schema

The difference between DTD and XML Schema is not just syntax. XML Schema has a richer syntax for regular expressions, but most importantly it has a different typing mechanism, which exceeds the capabilities of DTD. In XML Schema, elements with the same name can have different types if they are located at different places in your XML tree.

See the <item> element above, which differs depending on the parent element. With XML Schema you can use two different types for that, which makes your schema a lot more specific. Additionally, you can reuse the same type for elements with different names, so that, for example, <price> and <amount> could both refer to a type number.

The "XML schema learner" can learn schemas using the semantics of DTD and just format them as XML Schema, but it can also learn full-blown XML Schema definitions. Here it gets a bit more complicated, and this is what my thesis was actually about.

Deciding whether two slightly different types in different locations of the XML tree should be considered one type or two is not easy. Since you seldom have XML data expressing all allowed variants of your "virtual" schema, you might not want to be too strict about that.
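To show why strictness matters, here is a small Python sketch of two ways to compare types. Both functions are hypothetical and written only for this post; they are not the comparators from the thesis or the tool. A type is reduced to the set of child element names it allows, and the third type (with a description child) is invented sample data.

```python
def same_children(a, b):
    """Strict comparison: merge only if both types allow the same children."""
    return a == b


def similar_children(a, b, threshold=0.5):
    """Lenient comparison: merge if enough child names are shared.

    Uses the share of common names among all names (Jaccard index);
    the threshold of 0.5 is an arbitrary choice for this example.
    """
    union = a | b
    if not union:
        return True  # two types without children are trivially similar
    return len(a & b) / len(union) >= threshold


sale_item = frozenset({"name", "price"})
stock_item = frozenset({"amount"})
# Hypothetical third type, not part of the example file:
described_item = frozenset({"name", "price", "description"})

print(same_children(sale_item, described_item))     # False
print(similar_children(sale_item, described_item))  # True
print(similar_children(sale_item, stock_item))      # False
```

The strict variant would keep sale_item and described_item apart just because one sample happened to lack a description, while the lenient one merges them and still keeps stock_item separate.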

There is no sane default, though, which is why the tool offers you several ways to configure the locality (how many parent elements should be taken into account to potentially tell different types apart) and different comparators for merging the types. For the simple example above just setting the locality to 1 works well, and results in a more specific schema, since the item types do not occur anywhere else in the tree and thus do not need merging at all:

    $ ./learn -t xsd --locality 1 examples/multitype.xml
    <?xml version="1.0"?>
    <schema xmlns="">
        <!-- ... -->
        <complexType name="sale/item">
            <sequence>
                <element name="name" type="string"/>
                <element name="price" type="item/price"/>
            </sequence>
            <attribute name="id" type="string" use="required"/>
        </complexType>
        <complexType name="stock/item">
            <element name="amount" type="string"/>
            <attribute name="id" type="string" use="required"/>
        </complexType>
        <!-- ... -->
    </schema>

As you can see, two different types have been learned for the two different definitions of the <item> element.
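The effect of the locality setting can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not the tool's code: types_by_context is a made-up name, and I assume that a locality of n means "element name plus up to n parent names", which matches the type names sale/item and stock/item in the output above. The XML is an abbreviated version of the example file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SHOP_XML = """
<shop>
  <sale>
    <item id="23"><name>Some stuff</name><price currency="EUR">23.42</price></item>
  </sale>
  <stock>
    <item id="23"><amount>456</amount></item>
  </stock>
</shop>
"""


def types_by_context(xml_text, locality):
    """Group observed child sequences by element name plus `locality` parents."""
    samples = defaultdict(set)

    def visit(element, ancestors):
        path = ancestors + (element.tag,)
        # Keep the element name and at most `locality` parent names.
        key = "/".join(path[-(locality + 1):])
        samples[key].add(tuple(child.tag for child in element))
        for child in element:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return dict(samples)


print(sorted(types_by_context(SHOP_XML, 0)))
print(sorted(types_by_context(SHOP_XML, 1)))
```

With a locality of 0 there is a single candidate type item which has to cover both variants, just like in the DTD. With a locality of 1 the candidates sale/item and stock/item are kept apart, and each of them has seen only one child sequence.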

To learn more about the comparators and how they affect the schema learning process, please read my thesis, which describes and compares them quite in depth.

I hope this tool is useful for you, especially if you are not already using schema definitions for your XML data. XML data without a schema a developer can refer to is not much better than just binary data, in my honest opinion. (The same is true for JSON btw. :)