Kore Nordmann
~~~~~~~~~~~~~
:Author: Kore Nordmann
:Date: Tue, 29 Mar 2011 12:20:26 +0200
:Revision: 1
:Copyright: CC by-sa
====================================
Generating XML schemas from XML data
====================================
:Description:
    Some time ago I published a tool on GitHub which allows you to generate
    (or learn) XML schemas from XML data. You provide the tool with a set of
    XML files and get a nice, human-readable XML schema (XSD, DTD, …) in
    return. Read more for the details.
Some time ago I published a tool on GitHub which allows you to generate (or
learn) XML schemas from XML data. You provide the tool with a set of XML files
and get a nice, human-readable XML schema (XSD, DTD, …) in return.

I developed the tool while writing my diploma thesis on `"Algorithmic learning
of XML Schema definitions from XML data"`__. The thesis__ itself contains the
theoretical background for schema learning, which is quite interesting (to me,
at least). One of the biggest issues existing tools (like trang__) struggle
with is that learning human-readable regular expressions for the child
patterns is not trivial. Some interesting new algorithms have recently become
available which make it possible to learn sane regular expressions. They are
used in the diploma thesis and by my tool.

What do "regular expressions" have to do with XML schemas, you ask? Regular
expressions do not only work on bytes (or UTF-8 characters in PCRE), but also
on other things, like XML elements. In DTDs, for example, you specify which
elements may occur inside another element using regular expressions. A
declaration like ``<!ELEMENT dl (dt|dd)+>`` uses the regular expression
``(dt|dd)+`` for the elements which may occur directly in ``dl``. A regular
expression like ``(dt, dd*)+`` would instead mean that there must be one or
more ``dt`` elements, each followed by any number of ``dd`` elements.
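
To make the idea of regular expressions over elements a bit more tangible,
here is a minimal Python sketch. It is neither part of the tool nor the way a
DTD validator works internally, and the helper names
``content_model_to_regex`` and ``matches`` are made up for this illustration.
It assumes a simple content model (element names, ``,``, ``|``, ``*``, ``+``,
``?`` and parentheses only) and translates it into an ordinary regular
expression over a string of space-terminated element names:

.. code-block:: python

    import re

    # Matches one element name inside a content model (simplified).
    NAME = re.compile(r"[A-Za-z_][\w.-]*")


    def content_model_to_regex(model):
        """Translate a simple DTD content model into a compiled regex.

        Hypothetical helper: every element name becomes the token "name ",
        and the DTD sequence operator "," becomes plain concatenation.
        """
        compact = re.sub(r"\s+", "", model)
        with_tokens = NAME.sub(
            lambda m: "(?:" + re.escape(m.group(0)) + " )", compact
        )
        return re.compile(with_tokens.replace(",", ""))


    def matches(model, children):
        """Check a list of child element names against a content model."""
        sequence = "".join(name + " " for name in children)
        return content_model_to_regex(model).fullmatch(sequence) is not None


    print(matches("(dt|dd)+", ["dt", "dd", "dd"]))          # True
    print(matches("(dt, dd*)+", ["dt", "dd", "dd", "dt"]))  # True
    print(matches("(dt, dd*)+", ["dd", "dt"]))              # False

The trailing space after each name keeps element names which are prefixes of
each other apart. Mixed content (``#PCDATA``) is not handled in this sketch.
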
__ /talks/11_03_learning_xml_schema_definitions_from_xml_data.pdf
__ /talks/11_03_learning_xml_schema_definitions_from_xml_data.pdf
__ http://www.thaiopensource.com/relaxng/trang.html
Using the tool
==============
You can get the `"XML schema learner"`__ from my GitHub account:
https://github.com/kore/XML-Schema-learner. Just clone it, and you can run the
tests or use the ``learn`` command to infer XML schemas from XML data.
So let's see how we can generate a DTD from an example XML file::

    -
    Some stuff
    23.42
    -
    Some other stuff
    42.23
    -
    456
    -
    123
To get a DTD for that, we can use the ``learn`` command provided by the tool.
It has several command line options, which you can learn about by typing
``learn --help``. By default the tool returns a DTD for a set of provided XML
files (which can be just one). So, for the XML above, you run the following::

    $ ./learn examples/multitype.xml

The result is a human-readable DTD schema for the XML above. For such trivial
cases every available tool out there will provide you with good results. But
be assured that this tool "always" manages to produce human-readable regular
expressions for the child patterns.
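
The first step of any such learner is rather mundane: collect, for each
element name, the child sequences which actually occur in the data. The
following Python sketch shows that step together with a deliberately naive
generalization. It is an illustration under my own assumptions, not the
algorithm used by the tool or described in the thesis, and the function names
``collect_child_sequences`` and ``naive_content_model`` are hypothetical:

.. code-block:: python

    import xml.etree.ElementTree as ET
    from collections import defaultdict


    def collect_child_sequences(xml_documents):
        """Map each element name to the set of child name sequences seen."""
        sequences = defaultdict(set)
        for document in xml_documents:
            root = ET.fromstring(document)
            for element in root.iter():
                children = tuple(child.tag for child in element)
                sequences[element.tag].add(children)
        return dict(sequences)


    def naive_content_model(sequences):
        """Generalize to "any of the seen children, in any order".

        This accepts far more than the data suggests, which is exactly why
        learning precise and readable expressions is the hard part.
        """
        names = sorted({name for sequence in sequences for name in sequence})
        if not names:
            return None  # no child elements were observed
        return "(" + "|".join(names) + ")*"


    documents = [
        "<dl><dt>a</dt><dd>b</dd><dd>c</dd></dl>",
        "<dl><dt>a</dt></dl>",
    ]
    seen = collect_child_sequences(documents)
    print(seen["dl"])                       # the two observed sequences
    print(naive_content_model(seen["dl"]))  # (dd|dt)*

For the two sample documents the naive model ``(dd|dt)*`` also allows an empty
``dl`` or a ``dd`` without a preceding ``dt``, although neither occurs in the
data. A stricter expression such as ``(dt, dd*)+`` would fit the samples
better.
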
__ https://github.com/kore/XML-Schema-learner
From DTDs to XML Schema
=======================
The difference between DTD and XML Schema is not just syntax. XML Schema has a
richer syntax for regular expressions, but most importantly it has a different
typing mechanism which exceeds the capabilities of DTD. In XML Schema,
elements with the same name can have different types if they are located at
different places in your XML tree.

See the ``<item>`` element above, which differs depending on the parent
element. With XML Schema you can use two different types for it, which makes
your schema a lot more specific. Additionally, you can reuse the same type for
elements with different names, so that, for example, two differently named
elements could both refer to a type ``number``.

The "XML schema learner" can now learn schemas using the semantics of DTD and
just format them as XML Schema, but it can also learn full-blown XML Schema
definitions. Here it gets a bit more complicated, though, and this is what my
thesis was actually about.
Deciding whether two slightly different types in different locations of the
XML tree should be considered one type or two is not easy. Since you seldom
have XML data expressing all allowed variants of your "virtual" schema, you
might not want to be too strict about that.

There is no sane default, though, which is why the tool offers you several ways
to configure the locality (how many parent elements should be taken into
account to potentially tell different types apart) and different comparators
for merging the types. For the simple example above, just setting the locality
to 1 works well and results in a more specific schema, since the item types do
not occur anywhere else in the tree and thus do not need merging at all::

    $ ./learn -t xsd --locality 1 examples/multitype.xml

With this setting, two different types are learned for the two different
definitions of the ``<item>`` element.
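
The effect of the locality can be sketched in a few lines of Python. This is
only my reading of "how many parent elements are taken into account" and not
the tool's implementation. The function ``collect_types`` and the sample
document, including all of its element names, are invented for this example
and are not the contents of ``examples/multitype.xml``:

.. code-block:: python

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # Hypothetical sample document, made up for this sketch.
    SAMPLE = """
    <shop>
      <products>
        <item><name>Some stuff</name><price>23.42</price></item>
      </products>
      <ids>
        <item>456</item>
      </ids>
    </shop>
    """


    def collect_types(document, locality):
        """Group child sequences by element name plus `locality` ancestors."""
        types = defaultdict(set)

        def visit(element, ancestors):
            # Keep only the nearest `locality` ancestors as context.
            context = ancestors[-locality:] if locality > 0 else ()
            key = context + (element.tag,)
            types[key].add(tuple(child.tag for child in element))
            for child in element:
                visit(child, ancestors + (element.tag,))

        visit(ET.fromstring(document), ())
        return dict(types)


    # Locality 0 behaves like a DTD: one type per element name.
    print(collect_types(SAMPLE, 0)[("item",)])
    # Locality 1 tells the two kinds of "item" apart by their parent.
    print(collect_types(SAMPLE, 1)[("products", "item")])
    print(collect_types(SAMPLE, 1)[("ids", "item")])

With locality 0 both kinds of ``item`` end up in a single type which has to
cover both child patterns. With locality 1 they stay separate, and deciding
whether two such types should be merged again is where the comparators come
in.
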
To learn more about the comparators and how they affect the schema learning
process, please read my `thesis`__, which describes and compares them in
depth.

I hope this tool is useful for you - especially if you are not already using
schema definitions for your XML data. XML data without a schema a developer can
refer to is not much better than just binary data, in my honest opinion. (The
same is true for JSON, btw. :)
__ /talks/11_03_learning_xml_schema_definitions_from_xml_data.pdf