The long way to a semantic web - Kore Nordmann

The long way to a semantic web

The problem with HTML and XHTML is obvious and well known to everybody familiar with web application development: neither offers proper semantic markup for website content, nor does either offer an advanced layout language.

For several web projects I looked for a way to use semantic markup for my content, so that it is accessible to others and even to software, without losing the usual possibilities for laying out my website as the designer requested. So far I have not found a solution, but let's go into a bit more detail about what the problems are...

Semantic markup

The problem has been described by a lot of people and does not need much explanation, but let me try a short summary.

XML, being an "extensible markup language", defines the markup of some content. Let's ignore that HTML is not XML while XHTML is; the differences are not relevant here.

A markup language combines text and extra information about the text. The extra information, for example about the text's structure or presentation, is expressed using markup, which is intermingled with the primary text. (http://en.wikipedia.org/wiki/Markup_language)

With XHTML the markup still defines both structure and presentation, which is not what you want if you aim for easy access by programs and effective layout of your website. As a matter of fact, XHTML does not solve either of those two problems properly. XHTML may have some more, and more clearly defined, structures than HTML, and it is often used in a better way. But this does not solve the problem; it only reduces it a little bit.

Semantic markup in XHTML

There are very basic structures for semantic markup, like lists, which you define using <ul>, or headlines and paragraphs, using <h1> to <h6> and <p>. The same goes for tables, etc. You all know this kind of stuff.

A lot of things are still missing in the current specification, even though some of them are proposed for XHTML 2.0, like <nl> for navigation lists, one of the most common things on any website. The same goes for breadcrumbs, articles, abstracts, etc.

Dublin Core and RDF are ways to include a lot more information in your document and to tag your content in various ways, but you are still bound to the broken XHTML.
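One common way to carry Dublin Core metadata inside XHTML is through <meta> elements in the document head. The following is only a sketch of that convention; the author and license values are borrowed from the blog example further down, and the body content is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Some random thoughts</title>
    <!-- Declares the "DC" prefix used in the meta names below -->
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"/>
    <meta name="DC.creator" content="Kore Nordmann"/>
    <meta name="DC.rights" content="CC by-sa"/>
  </head>
  <body>
    <!-- The content itself is still plain XHTML markup -->
    <p>...</p>
  </body>
</html>
```

The metadata becomes machine-readable, but the content in the body gains no additional semantics from it.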

Real semantic markup

Real semantic markup would start by defining a custom schema that fits your web application, or by reusing a predefined one, such as one for weblogs or for a project site. Such markup would only care about proper semantics and may also include and reuse existing namespaces like Dublin Core and RDF.

A very simple example for a blog site could look like:

<?xml version="1.0" encoding="UTF-8"?>
<blog>
  <title>Some random thoughts</title>
  <!-- ... -->
  <posts>
    <post>
      <title>The long way to a semantic web</title>
      <description>...</description>
      <tags>
        <tag>XML</tag>
        <tag>PHP</tag>
        <!-- ... -->
      </tags>
      <comments>
        <!-- ... -->
      </comments>
    </post>
    <!-- ... -->
  </posts>
</blog>

This includes just the content, without any definition of how to lay it out. You should of course define a schema when using such custom XML, so that other developers are really able to read and reuse your XML.
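A minimal sketch of such a schema could look like the following. It is written in W3C XML Schema, which is an assumption, since no particular schema language is named here. It only covers the elements visible in the blog example; whatever is elided there with comments is either left open or simply not modelled.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Root element: a blog has a title and a list of posts -->
  <xs:element name="blog">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="posts">
          <xs:complexType>
            <xs:sequence>
              <xs:element ref="post" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- A single post with title, description, tags and optional comments -->
  <xs:element name="post">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="description" type="xs:string"/>
        <xs:element name="tags">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="tag" type="xs:string"
                          minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <!-- The structure of comments is not shown in the example,
             so any child elements are accepted here -->
        <xs:element name="comments" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:any processContents="lax"
                      minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Attributes from foreign namespaces, such as XLink, would need additional declarations that this sketch leaves out.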

Reusing other namespaces

As mentioned above, there are a lot of namespaces you may reuse here, because they already define everything you need and other applications may already handle them correctly.

XLink

XLink is used to provide links between resources, like <a href=""/> in XHTML. This may be reused in the elements <blog> and <post>.

<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:xlink="http://www.w3.org/1999/xlink">
  <posts>
    <post xlink:href="/blog/the_long_long_way_to_semantic_web.txt"
          xlink:type="simple">
      <!-- ... -->
    </post>
  </posts>
</blog>

As you may guess from xlink:type="simple", there are also more complex link types, but again those do not matter here.

Dublin Core

Dublin Core is one of the XML vocabularies for metadata. You may use it to declare something as simple as the author and license of the content, but there are also some more advanced features. Integrating it into the blog example, we just stay with the mentioned author and license...

<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Some random thoughts</title>
  <dc:creator>Kore Nordmann</dc:creator>
  <dc:rights>CC by-sa</dc:rights>
</blog>

Generate the output

To display such XML structures on the web you have the following three choices.

Formatting with CSS

You may just format the XML defined above using CSS, by referencing a stylesheet at the top of your document.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/blog.css"?>

But this very simple approach has several drawbacks.

  1. The content may not have the same order in the source as it should have in the output. Imagine that your blog posts are sorted chronologically, but you only want to show the latest ten blog posts with the newest on top...

  2. The capabilities of CSS are quite limited.

  3. This hurts especially because a properly marked-up website XML description normally contains only a small number of elements.

Transform to HTML using XSLT

XSLT offers a quite easy way to transform arbitrary XML into some other language, which may be XML, HTML or something completely different. Transforming all the content to HTML leaves you exactly where HTML websites are today, with all the layout capabilities.
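As a rough sketch, an XSLT 1.0 stylesheet for the blog example might look like this. It also addresses the ordering problem from the CSS approach by sorting the posts and keeping only the newest ten. The <published> element is an assumption: the example XML does not contain a date, so the sketch pretends every post carries an ISO 8601 date (such as 2007-01-31) in a child element of that name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" encoding="UTF-8"/>

  <xsl:template match="/blog">
    <html>
      <head>
        <title><xsl:value-of select="title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="title"/></h1>
        <!-- Newest posts first; ISO 8601 dates sort correctly as text.
             "published" is a hypothetical element, see above. -->
        <xsl:for-each select="posts/post">
          <xsl:sort select="published" order="descending"/>
          <!-- position() refers to the sorted order here -->
          <xsl:if test="position() &lt;= 10">
            <h2><xsl:value-of select="title"/></h2>
            <p><xsl:value-of select="description"/></p>
          </xsl:if>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

The source XML stays purely semantic; the decision about order and number of posts lives in the stylesheet.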

But what do you gain? On the one hand you still have the semantic markup in your source view, as long as the browser does the transformation and you don't process the XSLT on the server side (which may have some hidden surprises for you ;). On the other hand you still have to cope with the pseudo-semantic markup of HTML, where you of course shouldn't use tables for design etc.
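Letting the browser do the transformation works much like the CSS case: the XML document references the stylesheet in a processing instruction. In this sketch the path /blog.xsl is made up, and text/xsl is used as the type because that is the value browsers commonly accept.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/blog.xsl"?>
<blog>
  <title>Some random thoughts</title>
  <posts>
    <post>
      <title>The long way to a semantic web</title>
      <description>...</description>
    </post>
  </posts>
</blog>
```

Anyone requesting the document, including other software, still receives the semantic XML shown here rather than the generated HTML.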

This is nothing I want to do, and I am quite sure that I am not alone with this feeling. Even though the user might not notice that this markup is HTML, it is just the wrong markup language for the job, because it uses some random mix of semantic and structural markup. It is on its way to becoming more and more semantic, which makes it more and more useless for this task, even though I welcome this development.

XSLT to transform XML to XML

Sounds useless? Think of some very custom XML markup just for layout. You may think of SVG, but I mean something which can cope with text better than SVG can ;). You could think of something like the Glade files used for GTK, or some GUI structures from some random language poured into XML.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/blog.css"?>
<layout>
  <verticalGrid>
    <grid>$title</grid>
    <grid>
      <tree>
        <!-- ... -->
      </tree>
    </grid>
  </verticalGrid>
</layout>

Just a very simple example to extend in your mind by yourself :). Each user could define his custom semantic markup and use the layout definition language he likes best. This would enable you to create really nice interfaces: you could embed SVG elements and format everything else with CSS.
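To make the idea a bit more concrete, here is a sketch of an XSLT 1.0 stylesheet that translates the semantic blog markup into the layout markup from the example. The layout vocabulary is as hypothetical as in that example, and the <item> element inside <tree> is invented purely for illustration, since the contents of the tree are not specified.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/blog">
    <layout>
      <verticalGrid>
        <!-- Fills the $title placeholder with the blog title -->
        <grid><xsl:value-of select="title"/></grid>
        <grid>
          <tree>
            <!-- One entry per post; "item" is a made-up element name -->
            <xsl:for-each select="posts/post">
              <item><xsl:value-of select="title"/></item>
            </xsl:for-each>
          </tree>
        </grid>
      </verticalGrid>
    </layout>
  </xsl:template>

</xsl:stylesheet>
```

The semantic document and the layout document stay two separate vocabularies, with the stylesheet as the only place that knows about both.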

Browser support

There are two basic things required for user interaction on websites: links and forms.

So why am I not using this for my website?

Browser support

So there are four features the main browsers would need to support to make this usable. Now the sad part starts.

Engine             XML + CSS  XSLT  Links  Forms
Gecko              YES        YES   YES*   YES
Opera              YES        YES   NO     YES
KHtml              YES        YES   NO     YES
Internet Explorer  YES        YES   NO     NO

* The support is limited to a subset of the XLink specification, but at least simple XLinks are supported, which work like the links known from HTML.

Sadly, links using XLink are not clickable in any browser but the ones using the Gecko engine. There is a workaround for Opera, which seems not to work when the href attribute is in a namespace.

For links you could import the <a> element from XHTML, but this again would be an ugly hack.
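A sketch of what that hack might look like: the XHTML namespace is bound to a prefix (html is an arbitrary choice) and its a element is mixed into the custom markup. The link target is taken from the XLink example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:html="http://www.w3.org/1999/xhtml">
  <posts>
    <post>
      <!-- An XHTML anchor inside the custom vocabulary -->
      <html:a href="/blog/the_long_long_way_to_semantic_web.txt">
        The long way to a semantic web
      </html:a>
    </post>
  </posts>
</blog>
```

The custom markup is then no longer independent of XHTML, which is exactly the dependency this approach tries to avoid.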

I did not expect Microsoft Internet Explorer to support those standards, but I am really disappointed that Opera and KHtml/WebKit do not support something as simple as XLink for links in elements other than the standard XHTML a element. Not supporting Internet Explorer would not matter for my personal site, but excluding the other two engines would really hurt.

Conclusion

The discussion about XHTML 2.0 and HTML 5.0 could be skipped if browsers just supported existing standards like XLink. You could define your custom semantic and structural markup, translate between them using XSLT, and no longer have any dependencies on crappy markup languages like HTML or XHTML.

Together with CSS 3 and the implementation of new features like calc(), this could be the first time I would be happy with (semantic) markup in web environments.
