The long way to a semantic web

First published at Wednesday, 29 August 2007

Warning: This blog post is more than 18 years old – read and use with care.

Semantic markup
Generate the output
Browser support
- Browser support
Conclusion
Comments

The problem with HTML and XHTML is obvious and known to everybody familiar with web application development: neither offers proper semantic markup for website content, nor an advanced layout language.

For several web projects I looked for a way to use semantic markup for my content, so that it is accessible to other people and even to software, without losing the usual possibilities for laying out my website as the designer requested. So far I have not found a solution, but let's go into a bit of detail about what the problems are...

Semantic markup

The problem has been described by many people and does not need much explanation, but let me try a short summary.

XML, being an "extensible markup language", defines the markup of some content. Let's ignore that HTML is not XML while XHTML is; the differences are not relevant here.

A markup language combines text and extra information about the text. The extra information, for example about the text's structure or presentation, is expressed using markup, which is intermingled with the primary text. (http://en.wikipedia.org/wiki/Markup_language)

With XHTML the markup still defines both structure and presentation, which is not what you want if you are trying to offer easy access for programs and effective layout for your website. As a matter of fact, XHTML solves neither of these two problems properly. XHTML may have a few more, and more clearly defined, structures than HTML, and it is often used in a better way. But this does not solve the problem; it only reduces it a little.

Semantic markup in XHTML

There are very basic structures for semantic markup: lists, which you define using <ul>, headlines and paragraphs, using <h[1-6]> and <p>, tables, and so on. You all know this kind of stuff.

Although proposed for XHTML 2.0, a lot of things are still missing from the current specification, like <nl> for navigation lists, one of the most common things on every website. The same goes for breadcrumbs, articles, abstracts, etc.

Dublin Core and RDF make it possible to include a lot more information in your document and to tag your content in various ways, but you are still bound to the broken XHTML.

Real semantic markup

Real semantic markup would start by defining a custom schema fitting your web application, or by reusing a predefined one, such as one for weblogs or for a project site. Such a markup would only care about proper semantics, and may also include and reuse existing namespaces like Dublin Core and RDF.

A very simple example for a blog site could look like:

<?xml version="1.0" encoding="UTF-8"?>
<blog>
    <title>Some random thoughts</title>
    <!-- ... -->

    <posts>
        <post>
            <title>The long way to a semantic web</title>
            <description>...</description>

            <tags>
                <tag>XML</tag>
                <tag>PHP</tag>
                <!-- ... -->
            </tags>

            <comments>
                <!-- ... -->
            </comments>
        </post>
        <!-- ... -->
    </posts>
</blog>

This just includes the content, without any definition of how to lay it out. You should of course define a schema when using such custom XML, so that other developers are really able to read and reuse your XML.
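To illustrate what "accessible by software" means here, the following is a minimal sketch of a program reading the blog document above. It uses Python's standard xml.etree.ElementTree module; the function name summarize_posts is made up for this example, and the parts elided with comments in the example are simply left out.

```python
import xml.etree.ElementTree as ET

# The blog document from the example above, with the elided parts left out.
BLOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<blog>
    <title>Some random thoughts</title>
    <posts>
        <post>
            <title>The long way to a semantic web</title>
            <description>...</description>
            <tags>
                <tag>XML</tag>
                <tag>PHP</tag>
            </tags>
            <comments/>
        </post>
    </posts>
</blog>
"""


def summarize_posts(xml_text):
    """Return a (title, [tags]) pair for every <post> in the blog document.

    Hypothetical helper, only meant to show how little code a consumer needs
    when the markup describes nothing but the content.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    summary = []
    for post in root.findall("./posts/post"):
        title = post.findtext("title", default="")
        tags = [tag.text for tag in post.findall("./tags/tag")]
        summary.append((title, tags))
    return summary


print(summarize_posts(BLOG_XML))
```

No scraping heuristics are needed, because the element names already say what each piece of content is.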

Reusing other namespaces

As mentioned above, there are a lot of namespaces you may reuse here, because they already define everything you need and other applications may already handle them correctly.

XLink

XLink is used to provide links between resources, like <a href=""/> in XHTML. This may be reused in the elements <blog> and <post>.

<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:xlink="http://www.w3.org/1999/xlink">
    <posts>
        <post
            xlink:href="/blog/the_long_long_way_to_semantic_web.txt"
            xlink:type="simple">
            <!-- ... -->
        </post>
    </posts>
</blog>

As you may guess from xlink:type="simple", there are also more complex link types, but again those do not matter here.
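A client that understands the XLink namespace can find such links on any element, not just on one dedicated link element. Here is a small sketch of that idea, again with Python's standard ElementTree; the helper name simple_links is an assumption for illustration, and only simple links are handled.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# The XLink example from above, with the elided comment left out.
LINKED_BLOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:xlink="http://www.w3.org/1999/xlink">
    <posts>
        <post
            xlink:href="/blog/the_long_long_way_to_semantic_web.txt"
            xlink:type="simple">
        </post>
    </posts>
</blog>
"""


def simple_links(xml_text):
    """Collect (element name, href) for every element carrying a simple XLink.

    Hypothetical helper; extended links and the other XLink attributes
    are ignored in this sketch.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    links = []
    for element in root.iter():
        # ElementTree expands prefixed attributes to "{namespace}name".
        if element.get(f"{{{XLINK_NS}}}type") == "simple":
            href = element.get(f"{{{XLINK_NS}}}href")
            if href is not None:
                links.append((element.tag, href))
    return links


print(simple_links(LINKED_BLOG_XML))
```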

Dublin Core

Dublin Core is one of the XML vocabularies for metadata. You may use it to declare something as simple as the author and license of the content, but there are also some more advanced features. For the blog example we just stay with the mentioned author and license...

<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:dc="http://purl.org/dc/elements/1.1/">
    <title>Some random thoughts</title>
    <dc:creator>Kore Nordmann</dc:creator>
    <dc:rights>CC by-sa</dc:rights>
</blog>
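Because the metadata lives in a well-known namespace, a generic tool can read it without knowing anything else about the blog schema. A minimal sketch, assuming Python's standard ElementTree and a made-up helper name read_metadata:

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used for lookups; the prefix is only a local shorthand.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

DC_BLOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns:dc="http://purl.org/dc/elements/1.1/">
    <title>Some random thoughts</title>
    <dc:creator>Kore Nordmann</dc:creator>
    <dc:rights>CC by-sa</dc:rights>
</blog>
"""


def read_metadata(xml_text):
    """Return the blog title plus the Dublin Core creator and rights.

    Hypothetical helper for illustration; a missing element yields None.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        "title": root.findtext("title"),
        "creator": root.findtext("dc:creator", namespaces=NAMESPACES),
        "rights": root.findtext("dc:rights", namespaces=NAMESPACES),
    }


print(read_metadata(DC_BLOG_XML))
```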

Generate the output

To display such XML structures on the web you have the following three choices.

Formatting with CSS

You may just format the XML defined above using CSS, by referencing a stylesheet at the top of your document.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/blog.css"?>

But this very simple approach has several drawbacks.

The content may not be in the order it should have in the output. Imagine that your blog posts are sorted chronologically, but you only want to show the latest ten posts, with the newest on top...
The capabilities of CSS are quite limited, especially given the low number of elements you normally have in your website's XML description when using proper markup.
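The first drawback is easy to show in code: selecting and reordering is a job for a transformation step, not for a stylesheet. The sketch below does it with Python's standard ElementTree. Note that the date attribute on <post> is an assumption made for this example (the blog XML above does not define one), as are the helper names.

```python
import xml.etree.ElementTree as ET


def latest_posts(blog, limit=10):
    """Return the newest `limit` posts, newest first.

    Assumes every <post> carries a hypothetical ISO 8601 `date` attribute,
    which is not part of the example schema shown earlier.
    """
    posts = blog.findall("./posts/post")
    # ISO 8601 dates sort correctly as plain strings.
    posts.sort(key=lambda post: post.get("date", ""), reverse=True)
    return posts[:limit]


def build_sample_blog(count):
    """Build a blog with `count` posts in chronological order (sample data)."""
    blog = ET.Element("blog")
    posts = ET.SubElement(blog, "posts")
    for day in range(1, count + 1):
        post = ET.SubElement(posts, "post", {"date": f"2007-08-{day:02d}"})
        ET.SubElement(post, "title").text = f"Post {day}"
    return blog


sample = build_sample_blog(12)
newest = latest_posts(sample)
print([post.findtext("title") for post in newest])
```

CSS can hide or restyle elements, but it has no notion of "sort by date and keep the first ten", which is why this step has to happen before the styling.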

Transform to HTML using XSLT

XSLT offers a quite easy way to transform arbitrary XML into some other language, which may be XML, HTML or something completely different. Transforming all the content to HTML leaves you exactly where HTML websites are today, with all the layout capabilities.

But what did you gain? On the one hand, you still have the semantic markup in your source view, provided the browser does the transformation and you don't process the XSLT on the server side (which may have some hidden surprises for you ;). On the other hand, you still have to cope with the pseudo-semantic markup of HTML, where you of course shouldn't use tables for design etc.

This is nothing I want to do, and I am quite sure that I am not alone with this feeling. Even though the user might not notice that this markup is HTML, it is just the wrong markup language for the job, because it uses some random mix of semantic and structural markup. It is on its way to becoming more and more semantic, which makes it more and more useless for this task, even though I support this development.
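To make the transformation step concrete, here is a rough sketch of the mapping such a stylesheet would perform. It is not XSLT: Python's standard library has no XSLT processor, so plain ElementTree code stands in for it. The choice of h1, h2, p and ul as targets is my assumption for this example, as is the function name blog_to_html.

```python
import xml.etree.ElementTree as ET


def blog_to_html(blog):
    """Map the custom blog markup onto HTML elements.

    A stand-in for an XSLT stylesheet; the target elements are arbitrary
    choices made for this sketch.
    """
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = blog.findtext("title", default="")
    for post in blog.findall("./posts/post"):
        ET.SubElement(body, "h2").text = post.findtext("title", default="")
        ET.SubElement(body, "p").text = post.findtext("description", default="")
        tags = post.findall("./tags/tag")
        if tags:
            tag_list = ET.SubElement(body, "ul")
            for tag in tags:
                ET.SubElement(tag_list, "li").text = tag.text
    return html


blog = ET.fromstring(
    "<blog><title>Some random thoughts</title><posts><post>"
    "<title>The long way to a semantic web</title>"
    "<description>...</description>"
    "<tags><tag>XML</tag><tag>PHP</tag></tags>"
    "</post></posts></blog>"
)
html_text = ET.tostring(blog_to_html(blog), encoding="unicode")
print(html_text)
```

The output also shows the loss described above: after the transformation a tag is just another li, and nothing in the HTML says that it is a tag.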

XSLT to transform XML to XML

Sounds useless? Think of some very custom XML markup just for layout. You may think of SVG, but I mean something which can cope with text better than SVG can ;). You could think of something like the Glade files used for GTK, or the GUI structures from some random language poured into XML.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/blog.css"?>
<layout>
    <verticalGrid>
        <grid>$title</grid>
        <grid>
            <tree>
                <!-- ... -->
            </tree>
        </grid>
    </verticalGrid>
</layout>

This is just a very simple example for you to extend in your mind :). Each user could define their own custom semantic markup and use the layout definition language they like best. This would enable you to create really nice interfaces: you could embed SVG elements and format everything else with CSS.
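How a placeholder like $title gets its value is left open above. One possible reading, sketched here under that assumption, is that the transformation step replaces each $name with content taken from the semantic document. The helper name fill_layout and the substitution rule are made up for this example.

```python
import xml.etree.ElementTree as ET

# The layout example from above, with the elided tree content left out.
LAYOUT_XML = """<layout>
    <verticalGrid>
        <grid>$title</grid>
        <grid>
            <tree/>
        </grid>
    </verticalGrid>
</layout>
"""


def fill_layout(layout_root, values):
    """Replace "$name" placeholders in element text with values[name].

    Hypothetical substitution rule: only elements whose whole text is a
    single placeholder are touched; unknown placeholders are left as-is.
    """
    for element in layout_root.iter():
        text = (element.text or "").strip()
        if text.startswith("$") and text[1:] in values:
            element.text = values[text[1:]]
    return layout_root


filled = fill_layout(ET.fromstring(LAYOUT_XML), {"title": "Some random thoughts"})
print(ET.tostring(filled, encoding="unicode"))
```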

Browser support

There are two basic things required for user interaction on websites.

Links
As shown above, the correct way to embed links in your website is to use the XLink namespace to define links for any of the elements.
Forms
If users should be able to submit content to your website, you need forms. As long as XForms is not available, all you get are the good old forms from HTML / XHTML. XHTML is XML, so you may also include them in your website using proper namespace definitions.

So why am I not using this for my website?

Browser support

So there are four features the main browsers would need to support for this to be usable. Now the sad part starts.

	XML + CSS	XSLT	Links	Forms
Gecko	YES	YES	YES*	YES
Opera	YES	YES	NO	YES
KHtml	YES	YES	NO	YES
Internet Explorer	YES	YES	NO	NO

* The support is limited to a subset of the XLink specification, but at least simple XLinks are supported, which work like the links known from HTML.

Sadly, links using XLink are not clickable in any browser except those using the Gecko engine. There is a workaround for Opera, which seems not to work when the href attribute is in a namespace.

For links you could import the <a> element from XHTML, but this again would be an ugly hack.
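For completeness, this is roughly what that hack would look like when generating the document: the a element is created in the XHTML namespace inside the custom markup. A sketch with Python's standard ElementTree; the xhtml prefix and the helper name add_xhtml_link are my choices for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Ask ElementTree to serialize the namespace with a readable prefix.
ET.register_namespace("xhtml", XHTML_NS)


def add_xhtml_link(parent, href, label):
    """Append an XHTML <a> element to a custom element (hypothetical helper)."""
    anchor = ET.SubElement(parent, f"{{{XHTML_NS}}}a", {"href": href})
    anchor.text = label
    return anchor


post = ET.Element("post")
add_xhtml_link(
    post,
    "/blog/the_long_long_way_to_semantic_web.txt",
    "The long way to a semantic web",
)
post_xml = ET.tostring(post, encoding="unicode")
print(post_xml)
```

The link target now sits on an extra child element rather than on the post itself, which is exactly the kind of presentation-driven markup the custom schema was meant to avoid.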

I did not expect Microsoft Internet Explorer to support those standards, but I am really disappointed that Opera and KHTML/WebKit do not support something as simple as XLink for links on elements other than the standard XHTML a element. Not supporting Internet Explorer would not matter for my personal site, but excluding the other two engines would really hurt.

Conclusion

The discussion about XHTML 2.0 and HTML 5.0 could be skipped if browsers just supported existing standards like XLink. You could define your own semantic and structural markup, translate between them using XSLT and no longer have any dependencies on crappy markup languages like HTML or XHTML.

Together with CSS 3 and the implementation of new features like calc(), this could be the first time I would be happy with the (semantic) markup in web environments.

Comments

Keith Alexander at Thursday, 30.8. 2007

Hi, RDF is more than a namespace to drop into XML documents - it's a uniform structure for data (triples). Likewise, the idea behind the semantic web is more than a web of semantically-marked up documents - it's a web of interlinked data.

You can describe the semantics of your data in a separate RDF file that your html can link to. (Have a look at the SIOC [http://sioc-project.org/] RDF Vocabulary for describing blogs and other types of web sites).

But there are also techniques for marking up (valid) web pages such that RDF can be extracted from them. GRDDL [http://www.w3.org/TR/grddl/] is a standard for linking to optional XSL transformations of your custom markup to RDF. RDFa is a nascent standard for adding attributes to XHTML in such a way as to express RDF statements within the web page. (However, I'd wait a little while until the specification is stable).

I use (for example, see my blog's pages) a syntax called eRDF [http://getsemantic.com/wiki/ERDF], which just uses existing HTML attributes to express RDF statements (RDFa uses new attributes). I use a @profile attribute to point to the GRDDL transformation into RDF. So you can extract the data from my web pages by piping them through a web service like http://triplr.org/. A benefit of this method (of using existing html attributes) is that I can hang CSS and javascript on these semantic hooks and browsers all understand it.

Kore at Thursday, 30.8. 2007

@Keith: Thanks for the links.

I have been a bit unclear about Dublin Core and RDF, because they should not be main topic of the blog post. Thanks for the clarifications on RDF.

I know what RDF basically does, I will come back to this soon in a completely different blog post when it comes to topologies and ontologies in web applications, where RDF and Tagging get relevant.

Martin Fjordvald at Thursday, 30.8. 2007

I know this is a bit off-topic but it still touches it somewhat. (sorry :P)

The problem with the ability to manipulate the content to such an extent is that the average user will be much better equipped to get content exactly as they see fit, which will naturally mean that advertisement will be left out.

Basically what this is, is the ideal web for academia, you have content semantically represented and styles can easily be swapped with minimum user effort.

Also, it's the exact opposite of what internet entrepreneurs want. A large part of internet content is funded by advertisements, as you are of course aware of, (not suggesting otherwise) which means that a lot of interesting content would disappear if removing the ads become too easy. Firefox extensions such as Adblock and, perhaps even worse, Greasemonkey enables the average user to remove it, semantic markup just makes it even easier.

It's a battle of two opposing sides, perhaps that would even explain why the internet is as messy as it currently is.

Idetrorce at Saturday, 15.12. 2007

very interesting, but I don't agree with you Idetrorce
