Current state of the Semantic Web

First published at Wednesday, 22 October 2008

Warning: This blog post is more than 17 years old; read and use with care.

Introduction

The term semantic web describes the vision that the information currently available online will become available to software for analysis, understanding and relating of contents. This paper offers an introduction to the current state of the semantic web and its ongoing development.

Semantic web

Semantic web generally means associating semantics with the data that can be found on the Internet. The Internet itself is considered a graph of linked documents (data). Both the documents and the links may have associated semantics.

For now, the documents found on the Internet are mostly HTML documents relating to each other by URLs, also known as links, which describe a document's location. HTML allows such links to be embedded into documents to relate them to other documents. Technologies for embedding semantic information into HTML and for adding semantic information to the relations are described in further detail in this paper.

Semantics

Semantics generally means the association of meaning with structures. The word semantics originates [1] from Greek σημαντικός (semantikos), "significant" [2], which originates from σημαίνω (semaino), "to signify, to indicate", and that from σήμα (sema), "sign, mark, token". [3]

In computer science semantics describes the meaning of a word of some language. Different words of the same language or words from different languages may have the same semantics.

An example of this are the two simple mathematical expressions (1 + 2) and +(1 2), which use different notations but have the same intuitive semantics: the addition of 1 and 2. Only the syntax differs: the first expression uses infix notation, while the second one uses prefix notation (Polish notation).
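The difference between syntax and semantics can be sketched in a few lines of Python. This is only a minimal illustration: the function names eval_infix and eval_prefix are made up for this example, and both evaluators only handle flat expressions with a single kind of operator.

```python
import operator

# Supported operators; an assumption for this sketch, not a complete grammar.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_infix(tokens):
    """Evaluate a flat infix expression such as ['1', '+', '2'] left to right."""
    result = int(tokens[0])
    for i in range(1, len(tokens), 2):
        result = OPS[tokens[i]](result, int(tokens[i + 1]))
    return result

def eval_prefix(tokens):
    """Evaluate a flat prefix expression such as ['+', '1', '2']."""
    op, *args = tokens
    result = int(args[0])
    for arg in args[1:]:
        result = OPS[op](result, int(arg))
    return result

# Two different syntaxes, one meaning.
infix_value = eval_infix("1 + 2".split())
prefix_value = eval_prefix("+ 1 2".split())
```

Both calls yield the same value, although the token order differs.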

It is not clearly defined what the term "meaning" comprises; interpretations differ between semanticists. For some it is only the denotation of the given word, for others it also includes the connotation associated with the word.

Denotation and connotation

The connotation of a term commonly describes the associations the reader has with the term, while the denotation is the actual meaning of the term.

The denotation can be defined intensionally or extensionally. An intensional denotation describes the term by the properties of the described object; the term "man", for example, may be described by "walks on two legs" and "mammal". An extensional denotation describes a term by the set of objects it refers to. The term "man" would then denote all men who currently live on earth, will live in the future or have ever lived on earth.
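In programming terms, an intensional definition resembles a predicate, while an extensional definition resembles an enumerated set. The following sketch assumes a deliberately tiny world; the Being class and the three individuals are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Being:
    name: str
    legs: int
    mammal: bool

# A tiny, made-up universe of objects.
alice = Being("Alice", legs=2, mammal=True)
rex = Being("Rex", legs=4, mammal=True)
tweety = Being("Tweety", legs=2, mammal=False)

def is_man_intensional(being):
    """Intensional: the term is defined by the properties of the object."""
    return being.legs == 2 and being.mammal

# Extensional: the term is defined by listing every object it refers to.
MEN = frozenset({alice})

def is_man_extensional(being):
    return being in MEN
```

The predicate also covers objects nobody has listed yet, while the set has to be extended for every new member.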

W3C - Semantic Web

The W3C defines the "Semantic Web" as a common framework to describe the meaning of data, which in particular means the semantic technologies developed and defined by the W3C. Some of the technologies defined as part of this framework, [4] like RDF, SPARQL and OWL, are described in further detail in this paper.

History

Tim Berners-Lee created the first version of HTML [5] in 1989 in Geneva. It focused solely on very simple text markup and knew basic formatting like headers, internal and external links and listings. The semantics were focused on the context required for documents published at CERN, which includes elements like <address> for all kinds of addresses, an element that is still part of HTML today.

When the Internet gained popularity, more elements were required to make more and different contents accessible to the viewers of websites. The introduced elements still focused on describing the rendering of the contents instead of their semantics, in order to create a graphically more appealing presentation for the users. The focus was neither on accessibility nor on providing mechanisms for software to understand or relate contents.

Web applications gained more and more popularity, while HTML was used purely as the presentation layer. The application generating the HTML knew about the semantics and relations of the data. Only the human user could recognize the semantics, from the formatting of the contents and from the textual contents themselves.

Goal

The plain presentation of data was fine until the amount of data exceeded the dimensions a human being can easily explore. Search engines like AltaVista, Yahoo! and Google appeared, which tried to index the data provided in billions of documents and to build relations between these documents. The heuristic metrics used by the search engines improved, but they are still far from being able to answer complex questions like:

Which are the capital cities of African countries?

The aim of the Semantic Web is to make the available data more easily accessible to software, especially to search engines, in order to provide the user with answers to such questions.

Current state

The latest versions and drafts of XHTML still focus on the semantic application of documents, and by default offer elements for the distinction of elements common in document markup, like headers, lists, images and tables.

XHTML still contains various elements like <div> and <span> which, by definition, do not have any semantic meaning associated and whose sole purpose is to carry rendering directives for the client.

When limiting yourself to the semantic subset of XHTML and using the elements the way they are intended to be used, XHTML may work as a semantic markup language for documents, as shown in the following example:

<html>
 <head>
  <title>Semantic Web</title>
  <meta name="author" content="Kore Nordmann" />
 </head>
 <body>
  <h1>Introduction</h1>
  <p>The term <em>semantic web</em> describes the vision,
  that the information, ...</p>
 </body>
</html>
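When the markup is used this way, software can read the document structure directly from the element names. A minimal sketch using Python's standard library XML parser, assuming the document is well-formed XML like the example above:

```python
import xml.etree.ElementTree as ET

# The example document from above, embedded so the sketch is self-contained.
DOCUMENT = """<html>
 <head>
  <title>Semantic Web</title>
  <meta name="author" content="Kore Nordmann" />
 </head>
 <body>
  <h1>Introduction</h1>
  <p>The term <em>semantic web</em> describes the vision,
  that the information, ...</p>
 </body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Each piece of information is found by its semantic element, not by its style.
title = root.findtext("head/title")
author = root.find("head/meta[@name='author']").get("content")
headings = [heading.text for heading in root.iter("h1")]
emphasized = [em.text for em in root.iter("em")]
```

This only works as long as authors actually use <h1> for headings and <em> for emphasis, which is exactly the limitation discussed next.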

Due to weaknesses in the rendering engines of the clients, elements were often misused to accomplish a desired style of the website. For example, tables are often used like the grid layout elements of GUI definition languages, i.e. to place contents at the desired positions on a page. [6]

Visual semantics

For human beings, who just see the representation rendered by the client (web browser), the semantics of elements are often inferred from the style of the element. For example, a short, big, bold text before several lines of smaller text is mostly considered a heading, no matter which kind of markup is used for it. Because many creators of documents just target human beings, they only care about the representation. This results in element misuse like <a class="title"/>. In this case the link element may be styled as a title, while the class attribute cannot have any special meaning to software, at least in the context of XHTML. As there is no specification for such class values, they may even be abbreviations or words in different languages.

Because of the misuse of elements, software reading websites can never be sure whether an (X)HTML element is used because of its semantic meaning or just because of the assigned styles. As a consequence, techniques were developed to recover the document semantics from the associated styles, as described in "HTML page analysis based on visual cues" by Yudong Yang et al. [7]

Development

XHTML 2 aimed to provide a more document-centric format by adding <section> elements for better structuring of document contents and by omitting purely presentational markup. This move has been countered by the (X)HTML 5 initiative, which explicitly affirms the presentational character of (X)HTML and comments on XHTML 2: [8]

XHTML2 defines a new HTML vocabulary with better features for hyperlinks, multimedia content, annotating document edits, rich meta data, declarative interactive forms, and describing the semantics of human literary works such as poems and scientific papers.
However, it lacks elements to express the semantics of many of the non-document types of content often seen on the Web. For instance, forum sites, auction sites, search engines, online shops, and the like, do not fit the document metaphor well, and are not covered by XHTML2.

Ian Hickson, HTML 5, W3C Working Draft 22 January 2008

To judge this comment correctly one must know that the new elements introduced by HTML 5 do not really aim at providing semantic markup for the mentioned applications. They only provide access to new media types and therefore allow embedding more non-semantic content. [9]

The intention behind this statement was to prevent XHTML from moving towards a semantic document markup language and to maintain its various presentational features.

Including XML based languages

Since XHTML uses XML as its syntax, it is of course possible to embed other languages into XHTML which provide additional semantic markup. This is especially necessary if you want to use markup beyond the purely document-oriented scope of XHTML.

It is quite common to embed the languages mentioned in the following sections into HTML, even though the rendering engines of the clients may again cause problems, as described later in this paper in the section Problems embedding other languages.

Evolving extensions

The already mentioned Semantic Web working group, part of the W3C, develops different technologies to add semantic information to arbitrary documents on the web. Besides those efforts, smaller initiatives have turned up which try to solve the problem with simpler technologies that are more accessible to developers.

One of those emerging technologies is Microformats, which will be discussed as an example before the W3C technologies RDF, RDF schema and OWL.

Microformats

In HTML and CSS the class attribute of HTML elements is often used to apply coherent formatting to a set of elements. The class is not necessarily semantic markup but often an explicit rendering instruction. Even though several semantically different elements may be rendered the same way, it proved useful to use semantic class attribute values to enhance the maintainability of style definitions.

With semantic class attribute values it is possible to switch a website to a new layout in which semantically different elements, which had the same style before, look different. Based on this usage and the rise of tag-based folksonomies, [10] Microformats were specified.

Folksonomies and Tagging

Folksonomy is a portmanteau of the two words folk and taxonomy, which basically means taxonomies [11] created by a user or a group of users. It is considered a synonym for "social tagging" or "collaborative tagging", where a group of users assigns "tags" to objects.

The "tags" are words, often with constraints on the allowed character set, which try to describe the object they were assigned to. In most cases the set of tags which may be assigned to objects is not limited, which leads to synonyms, homonyms and polysemy in the tags' semantics. In particular, tags are often not only used to describe the denotation of an object, but are selected based on the users' connotations.

Composition

Microformats were developed in the context of XHTML and blogs to add semantic markup to frequently recurring elements in those documents, and to offer markup for elements which cannot be specified with XHTML. Calendars and dates are an example of frequently used elements without proper semantic markup in the latest versions of XHTML. A simple example shows the usage of Microformats in such a case:

<div class="vevent">
  <span class="summary">PG 513 Seminarphase</span>:
  <abbr class="dtstart" title="2007-10-05">October 5</abbr>-
  <abbr class="dtend" title="2007-10-20">19</abbr>, at
  <span class="location">Dortmund...</span>
</div>

The example shows that the actual XHTML elements used do not matter; the semantic markup only depends on the given class attributes. XHTML and CSS support any number of class names per class attribute, separated by spaces, so Microformats do not even interfere with the common website markup.
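How a consumer might read such an event can be sketched with Python's standard library. This is not a complete hCalendar parser: the helper names has_class and extract_events are made up for this example, only four properties are handled, and the markup is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div class="vevent">
  <span class="summary">PG 513 Seminarphase</span>:
  <abbr class="dtstart" title="2007-10-05">October 5</abbr>-
  <abbr class="dtend" title="2007-10-20">19</abbr>, at
  <span class="location">Dortmund...</span>
</div>"""

# Properties handled by this sketch; the real standard defines more.
PROPERTIES = ("summary", "dtstart", "dtend", "location")

def has_class(element, name):
    """Check one class name in a space separated class attribute."""
    return name in element.get("class", "").split()

def extract_events(root):
    events = []
    for node in root.iter():
        if not has_class(node, "vevent"):
            continue
        event = {}
        for child in node.iter():
            for prop in PROPERTIES:
                if has_class(child, prop):
                    # <abbr> keeps the machine readable value in its title.
                    event[prop] = child.get("title") or (child.text or "").strip()
        events.append(event)
    return events

events = extract_events(ET.fromstring(SNIPPET))
```

Because has_class splits the attribute value, additional class names used for styling do not disturb the extraction, and the element names are never inspected.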

Defined Standards

The list of applications for Microformat-based semantic markup is rather short at the moment. There is no dedicated organization like the W3C behind the standardization initiative. The defined applications still focus on blogging and related contexts.

hCalendar

Microformat markup used to embed the iCalendar [12] standard inside XML- and XHTML-based languages. The iCalendar format is used as a markup language for distributing calendar events, as known from CalDAV servers.

hCard

Microformat based markup to embed vCard [13] contact information inside XHTML.

rel

Microformat standard using the rel attribute of links in XHTML to define the type of relation between documents. The relations specified so far are "nofollow", "license" and "tag".

nofollow: Not really semantic markup, but still included in the specification. This hint tells search robots how to weight the given link in their PageRank analysis.
license: A reference to the license of the current document.
tag: Defines the linked entity as a tag, with a link to more resources using the same tag.

VoteLinks

Markup for votes, specifying the type of vote (pro, contra) by the user and the resource the user votes on.

XFN

Markup for relations between persons represented by documents on the Internet. Contains various tag-like descriptors for the exact relations between the persons.

XMDP

Document metadata descriptions, like those already part of XHTML headers.

Current Usage

In the blogosphere [14] and related websites Microformats are especially used to define the relations between people. Search engines like Yahoo! use information embedded with the hCard and hCalendar standards to present their users enhanced search results in their SearchMonkey [15] project.

Criticism

The Microformat standard currently has a very limited set of applications, but the Microformats project develops more standards for different applications. [16] The approach is compatible with existing technologies and easy to integrate, because it normally does not interfere with the existing markup.

Microformats do not allow specifying custom ontologies or taxonomies. Until a new standard is defined in the Microformats project for a custom application, there is no way to express the semantics of that application. A tag-based approach used in a way similar to Microformats, with definitions of what the custom tags mean, may work, but does not ensure forward compatibility with future Microformat standards. There is no concept like namespaces for Microformats.

There is no formal definition language for the structures described by Microformats, like the required microformat hierarchy seen in the event example, or required elements. But since they integrate into XML-based languages, they can at least be queried using languages like XPath.
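Such a query can be sketched with the XPath subset supported by Python's standard library. The list of required properties below is an assumption made for this illustration, and the second event is deliberately incomplete. Note that this XPath subset compares the whole attribute value, so elements with several class names would need a more capable XPath engine.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div>
  <div class="vevent">
    <span class="summary">PG 513 Seminarphase</span>
    <abbr class="dtstart" title="2007-10-05">October 5</abbr>
  </div>
  <div class="vevent">
    <span class="location">Dortmund...</span>
  </div>
</div>"""

# Assumed for this sketch: every event needs a summary and a start date.
REQUIRED = ("summary", "dtstart")

root = ET.fromstring(SNIPPET)

# Select all event containers anywhere in the document.
events = root.findall(".//*[@class='vevent']")

def missing_properties(event):
    """Return the required property names which the event does not contain."""
    return [
        name for name in REQUIRED
        if event.find(f".//*[@class='{name}']") is None
    ]

report = [missing_properties(event) for event in events]
```

The check has to live in application code, since nothing in the markup itself declares which properties are required.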

With the usage by search engines like Yahoo! and the ease of integration with existing technologies, some think this technology might be the driving force behind a future Semantic Web. [17]

HTML, CSS, JavaScript and DOM will be the basic content standards in the foreseeable future. I think evolution on the web will be based on these formats, and this is what WHAT and AJAX do. We will also see a bunch of Microformats being developed, and that's how the semantic web will be built, I believe.

Håkon Wium Lie, CTO of Opera Software (2005)

RDF

The Resource Description Framework is used to describe resources by associating metadata with them. For this, triples are defined, which consist of the resource to be described (the subject), a predicate and an object. The predicate defines the type of the description, while the object defines its value. An example:

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="/blog/the_long_way_to_semantic_web.html">
      <dc:creator>Kore Nordmann</dc:creator>
      <dc:title>The long way to a semantic web</dc:title>
      <dc:date>Thu, 25 Oct 2007 11:20:13 +0200</dc:date>
      <dc:rights>CC by-sa</dc:rights>
      <dc:language>en</dc:language>
      <dc:format>text/html</dc:format>
      <dc:description> HTML does not work, neither does XHTML.
        So how can this be solved without waiting years for
        better browsers?
      </dc:description>
    </rdf:Description>
</rdf:RDF>

In this case the resource "/blog/the_long_way_to_semantic_web.html" is described, which can be resolved to an absolute URL based on the location where this definition resides. For this resource a set of predicates is defined, for example "creator" and "title". Those predicates have the respective literal objects "Kore Nordmann" and "The long way to a semantic web". The predicates used in this example belong to the common Dublin Core namespace.
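The triple structure becomes visible when the XML representation is flattened into plain tuples. The following sketch uses only Python's standard library and a shortened copy of the example; it handles just this simple rdf:Description form with literal objects and is not a general RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened copy of the example above.
RDF_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="/blog/the_long_way_to_semantic_web.html">
      <dc:creator>Kore Nordmann</dc:creator>
      <dc:title>The long way to a semantic web</dc:title>
      <dc:language>en</dc:language>
    </rdf:Description>
</rdf:RDF>"""

def triples(rdf_xml):
    """Return (subject, predicate, object) tuples for literal properties."""
    root = ET.fromstring(rdf_xml)
    result = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes tags as "{namespace}local"; joining both
            # parts gives the full predicate URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            result.append((subject, predicate, (prop.text or "").strip()))
    return result

statements = triples(RDF_XML)
```

Every child element of the description becomes one statement about the same subject.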

The XML representation used in this example is only one possible representation of RDF descriptions. It is especially used inside XHTML websites and other XML-based document formats like SVG. Another commonly used notation is the more readable Notation 3 (N3), developed by Tim Berners-Lee. RDF itself is not bound to any representation. An excerpt of the above example in N3 would look like:

@prefix dc: <http://purl.org/dc/elements/1.1/>.
</blog/the_long_way_to_semantic_web.html>
  dc:title "The long way to a semantic web";
  dc:creator "Kore Nordmann".

Dublin Core

Dublin Core is a simple and standardized set of conventions to describe document metadata and similar objects on the Internet. Dublin Core is not only used inside RDF; there is also a set of XHTML meta element keys using the Dublin Core namespace. It was already developed in 1994 and contains a set of 15 core elements, with some additional optional element refinement fields. The Dublin Core standard is maintained by the Dublin Core Metadata Initiative. [18]

RDF Graph

The resources and objects in RDF can be considered nodes, and the predicates edges, in a directed graph. In the graphical representation resources are drawn as ellipses and literal objects as rectangles. The example from above could then look like:

[Figure: RDF graph]

The properties of the RDF graph can themselves be described using RDF, as shown on the Dublin Core definition page at http://purl.org/dc/elements/1.1/. The objects in the RDF graph can themselves be resources, which is also described as reification. Reification basically means the conversion of plain relations into concepts.

In the example above the license could have been described by a resource, which could then be described further by RDF, including statements about the license's properties. Such properties could be, for example, whether the license is considered open source, GPL compatible, etc.
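A rough sketch of this idea: once the license is a resource instead of a literal, it becomes a node with outgoing edges of its own. The license URI and the ex: predicates below are hypothetical names invented for this illustration.

```python
from collections import defaultdict

DOC = "/blog/the_long_way_to_semantic_web.html"
LICENSE = "http://example.org/licenses/cc-by-sa"  # hypothetical URI

# (subject, predicate, object) triples; the ex: predicates are made up.
TRIPLES = [
    (DOC, "dc:creator", "Kore Nordmann"),
    (DOC, "dc:rights", LICENSE),
    (LICENSE, "ex:title", "CC by-sa"),
    (LICENSE, "ex:requiresAttribution", "true"),
]

# Directed graph: every subject maps to its outgoing (predicate, object) edges.
graph = defaultdict(list)
for subject, predicate, obj in TRIPLES:
    graph[subject].append((predicate, obj))

def describe(node, depth=0):
    """List all edges reachable from node, indenting nested resources."""
    lines = []
    for predicate, obj in graph.get(node, []):
        lines.append("  " * depth + f"{predicate} -> {obj}")
        if obj in graph:  # the object is itself a described resource
            lines.extend(describe(obj, depth + 1))
    return lines

description = describe(DOC)
```

Literal objects end a path in the graph, while resources can be followed further.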

RDF query languages

RDF is already used by a lot of web resources on the Internet, like Wikipedia. With this data infrastructure in place, querying such data becomes interesting. Languages similar to the Structured Query Language (SQL) have been developed for querying RDF data, like RDQL and SPARQL. Especially for SPARQL, several implementations already exist in different languages and frameworks. An example query to fetch the titles of all articles created by "Kore Nordmann" could look like:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE {
  ?doc dc:title ?title.
  ?doc dc:creator "Kore Nordmann".
}

The PREFIX declaration defines the ontologies used in the query, similar to XML namespaces with their reference to the used XML schema. The variables after SELECT, which can start with either ? or $, define the values which should be returned from the query. The conditions in the WHERE section have to match the subject (resource), predicate, object structure of RDF and limit the set of returned values. All conditions of the WHERE group must match; more complex query structures using sub conditions, filters and optional matches can be defined as well. [19]
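How such a WHERE group is matched can be sketched with a naive pattern matcher in Python. This is not how a real SPARQL engine is implemented; the function names and the second article are made up for this illustration, and only plain triple patterns are supported.

```python
DC = "http://purl.org/dc/elements/1.1/"

# Small in-memory data set; the second document is hypothetical.
TRIPLES = [
    ("/blog/the_long_way_to_semantic_web.html", DC + "title",
     "The long way to a semantic web"),
    ("/blog/the_long_way_to_semantic_web.html", DC + "creator", "Kore Nordmann"),
    ("/blog/other.html", DC + "title", "Another article"),
    ("/blog/other.html", DC + "creator", "Someone Else"),
]

def is_variable(term):
    return term.startswith(("?", "$"))

def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            # A variable must keep the value it was bound to earlier.
            if binding.setdefault(term[1:], value) != value:
                return None
        elif term != value:
            return None
    return binding

def select(variables, patterns, triples):
    """Return the requested variables for all bindings matching every pattern."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return [tuple(binding[name] for name in variables) for binding in bindings]

titles = select(
    ["title"],
    [("?doc", DC + "title", "?title"),
     ("?doc", DC + "creator", "Kore Nordmann")],
    TRIPLES,
)
```

The shared variable ?doc is what joins the two conditions: a title is only returned if the same document also has the requested creator.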

RDF schema

RDFS, or RDF schema, describes the vocabulary for RDF metadata descriptions. Similar to DTD or XSD for XML, RDFS specifies the terms for shared communication. An example of such a definition is the above mentioned Dublin Core. Such a vocabulary can be called an ontology. With a proper ontology the question about the capital cities of African countries mentioned above could be phrased in SPARQL:

PREFIX abc: <http://example.org/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital.
  ?y abc:countryname ?country.
  ?x abc:isCapitalOf ?y.
  ?y abc:isInContinent abc:africa.
}

Ontologies

Ontologies define sets of concepts and sets of instances of those concepts. This can be compared to classes and their instances in object-oriented software development. Each class should describe the denotation of its instances. There may be inheritance and multiple inheritance between concepts, like the concept "apple" inheriting from the concept "fruit". If the inheritance relation is strictly hierarchical, we talk about taxonomies.

Instances of the concepts, like one given apple "AppleA", may relate to other instances by relations, like: "AppleA" "inHandOf" "John Smith". Relations may exist between instances, between concepts and even between instances and concepts.

Ontologies may additionally contain axioms, defining knowledge which is always true and cannot be deduced from the given concepts. One axiom in the example above could be that instances of the class apple always fall to the ground when they do not relate with "inHandOf" to an instance of the class "human". Ontologies containing axioms are known as high-level ontologies.
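The building blocks named above (concepts, inheritance, instances, relations and axioms) can be sketched with plain Python data structures. The instance "AppleB" and all function names are additions made for this illustration only.

```python
# Concept hierarchy: each concept maps to its direct parent concepts.
PARENTS = {"apple": {"fruit"}, "fruit": set(), "human": set()}

# Instances and the concept they belong to ("AppleB" is made up).
INSTANCE_OF = {"AppleA": "apple", "AppleB": "apple", "John Smith": "human"}

# Relations between instances as (subject, relation, object) triples.
RELATIONS = {("AppleA", "inHandOf", "John Smith")}

def is_a(concept, ancestor):
    """Follow the inheritance relation upwards."""
    if concept == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in PARENTS[concept])

def falls_to_ground(instance):
    """Axiom from the text: an apple not held by a human falls to the ground."""
    if not is_a(INSTANCE_OF[instance], "apple"):
        return False
    held = any(
        subject == instance
        and relation == "inHandOf"
        and is_a(INSTANCE_OF[obj], "human")
        for subject, relation, obj in RELATIONS
    )
    return not held
```

The axiom is the only part that cannot be derived from the concept hierarchy or the relations; it has to be stated separately.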

[Figure: Example ontology]

OWL

OWL, [20] the Web Ontology Language, is a specification by the W3C for defining ontologies with a formal description language. The intention is to describe the terms of an application in a way that lets software agents interpret (understand) the given context.

Like RDF schema, OWL is based on the RDF syntax, but its expressive power goes beyond RDF schema. In addition to RDF and RDF schema it allows expressions known from first-order logic (predicate logic).

Just like RDF schema, OWL reuses the RDF resource, predicate, object triples to relate classes and instances to each other. The RDF schema predicates are still part of the OWL definitions and are extended by several additional predicates.

An incomplete example of a concept which can be described using OWL expressions:

<owl:Class rdf:ID="NonFrenchWine">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#Wine"/>
    <owl:Class>
      <owl:complementOf>
        <owl:Restriction>
          <owl:onProperty rdf:resource="#locatedIn" />
          <owl:hasValue rdf:resource="#FrenchRegion" />
        </owl:Restriction>
      </owl:complementOf>
    </owl:Class>
  </owl:intersectionOf>
</owl:Class>

This defines the concept of non-French wine by intersecting the concept wine with all objects which are not located in France.
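The same class expression can be read as set algebra. The sketch below uses Python sets over a small, made-up list of individuals; it assumes that everything is known about them, which is a simplification, since OWL reasoning works under an open-world assumption.

```python
# Hypothetical individuals and their locations, for illustration only.
LOCATED_IN = {
    "WineA": "FrenchRegion",
    "WineB": "OtherRegion",
    "CheeseC": "FrenchRegion",
}

WINE = {"WineA", "WineB"}
EVERYTHING = set(LOCATED_IN)

# owl:Restriction with owl:hasValue: all things located in a French region.
located_in_french_region = {
    thing for thing, region in LOCATED_IN.items() if region == "FrenchRegion"
}

# owl:complementOf: everything which is not in that set.
not_in_french_region = EVERYTHING - located_in_french_region

# owl:intersectionOf: wines which are also in the complement.
non_french_wine = WINE & not_in_french_region
```

Each OWL construct of the example maps to one set operation, which is why such definitions can be evaluated mechanically.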

Summary

The involved technologies differ in complexity and in usefulness for human beings and computers.

[Figure: Semantic technology relations (not really equidistant)]

While HTML is easy to write and even easy to read for humans, especially if representational directives are assigned to the contents, the complexity of OWL makes it hard to grasp the actual semantics of an instance. On the other hand, OWL is intended to provide enough context to let software perform autonomous reasoning on the database. Because of its complexity it is so far mostly used in academic research and in medical and military applications to model the specific domain. Rule-based semantic reasoners like Bossam exist and may already be used for software reasoning.

Currently the still human-readable semantic specification languages like Microformats and RDF, with some well-known ontologies like Dublin Core, seem to be the most used approaches for intentional semantic markup on the web.

Simplifying the web

Custom XML-based markup languages are a simplification of the above mentioned approaches. Such custom markup specifications can be designed to perfectly match one application. One example is DocBook, [21] an advanced XML-based document markup language with semantic markup for several document elements not included in XHTML.

An example of such a custom markup language for a blog, reusing other already defined markup languages, could look like:

<?xml version="1.0" encoding="UTF-8"?>
<blog xmlns="http://example.org/blog">
  <title>Some random thoughts</title>
  <!-- (RDF metadata)... -->
  <posts>
    <post>
      <title>The long way to a semantic web</title>
      <description>...</description>
      <tags>
        <tag>XML</tag> <tag>Semantics</tag> <!-- ... -->
      </tags>
      <article xmlns="http://docbook.org/ns/docbook">
        <!-- ... -->
      </article>
      <comments>  <!-- ... --> </comments>
    </post>
    <!-- ... -->
  </posts>
</blog>

Such markup is easily interpretable by software which knows the associated schema. It is easy for application developers to write and to develop for. The biggest remaining problem is the maintenance of the associated schema.

Of course this only indirectly implies an ontology, so no automatic software reasoning is possible without additional definitions.
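How software that knows the schema might read such a document can be sketched with Python's standard library. The prefixes blog and db are local names chosen for this sketch, and the document is a shortened copy of the example above.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; only the namespace URIs matter.
NS = {
    "blog": "http://example.org/blog",
    "db": "http://docbook.org/ns/docbook",
}

DOCUMENT = """<blog xmlns="http://example.org/blog">
  <title>Some random thoughts</title>
  <posts>
    <post>
      <title>The long way to a semantic web</title>
      <tags>
        <tag>XML</tag> <tag>Semantics</tag>
      </tags>
      <article xmlns="http://docbook.org/ns/docbook">
        <!-- ... -->
      </article>
    </post>
  </posts>
</blog>"""

root = ET.fromstring(DOCUMENT)

posts = []
for post in root.findall("blog:posts/blog:post", NS):
    posts.append({
        "title": post.findtext("blog:title", namespaces=NS),
        "tags": [tag.text for tag in post.findall("blog:tags/blog:tag", NS)],
        # The embedded article lives in the DocBook namespace.
        "has_docbook_article": post.find("db:article", NS) is not None,
    })
```

The namespaces keep the blog vocabulary and the embedded DocBook vocabulary apart, even though neither uses a prefix in the document itself.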

Problems embedding other languages

Currently there are several major problems with embedding different XML dialects into XHTML. Cascading Style Sheets (CSS), in the state currently supported by the major browsers, are not aware of XML namespaces, since namespace support is only part of CSS 3. This problem can be solved by not using namespace prefixes for any elements, but only declaring default namespaces per XML subtree, like in the last example. Those elements can then be addressed normally by their local names in CSS.

XLink [22] is intended to be used to define links between documents. Such links, embedded inside the documents, are currently not supported by all browsers. [23] So for some browsers it is not possible to define clickable links inside custom XML-based documents.

There still remains the option of using XSLT to transform the used semantic markup language into HTML with all of its representational markup, either on the client or on the server side. In this case HTML would be misused as a plain formatting language, which was never its intention and again means violating a standard.

Synopsis

Currently there is nothing like a semantic web, either because of the complexity of approaches like RDF schema and OWL, or because of the limitations imposed by Microformats and especially HTML.

Simple and easy definitions of markup languages, without dependencies on languages like XSLT and HTML, could make it easier for software to extract data from the web. With optionally associated ontologies, for example defined using OWL, even software reasoning as used in artificial intelligence could be developed for those documents.