:Author: Kore Nordmann
:Date: Fri, 31 Oct 2008 16:31:52 +0100
Current state of the Semantic Web
For university I was asked to summarize the current state of the
Semantic Web, covering existing and emerging technologies. There was a quite
tight limit on the amount of content, but the result is a good overview of
this topic with a slight focus on computer science.
.. contents:: Table of Contents
The term **semantic web** describes the vision that the information currently
available online will become available to software for analysis,
understanding and relating of contents. This paper offers an introduction to
the current state of the semantic web and its ongoing development.
Semantic web generally means associating semantics with the data that
can be found on the Internet. The Internet itself is considered a
graph of linked documents (data). Both the documents and the links may
have associated semantics.
For now, the documents that can be found on the Internet are mostly
HTML, relating to each other by URLs, also known as links, which describe
a document's location. HTML allows embedding such links into
documents to relate them to other documents. Technologies for embedding
semantic information into HTML and for adding semantic information
to the relations are described in further detail in this paper.
Semantics generally means the association of meaning with structures. The word
semantics originates [#]_ from Greek σημαντικός (semantikos), "significant",
[#]_ which originates from σημαίνω (semaino), "to signify, to indicate", and
that from σήμα (sema), "sign, mark, token". [#]_
In computer science semantics describes the meaning of a word of some
language. Different words of the same language or words from different
languages may have the same semantics.
An example of this are the two simple mathematical expressions ``(1 + 2)``
and ``+(1 2)``, which use different notations but have the same intuitive
semantics, the addition of 1 and 2. Only the syntax differs: the first
expression uses infix notation, while the second one uses prefix notation.
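The point that two notations can share one meaning can be sketched in a few
lines of Python. This is only an illustration under simple assumptions: both
helper functions (``eval_infix`` and ``eval_prefix``) are hypothetical names,
and they handle nothing beyond a single binary operation on integers.

```python
import operator

# One shared operation table: the semantics live here, not in the syntax.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def tokens(expression):
    """Strip the surrounding parentheses and split on whitespace."""
    return expression.strip().strip("()").split()


def eval_infix(expression):
    """Evaluate a binary infix expression such as "(1 + 2)"."""
    left, op, right = tokens(expression)
    return OPS[op](int(left), int(right))


def eval_prefix(expression):
    """Evaluate a binary prefix expression such as "+(1 2)"."""
    expression = expression.strip()
    # The operator comes first, the parenthesized operands follow.
    op, operands = expression[0], tokens(expression[1:])
    left, right = operands
    return OPS[op](int(left), int(right))


print(eval_infix("(1 + 2)"), eval_prefix("+(1 2)"))
```

Both parsers differ, but they end up calling the same entry of ``OPS``, which
is where the shared semantics of the two expressions is located.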
It is not clearly defined what the term "meaning" constitutes; its
interpretation differs between semanticists. For some it is only the
denotation of the given word, for others it also includes the connotation
associated with the word.
Denotation and connotation
The connotation of a term commonly describes what the reader associates with
the term, while the denotation means the actual meaning of the term.
The denotation can be defined intensionally or extensionally. An intensional
denotation describes the term by the properties of the described object; the
term "man", for example, may be described by "walks on two legs" and "mammal".
An extensional denotation describes a term by the set of objects the term
refers to. The term "man" would then describe all men who currently live on
earth, will live in the future or have ever lived on earth.
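The difference between the two kinds of denotation can be sketched in Python
as a predicate versus an enumerated set. All names here (``Being``, ``john``,
``rex``) are made up for illustration, and the two properties are the
deliberately naive ones from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Being:
    name: str
    legs: int
    mammal: bool


def is_man_intensional(being):
    """Intensional: membership follows from the properties of the object."""
    return being.legs == 2 and being.mammal


john = Being("John", legs=2, mammal=True)
rex = Being("Rex", legs=4, mammal=True)

# Extensional: the term is defined by listing every object it refers to.
MEN_EXTENSIONAL = frozenset({john})


def is_man_extensional(being):
    """Extensional: membership is a lookup in the enumerated set."""
    return being in MEN_EXTENSIONAL


print(is_man_intensional(john), is_man_extensional(rex))
```

The intensional variant also accepts objects nobody has listed yet, while the
extensional variant only knows the enumerated ones.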
W3C - Semantic Web
The W3C defines the "Semantic Web" as a common framework to describe the
meaning of data, which in particular means the semantic technologies developed
and defined by the W3C. Some of the technologies defined as part of this
framework [#]_, such as RDF and SPARQL, are described in further detail in
this paper.
Tim Berners-Lee created the first version of HTML [#]_ in 1989 in Geneva. It
focused solely on very simple text markup and knew basic formatting like
headers, internal and external links and lists. The semantics were focused
on the context required for documents published at CERN, which includes
elements, for example for all kinds of addresses, that are deprecated in HTML
today.
When the Internet gained popularity, more elements were required to make more
and different kinds of content accessible to the viewers of websites. The
introduced elements still focused on describing the rendering of the contents
rather than the semantics associated with them, to create a graphically more
appealing presentation for users. The focus was neither on accessibility nor
on providing software with mechanisms to understand or relate the contents.
Web applications gained more and more popularity, while HTML was used as the
pure presentation layer. The application generating the HTML knew about the
data semantics and relations. Only the human user could recognize the semantic
meaning from the formatting of the contents and the textual contents
themselves.
The plain presentation of data was fine until the amount of data exceeded the
dimensions easily discoverable by a human being. Search engines like
AltaVista, Yahoo! and Google appeared, which tried to index the data provided
in billions of documents and to build relations between these documents. The
heuristic metrics used by the search engines improved, but are still far from
being able to answer complex questions, like:
Which are the capital cities of African countries?
The aim of the Semantic Web is to make the available data more easily
accessible to software, especially to search engines, in order to provide the
user with answers to such questions.
The latest versions and drafts of XHTML still focus on the semantic markup
of documents, and by default offer elements to distinguish structures
common in document markup, like headers, lists, images and tables.
XHTML still contains various elements which, by definition, do not have any
associated semantic meaning, like ``<div>`` and ``<span>``, and whose sole
purpose is to be used for the definition of rendering directives for the
client.
When limiting yourself to the semantic subset of XHTML, using the elements in
the way they are intended to be used, XHTML may work as a semantic markup
language for documents, as shown in the following example:
The term semantic web describes the vision,
that the information, ...
Due to weaknesses in the rendering engines of the clients, elements were often
misused to accomplish a desired style of the website. For example, tables are
often used like grid layout elements in GUI application definition languages,
e.g. to place contents at the desired positions on a page. [#]_
For human beings, who just see the representation rendered by the client (web
browser), the semantics of elements are often inferred from the style of the
element. For example, a short, big, bold text before several lines of smaller
text is mostly considered a caption, no matter which kind of markup is used
for it. Because many creators of documents target only human beings, they care
only about the representation. This results in element misuse, such as a link
element that carries a class attribute naming it a title. In this case the
link element may be laid out as a title, while the class attribute cannot have
any special meaning to the software, at least in the context of XHTML. As
there is no specification for such class values, they may even be
abbreviations or written in different languages.
Because of the misuse of elements, software reading websites can never
be sure whether an (X)HTML element is used because of its semantic meaning
or just because of the assigned styles. As a consequence, techniques
were developed to infer the document semantics back from the associated
styles, as described in "HTML page analysis based on visual
cues" by Yudong Yang et al. [#]_
XHTML 2 aimed to provide a more document centric format by adding elements for
a better structure of document contents and omitting purely presentational
markup. This move has been countered by the (X)HTML 5 initiative, which
explicitly affirms the presentational character of (X)HTML by commenting on
XHTML 2: [#]_
XHTML2 defines a new HTML vocabulary with better features for hyperlinks,
multimedia content, annotating document edits, rich meta data, declarative
interactive forms, and describing the semantics of human literary works
such as poems and scientific papers.
However, it lacks elements to express the semantics of many of the
non-document types of content often seen on the Web. For instance, forum
sites, auction sites, search engines, online shops, and the like, do not
fit the document metaphor well, and are not covered by XHTML2.
-- Ian Hickson, HTML 5, W3C Working Draft 22 January 2008
To judge this comment correctly one must know that the new elements introduced
by HTML 5 do not really aim at providing semantic markup for the mentioned
applications, but only provide access to new media types and therefore allow
embedding more non-semantic content. [#]_
The intention behind this statement was to prevent XHTML from becoming a
semantic document markup language and to maintain the various presentational
features. (X)HTML 5 itself does not provide any semantic markup for the
mentioned applications, but just adds better support for various media formats.
Including XML based languages
Since XHTML uses XML as its syntax, it is of course possible to embed other
languages into XHTML that provide additional semantic markup. This is
especially necessary if you want to use markup beyond the purely document
oriented scope of XHTML.
It is quite common to embed the languages mentioned in the following section
in HTML, even though the rendering engines of the clients may again cause
problems, as described later in this paper in the section `Problems embedding
other languages`_.
The already mentioned Semantic Web working group, part of the W3C, develops
different technologies to add semantic information to arbitrary documents on
the web. Besides those efforts, smaller initiatives turned up which try to
solve the problem with simpler technologies to make it more accessible.
One of those emerging technologies is Microformats, which will be discussed
as an example before the W3C technologies RDF, RDF schema and OWL are
described.
In HTML and CSS the class attribute of HTML elements is often used to apply a
coherent formatting to a set of elements. The class is not necessarily
semantic markup, but often consists of explicit representational rendering
instructions. Even though several semantically different elements may be
rendered the same way, it proved useful to use semantic class attribute values
to enhance the maintainability of style definitions.
With semantic class attribute values it is possible to switch a website to a
new layout in which semantically different elements, which had the same style
before, look different. Based on this usage and the rise of tag based
folksonomies [#]_, Microformats were developed.
Folksonomies and Tagging
Folksonomy is a portmanteau of the two words folk and taxonomy, which
basically means taxonomies [#]_ created by a user, or a group of users. It is
considered as a synonym for "Social tagging" or "Collaborative Tagging", where
a group of users assign "tags" to objects.
The "tags" are words, often with constraints on the used character set, which
try to describe the object they were assigned to. In most cases there is no
limited set of tags which may be assigned to objects, which leads to
synonyms, homonyms and polysemy in the semantics of the tags. In particular,
tags are often used not only to describe the denotation of an object, but are
selected based on the users' connotations.
Microformats were developed in the context of XHTML and blogs to add semantic
markup to frequently recurring elements in those documents, and to offer
markup for elements which cannot be specified with XHTML. One frequently used
kind of element without proper semantic markup in the latest versions of XHTML
is calendars and dates. A simple example shows the usage of Microformats in
such a case:
PG 513 Seminraphase
The example shows that the actual XHTML elements used do not matter; the
semantic markup depends only on the given class attributes. XHTML and CSS
support any number of class names per element, separated by spaces, so they
won't even interfere with the common website markup.
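How software might read such class based markup can be sketched with Python's
standard ``xml.etree.ElementTree`` module. The XHTML fragment below is a
hypothetical example, not the markup of this article; the class names
``vevent``, ``summary`` and ``dtstart`` are taken from hCalendar, and the
function name ``extract_events`` is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: presentational classes ("box", "headline") are
# mixed with hCalendar classes ("vevent", "summary", "dtstart").
XHTML = """
<div class="box vevent">
  <h3 class="headline summary">Example seminar</h3>
  <abbr class="dtstart" title="2008-10-31">October 31st</abbr>
</div>
"""

WANTED = ("summary", "dtstart")


def extract_events(xhtml):
    """Collect hCalendar properties, ignoring element names entirely."""
    root = ET.fromstring(xhtml)
    events = []
    for element in root.iter():
        # The class attribute is a space separated list of names.
        if "vevent" not in element.get("class", "").split():
            continue
        event = {}
        for child in element.iter():
            for name in child.get("class", "").split():
                if name in WANTED:
                    # Prefer the machine readable title attribute, if any.
                    event[name] = child.get("title") or (child.text or "").strip()
        events.append(event)
    return events


print(extract_events(XHTML))
```

The element names (``div``, ``h3``, ``abbr``) are never inspected, and the
additional presentational class names do not disturb the extraction.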
The list of applications for Microformat based semantic markup is rather short
at the current state. There is no dedicated organization, like the W3C, behind
the standardization initiative. The defined applications still focus on
blogging and nearby contexts.
- hCalendar: Microformat markup used to embed the iCalendar [#]_ standard
  inside XML and XHTML based languages. The iCalendar format is used as a
  markup language for distributing calendar events, as known from CalDAV
  servers.

- hCard: Microformat based markup to embed vCard [#]_ contact information
  inside documents.

- A Microformat standard using the rel attribute of links in XHTML to define
  the type of relation between documents. The relations specified so far are
  "nofollow", "license" and "tag":

  - "nofollow": Not really semantic markup, but still included in the
    specification. This instruction specifies how search robots should weight
    the given link in their PageRank analysis.

  - "license": A reference to the license of the current document.

  - "tag": Defines the linked entity as a tag with a link to more resources
    using the same tag.

- Markup for votes, specifying the type of vote (pro, contra) by the user and
  the resource the user votes on.

- Markup for relations between persons represented by documents on the
  Internet. Contains various tag-like descriptors for the exact relation
  between the persons.

- Document metadata descriptions, like those already part of XHTML headers.
In the blogosphere [#]_ and related websites Microformats are especially used
to define the relations between people. Search engines like Yahoo! use
information embedded with the hCard and hCalendar standards to present their
users enhanced search results in their SearchMonkey [#]_ project.
The Microformat standard currently has a very limited set of applications, but
the Microformats project develops more standards for different
applications. [#]_ The approach is compatible with existing technologies and
easy to integrate, because it normally does not interfere with the existing
markup.
Microformats do not allow specifying custom ontologies or taxonomies. Until
a new standard is defined in the Microformats project for a custom
application, there is no way to express the semantics of that application. A
tag based approach, used in a way similar to Microformats, with definitions of
what the custom tags mean, may work, but does not ensure forward compatibility
with future Microformat standards. There is no concept like namespaces for
Microformats.
There is no formal definition language for structures described by
Microformats, like a required microformat hierarchy as seen in the event
example, or required elements. But since they integrate in XML based languages
they could at least be queried using languages like XPath.
With the usage by search engines like Yahoo! and the ease of integration with
existing technologies, some think this technology might be the driving force
behind a future Semantic Web. [#]_
in the foreseeable future. I think evolution on the web will be based
on these formats, and this is what WHAT and AJAX do. We will also
see a bunch of Microformats being developed, and that's how the semantic
web will be built, I believe.
-- Håkon Wium Lie, CTO of Opera Software (2005)
The Resource Description Framework (RDF) is used to describe resources by
associating metadata with them. For this, triples are defined, each of which
consists of the resource to be described, a predicate and an object. The
predicate defines the type of description, while the object defines the value
of the description. An example::
The long way to a semantic web
Thu, 25 Oct 2007 11:20:13 +0200
HTML does not work, neither does XHTML.
So how can this be solved without waiting years for
In this case the resource "/blog/the_long_way_to_semantic_web.html" is
described, which can be resolved to an absolute URL because of the location
where this definition resides. For this resource a set of predicates is
defined, for example "creator" and "title". Those predicates have the
respective literal objects "Kore Nordmann" and "The long way to a semantic
web". The predicates used in this example belong to the common Dublin Core
standard.
The XML representation used in this example is only one possible
representation of RDF descriptions. This representation is especially used
inside XHTML web sites and other XML based document formats like SVG. Another
commonly used notation is the more readable *Notation 3* (N3), developed by
Tim Berners-Lee. RDF itself is not bound to any representational schema. The
above example using N3 would look like::

    @prefix dc: <http://purl.org/dc/elements/1.1/> .

    </blog/the_long_way_to_semantic_web.html>
        dc:title "The long way to a semantic web";
        dc:publisher "Kore Nordmann".
Dublin Core is a simple and standardized set of conventions to describe
document metadata and similar objects on the Internet. Dublin Core is not only
used inside RDF; there is also a set of XHTML meta element keys using the
Dublin Core namespace. It was already developed in 1994 and contains a set
of 15 core elements, with some additional optional element refinement fields.
The Dublin Core standard is maintained by the Dublin Core Metadata
Initiative. [#]_
The resources and objects in RDF can be considered as nodes and the predicates
as edges in a directed graph. In the graphical representation resources are
drawn as ellipses and objects as rectangles. The example from above
could then look like:
.. image:: rdf_graph.png
   :alt: RDF graph
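The graph view of RDF can also be sketched as a data structure. The following
Python snippet is only an illustration, not a real RDF library: triples are
plain tuples, and the hypothetical helper ``build_graph`` turns them into an
adjacency list in which every predicate is a labeled, directed edge.

```python
from collections import defaultdict

DOC = "/blog/the_long_way_to_semantic_web.html"

# (subject, predicate, object) triples taken from the example above.
TRIPLES = [
    (DOC, "dc:title", "The long way to a semantic web"),
    (DOC, "dc:creator", "Kore Nordmann"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


for predicate, obj in build_graph(TRIPLES)[DOC]:
    print(DOC, "--", predicate, "->", obj)
```

Literal objects such as the title only ever appear as edge targets, which
matches their role as rectangles without outgoing edges in the drawing.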
The properties of the RDF graph can themselves be described using RDF, as
shown on the Dublin Core definition page at http://purl.org/dc/elements/1.1/.
The objects in the RDF graph can themselves be resources, which is also
described as reification. Reification basically means the conversion of plain
relations into concepts.
In the example above the "license" could have been described by a resource,
which then could be described further by RDF, including specifications of
the license properties. Such properties could state, for example, whether the
license is considered open source, GPL compatible, etc.
RDF query languages
RDF is already used by a lot of web resources on the Internet, like Wikipedia.
With this data infrastructure, querying such data gets interesting. Languages
similar to the Structured Query Language (SQL) have been developed for
querying RDF data, like RDQL and SPARQL. Especially for SPARQL, several
implementations already exist in different languages and frameworks. An
example query to fetch all articles created by "Kore Nordmann" could look
like::

    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title WHERE {
        ?doc dc:title ?title .
        ?doc dc:creator "Kore Nordmann" . }
The PREFIX defines the ontologies used in the query, similar to XML namespaces
with their reference to the used XML schema. The variables after SELECT, which
can either start with ? or $, define the values, which should be returned from
the query. The conditions in the WHERE section have to match the subject
(resource), predicate, object structure of RDF and limit the subset of
returned values. All conditions of the WHERE group must match, while more
complex query structures using sub conditions, filters and optional matches
can be defined. [#]_
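How the conditions of a WHERE group restrict the result can be sketched with a
tiny pattern matcher in Python. This is not a SPARQL implementation: it knows
nothing about prefixes, filters or optional matches, the sample data is made
up, and the function names (``match_pattern``, ``select``) are assumptions of
this sketch. It only shows that every pattern has to match a triple under one
consistent set of variable bindings.

```python
def is_variable(term):
    """Variables start with ? or $, as in SPARQL."""
    return term.startswith("?") or term.startswith("$")


def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            # An already bound variable must keep its value.
            if result.setdefault(term[1:], value) != value:
                return None
        elif term != value:
            return None
    return result


def select(variables, patterns, triples):
    """Return one tuple of values per solution matching all patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for bindings in solutions:
            for triple in triples:
                candidate = match_pattern(pattern, triple, bindings)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return [tuple(s[name[1:]] for name in variables) for s in solutions]


# Made-up sample data for illustration.
TRIPLES = [
    ("/blog/a.html", "dc:title", "The long way to a semantic web"),
    ("/blog/a.html", "dc:creator", "Kore Nordmann"),
    ("/blog/b.html", "dc:title", "Another article"),
    ("/blog/b.html", "dc:creator", "Someone Else"),
]

QUERY = [
    ("?doc", "dc:title", "?title"),
    ("?doc", "dc:creator", "Kore Nordmann"),
]

print(select(["?title"], QUERY, TRIPLES))
```

The shared variable ``?doc`` is what joins the two conditions: the second
pattern only matches triples about a document already bound by the first one.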
RDFS, or RDF schema, describes the vocabulary for the RDF metadata
descriptions. Similar to DTD or XSD for XML, RDFS specifies the terms for
shared communication. An example for such a definition is the above mentioned
Dublin Core. Such a vocabulary can be called an **ontology**. With a proper
ontology the question mentioned above could be phrased in SPARQL::
    SELECT ?capital ?country
    WHERE {
        ?x abc:cityname ?capital .
        ?y abc:countryname ?country .
        ?x abc:isCapitalOf ?y .
        ?y abc:isInContinent abc:africa .
    }
Ontologies define sets of concepts and sets of instances of those concepts.
This can be compared to classes and their instances in object-oriented
software development. Each class should describe the denotation of its
instances. There may be inheritance and multiple inheritance between concepts,
like the concept "apple" inheriting from the concept "fruit". If the
inheritance relation is strictly hierarchical we talk about taxonomies.
Instances of the concepts, like one given apple "AppleA", may relate to other
instances by relations, like: "AppleA" "inHandOf" "John Smith". Relations may
exist between instances, between concepts and even between instances and
concepts.
Ontologies may additionally contain axioms, defining knowledge which is always
true and cannot be deduced from the given concepts. One axiom in the example
above could be that instances of the class apple always fall to the ground
when they are not related by "inHandOf" to an instance of the class "human".
Ontologies containing axioms are known as high-level ontologies.
.. image:: ontology.png
   :alt: Example ontology
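The comparison with object-oriented development can be taken literally for a
small sketch. The Python code below models the apple example under loose
assumptions: concepts become classes, relations become triples of objects, and
the axiom becomes a plain function (``falls_to_ground`` is a hypothetical
name). A real ontology language would state all of this declaratively.

```python
class Fruit:
    """Concept "fruit"."""


class Apple(Fruit):
    """Concept "apple", inheriting from the concept "fruit"."""


class Human:
    """Concept "human"."""

    def __init__(self, name):
        self.name = name


def falls_to_ground(apple, relations):
    """Axiom: an apple falls unless an "inHandOf" relation ties it to a human."""
    held = any(
        subject is apple and predicate == "inHandOf" and isinstance(obj, Human)
        for subject, predicate, obj in relations
    )
    return not held


# Instances of the concepts and one relation between two instances.
apple_a = Apple()
apple_b = Apple()
john = Human("John Smith")
RELATIONS = [(apple_a, "inHandOf", john)]

print(falls_to_ground(apple_a, RELATIONS), falls_to_ground(apple_b, RELATIONS))
```

The inheritance between the concepts is checked by ``isinstance``, while the
axiom adds knowledge that cannot be read off the class hierarchy alone.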
OWL, [#]_ the Web Ontology Language, is a specification by the W3C to specify
ontologies with a formal description language. The intention is to describe
the terms of an application in a way, that software agents are able to
interpret (understand) the given context.
Like RDF schema, OWL is based on the RDF syntax, but its expressive power
goes beyond RDF schema. In addition to RDF and RDF schema it allows
including expressions as known from first-order logic (predicate logic).
Just like RDF schema, OWL reuses the RDF resource, predicate, object triples to
relate classes and instances to each other. The RDF schema predicates are
still part of the OWL definitions and are extended by several additional
predicates.
An example of a concept which can be described using OWL is non-French wine:
it is defined by intersecting the concept wine with all objects which are not
located in France.
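The set logic behind this concept can be sketched with Python sets. OWL
expresses such class definitions with constructs like ``owl:intersectionOf``
and ``owl:complementOf``; the snippet below only mirrors the idea on a made-up
universe of named objects and is not OWL itself.

```python
# Made-up individuals for illustration.
WINES = frozenset({"Bordeaux", "Chianti", "Rioja"})
LOCATED_IN_FRANCE = frozenset({"Bordeaux", "Camembert"})

# A complement is only defined relative to some universe of known objects.
UNIVERSE = WINES | LOCATED_IN_FRANCE | {"Gouda"}

# Complement: everything which is not located in France.
NOT_IN_FRANCE = UNIVERSE - LOCATED_IN_FRANCE

# Intersection: wine that is not located in France.
NON_FRENCH_WINE = WINES & NOT_IN_FRANCE

print(sorted(NON_FRENCH_WINE))
```

Note that the complement alone also contains non-wines such as the cheese; it
is the intersection with the wine concept that yields the intended class.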
The involved technologies differ in complexity and in usefulness for human
beings and computers.
.. image:: reasoning.png
   :alt: Semantic technology relations (not really equidistant)
While HTML is easy to write and even easy to read for humans - especially if
representational directives are assigned to contents - the complexity of OWL
makes it hard to grasp the actual semantics of an instance. On the other hand,
OWL is intended to provide enough context to let software perform autonomous
reasoning on the database. Because of its complexity it is so far mostly used
in academic research, medical and military applications to model the specific
domain. Rule based semantic reasoners like Bossam exist and may already be
used for software reasoning.
Currently the still human readable semantic specification languages like
Microformats and RDF with some known ontologies like Dublin Core seem to be
the most used approaches for intentional semantic markup on the web.
Simplifying the web
A simplification of the above mentioned approaches are custom XML based markup
languages. Those custom markup specifications can be designed to perfectly
match one application. One example is DocBook, [#]_ an advanced XML based
document markup language with semantic markup for several document elements
not included in XHTML.
An example for such a custom markup language for a blog, reusing other already
defined markup languages, could look like::
Some random thoughts
The long way to a semantic web
Such markup is easily interpretable by software, which knows the associated
schema. It is easy to write and develop for the application developers. The
biggest problem remains the maintenance of the associated schema.
Of course this only indirectly implies an ontology, so that no automatic
software reasoning is possible without additional definitions.
Problems embedding other languages
Currently there are several major problems with embedding different XML
dialects into XHTML. Cascading Style Sheets (CSS), in the state currently
supported by the major browsers, is not aware of XML namespaces, since this is
only part of CSS 3. This problem is solvable by not using namespace prefixes
for any elements, but only declaring a namespace per XML sub tree, like in the
last example. Those elements can then be addressed normally by their local
names in CSS, which currently is totally namespace unaware.
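The effect of declaring a namespace per sub tree instead of using prefixes can
be sketched with Python's ``xml.etree.ElementTree``. The embedded ``blog``
vocabulary and its namespace URI ``http://example.org/blog`` are placeholders
invented for this sketch, as is the helper ``split_name``.

```python
import xml.etree.ElementTree as ET

# No prefixes anywhere: the embedded sub tree redeclares the default namespace.
DOCUMENT = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <blog xmlns="http://example.org/blog">
      <title>The long way to a semantic web</title>
    </blog>
  </body>
</html>
"""


def split_name(tag):
    """Split ElementTree's "{namespace}local" form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


root = ET.fromstring(DOCUMENT)
NAMES = [split_name(element.tag) for element in root.iter()]

for namespace, local in NAMES:
    print(local, "in", namespace)
```

A namespace aware parser still tells the two vocabularies apart, while a
namespace unaware consumer, like the CSS selectors discussed above, only sees
the plain local names ``html``, ``body``, ``blog`` and ``title``.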
XLink [#]_ is intended to be used to define links between documents. Such
links, embedded inside the documents, are currently not supported by all
browsers. [#]_ So for some browsers it is not possible to define clickable
links inside custom XML based documents.
There still remains the option of using XSLT to transform the used semantic
markup language to HTML with all of its representational markup, either on the
client or server side. In this case HTML would be misused as a plain
formatting language, which never was its intention and again means violating a
Currently there is nothing like a semantic web, either because of the
complexity of approaches like RDF schema and OWL, or because of the
limitations imposed by Microformats or especially HTML.
Simple and easy definitions of a markup language, without the dependencies on
languages like XSLT and HTML, could help to make it easier for software to
extract data from the web. With optionally associated ontologies, for example
defined using OWL, even software reasoning, as used in artificial
intelligence, could be developed for those documents.
.. [#] http://en.wikipedia.org/w/index.php?title=Semantics&oldid=245989124
.. [#] Semantikos, Henry George Liddell, Robert Scott, A Greek-English Lexicon, at Perseus
.. [#] Semaino, Henry George Liddell, Robert Scott, An Intermediate Greek-English Lexicon, at Perseus
.. [#] W3C Semantic Web Activity, http://www.w3.org/2001/sw/
.. [#] First draft of the HTML specification: http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/MarkUp.html
.. [#] Rationales for table less web design: http://en.wikipedia.org/wiki/Tableless_web_design#Rationale
.. [#] "HTML page analysis based on visual cues" http://research.microsoft.com/asia/dload_files/group/mediasearching/ICDAR-160_yang_y-4th.pdf
.. [#] HTML 5 - W3C Working Draft 22 January 2008 http://www.w3.org/TR/2008/WD-html5-20080122/#relationship0
.. [#] HTML 5 - Differences from HTML 4 http://en.wikipedia.org/wiki/HTML_5#Differences_from_HTML_4
.. [#] http://en.wikipedia.org/wiki/Folksonomy
.. [#] http://en.wikipedia.org/wiki/Taxonomy
.. [#] http://www.ietf.org/rfc/rfc2445.txt
.. [#] http://microformats.org/wiki/rfc-2426
.. [#] http://en.wikipedia.org/wiki/Blogosphere
.. [#] http://developer.yahoo.com/searchmonkey/
.. [#] http://microformats.org/wiki/Main_Page#Drafts
.. [#] http://www.molly.com/2005/03/31/interview-with-hkon-wium-lie/
.. [#] Dublin Core Metadata Initiative (DCMI) http://dublincore.org/
.. [#] SPARQL Query Language for RDF: http://www.w3.org/TR/rdf-sparql-query/
.. [#] OWL Web Ontology Language Overview: http://www.w3.org/TR/owl-features/
.. [#] DocBook: The Definitive Guide: http://docbook.org/tdg/en/html/docbook.html
.. [#] XML Linking Language (XLink) Version 1.0: http://www.w3.org/TR/xlink/
.. [#] XLink browser support: http://kore-nordmann.de/blog/the_long_way_to_semantic_web.html#id6