Tagging - Broken by design
First published on Thursday, 15 November 2007
Warning: This blog post is more than 16 years old - read and use with care.
Tagging, as a form of folksonomy, is by now often treated as almost synonymous with the whole Web 2.0 hype. Nowadays you can't develop a new application without adding tags, and I wanted to do the same. But reviewing the concept in a slightly more complex environment reveals some major flaws, which also show up in the implementations of simple applications.
This article lists some structural problems, and will be followed by other articles describing solutions for some of these cases.
In case you have not come across these terms yet, use this section as a short introduction, so that we are talking about the same things.
We can see a tag as a string of arbitrary characters assigned to any kind of object in your application. Which characters are allowed differs a lot between applications, but that is not relevant for this discussion.
A taxonomy is basically a hierarchical structure for an arbitrary number of objects.
Folksonomy is a portmanteau of the words "folk" and "taxonomy", indicating a taxonomy built by users. As the term is often used as an equivalent to tagging, you may simply read it as a description of the concept of users structuring content with tags.
Problems with tagging
From the definitions above, tagging sounds like the perfect tool for the really complex job of structuring arbitrary data and adding semantic meaning to binary data.
Different users - opposing associations
Some projects, like the ESP game, use collaborative tagging and take the intersection of the tags assigned by different users to find tags which are meaningful and can be considered common sense. But what does "common sense" mean in this case? The internet grows, but still stays quite close to its Anglo-American base. Remember that it has its roots at CERN and was initially established primarily across Europe and the USA.
So, if we start to find "common sense" tags for images or other binary data, we get a generalized idea of the "folk" visiting the website, which mostly consists of people from one language and cultural background.
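To make the intersection idea concrete, here is a minimal sketch in Python. It is not how the ESP game is actually implemented. The function name `agreed_tags`, the `min_users` threshold and the sample labels are assumptions made up for illustration.

```python
from collections import Counter

def agreed_tags(tags_per_user, min_users=2):
    """Return the tags that at least `min_users` different users assigned."""
    counts = Counter()
    for tags in tags_per_user:
        # Normalise and deduplicate per user, so one user cannot vote twice.
        counts.update({tag.strip().lower() for tag in tags})
    return {tag for tag, n in counts.items() if n >= min_users}

# Hypothetical tags from three users for the same image.
labels = [
    ["man", "president", "suit"],
    ["Man", "bush", "president"],
    ["man", "devil"],
]
print(sorted(agreed_tags(labels)))  # ['man', 'president']
```

Whatever survives the threshold is only the "common sense" of the users who happened to take part. The tag "devil" from the third user simply disappears.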
An extreme example
An example which should prove this point, because it is quite extreme, is the tagging of images that cause emotional reactions in most people. Using an example from the Anglo-American background is difficult, though: because TV shows and series from Europe and the USA are available almost everywhere, nearly everybody in the world is at least a bit influenced by this culture and will probably rate such material similarly.
So let's simply consider an image of the president of the USA, George W. Bush, and think about how it would be tagged by people from different nations - even if everybody tagged it in English rather than in their native language.
The ESP game gives us a nice idea of which tags several people agree on for such images, like the tags "dump" and "yuck" for an image of George W. Bush, as mentioned in this talk. This could of course only happen for people from an Anglo-American background. There are many people in the world who could tag such an image only with "man", so the tags "bush" or "president" would provide additional information to them. From some cultural backgrounds, like the destitute countries with an Islamic background, the tags "dump" and "yuck" could even be replaced by tags like "enemy" or "devil". Those tags also carry information which is only relevant for one cultural background, and they strongly conflict with the opinions of other users, like the ones who actually elected Bush president.
The assumption in the previous paragraph, that every user tags in English, is of course wrong. At least in Europe it is absolutely common to provide websites in more than one language, and in this case not only the template strings need to be translated, but also all content "objects", which include tags.
Consider Wikipedia as an example, where a lot of articles are translated, exist in different languages and are also cross-linked. Currently the only real structure on Wikipedia is given by the cross references between documents, which can be described by a graph, a common data structure in computer science. There are also manually maintained categories, but maintaining them is a lot of work - and they are maintained separately for each language.
Without going into detail on categories here, let's consider using tags to structure this content in a multilingual environment and to provide a more accessible structure (a list or a tree) for it. On a website where contents and strings are in some non-English language you of course can't expect users to tag in English. There may be a lot of users who don't know English at all, or just don't feel confident reading or writing it.
So we end up with tags in different languages, which really causes problems when we want to use them for structuring content. Translations are not bijective, as you may have noticed when learning a foreign language. A nice example of this is a translation path like:
friends <-> Freunde <-> company <-> Unternehmen <-> operation
Here "operation" is obviously not a synonym for "friends" in English. You can find thousands of similar examples.
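The drift along such a translation path can be reproduced with a tiny sketch. The dictionary below is hand-written and contains only the word pairs from the path above. The names `EN_TO_DE`, `invert` and `round_trips` are assumptions for illustration, not part of any real translation library.

```python
# Hypothetical mini dictionary holding only the pairs from the path above.
# A real dictionary would list many more candidates per word.
EN_TO_DE = {
    "friends": ["Freunde"],
    "company": ["Freunde", "Unternehmen"],
    "operation": ["Unternehmen"],
}

def invert(mapping):
    """Build the reverse dictionary (German -> English)."""
    inverse = {}
    for source, targets in mapping.items():
        for target in targets:
            inverse.setdefault(target, []).append(source)
    return inverse

DE_TO_EN = invert(EN_TO_DE)

def round_trips(word, hops):
    """English words reachable from `word` after `hops` en -> de -> en round trips."""
    reached = {word}
    for _ in range(hops):
        german = {de for en in reached for de in EN_TO_DE.get(en, [])}
        reached = {en for de in german for en in DE_TO_EN.get(de, [])}
    return reached

print(sorted(round_trips("friends", 1)))  # ['company', 'friends']
print(sorted(round_trips("friends", 2)))  # ['company', 'friends', 'operation']
```

Because each word maps to several candidates, every round trip widens the set of "equivalent" tags. Merging tags across languages by plain dictionary lookup would therefore put "friends" and "operation" into the same group after only two hops.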
The same tag may mean something completely different depending on its context. In the example above, an image of George W. Bush has been tagged with "bush" by the ESP game. The term "bush" is also used for a category of plants. A user who clicks on the tag "bush" probably expects only images matching one of those two meanings.
There are people working on this, as you can see in this Google talk, but it is also something you need to respect when building hierarchical structures from tags. A short summary: you may get an idea of the real meaning of a tag from the other tags used for the same content object. This aggregated "metadata" may be used to put the object in the right place in the hierarchy. Or you know the context in which the user acts, so you know which meaning of the tag they are looking for. There are of course several sources you could get this context from, like the user's prior queries, or you let the user define the context explicitly, for example with additional tags.
Basically - as described in the previous paragraph - you have a graph of content objects, each associated with a list of tags. Once you have enough data and a lot of tags, you will want to provide hierarchical structures for browsing the tags. It may not be obvious, but this is possible - even though it gets a lot harder if you try to respect the problems described in the last two paragraphs.
There are two ways to do this, and an algorithm already exists for each of them. I will try to link the corresponding publications in a dedicated section in the near future.
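As a rough illustration of guessing a tag's meaning from the tags used alongside it, the following sketch scores each possible meaning by its overlap with the object's tags. It is not the approach from the talk. The `SENSES` vocabularies and the function `guess_sense` are invented for this example.

```python
# Hypothetical context vocabularies for the two meanings of "bush".
SENSES = {
    "bush (person)": {"president", "man", "usa", "politics"},
    "bush (plant)": {"plant", "garden", "green", "leaves"},
}

def guess_sense(object_tags, senses=SENSES):
    """Pick the meaning whose context words overlap most with the object's tags.

    Returns None if no context word matches; ties go to the first meaning.
    """
    scores = {sense: len(words & object_tags) for sense, words in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_sense({"bush", "man", "president"}))  # bush (person)
print(guess_sense({"bush", "garden"}))            # bush (plant)
print(guess_sense({"bush"}))                      # None
```

The last call shows the limit of this idea: an object tagged only with "bush" carries no context at all. In that case the context has to come from the user instead, for example from earlier queries or from additional tags.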
Reuse user-defined hierarchies
If some users define personal hierarchies for some tags, you can calculate a global hierarchy from them - with more users it will become more and more complete and accurate.
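A very simple way to combine personal hierarchies is a majority vote per tag, sketched below. This is not the published algorithm mentioned above, only an assumption about how such a merge could look. It models each personal hierarchy as a hypothetical `child -> parent` mapping and does not check the result for cycles.

```python
from collections import Counter, defaultdict

def merge_hierarchies(user_hierarchies):
    """Give every tag the parent that most users chose for it."""
    votes = defaultdict(Counter)  # child tag -> Counter of proposed parents
    for hierarchy in user_hierarchies:
        for child, parent in hierarchy.items():
            votes[child][parent] += 1
    return {child: parents.most_common(1)[0][0] for child, parents in votes.items()}

# Hypothetical personal hierarchies of three users.
users = [
    {"president": "man", "george w. bush": "president"},
    {"president": "politics", "george w. bush": "president"},
    {"president": "man"},
]
print(merge_hierarchies(users))
# {'president': 'man', 'george w. bush': 'president'}
```

With only three users one dissenting vote is already a third of the data. This is why the result only becomes reliable as more users contribute.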
Consider often used tags as categories
You may treat tags which are used very often as categories and group the less frequently used tags below them. This is not easy, and you may need lots of tags to get valid results. From the example above: "man" will occur far more often than "president", which in turn will be used for more different images than "george w. bush" - so you might get exactly this kind of hierarchical structure.
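The frequency idea can be sketched as follows. A tag becomes a candidate category for another tag if it is used more often and appears on most of the objects carrying that other tag. The function name, the `threshold` value and the sample data are assumptions for illustration. This is not a reference to a specific published algorithm.

```python
from collections import defaultdict

def frequency_hierarchy(objects, threshold=0.8):
    """Map each tag to its nearest broader tag, or None for top-level tags."""
    tagged = defaultdict(set)  # tag -> ids of the objects carrying it
    for object_id, tags in objects.items():
        for tag in tags:
            tagged[tag].add(object_id)

    parents = {}
    for tag, ids in tagged.items():
        # A broader tag is used more often and covers most objects of `tag`.
        broader = [
            other for other, other_ids in tagged.items()
            if len(other_ids) > len(ids)
            and len(ids & other_ids) / len(ids) >= threshold
        ]
        # The least frequent candidate is the closest ancestor.
        parents[tag] = min(broader, key=lambda t: len(tagged[t]), default=None)
    return parents

# Hypothetical image ids with their tags.
images = {
    1: {"man", "president", "george w. bush"},
    2: {"man", "president"},
    3: {"man", "president", "george w. bush"},
    4: {"man"},
    5: {"man"},
}
print(frequency_hierarchy(images))
```

For this sample data "george w. bush" ends up below "president", which ends up below "man". With few objects per tag the ratios are very noisy, which matches the remark above that lots of tags are needed for valid results.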
OK, this was just a list of problems - but I do not only want to rant here. I will post follow-ups to this article about possible solutions, or with more details on the solutions already mentioned.
There will also be another blog post about machine tags and similar existing predefined ontologies.
And to finish this short introduction - if you find any good material on tagging and related topics, I would be glad if you could send me a link or pointer to email@example.com.
I already deleted comments which started to discuss George W. Bush, his election or politics. He is just a random example, which works well because he is known worldwide. I do not want to state any opinion on this in this particular blog post - and I really do not want any discussion about it, and will continue to delete all such comments.