First published at Thursday, 15 November 2007

Warning: This blog post is more than 15 years old – read and use with care.

Tagging - Broken by design

By now tagging, as a form of folksonomy, is often treated as nearly equivalent to the whole Web 2.0 hype. Nowadays you can't develop any new application without adding tags, and I wanted to do the same. But reviewing the system in a slightly more complex environment reveals some major flaws, which can also be detected in the implementations of simple applications.

This article lists some structural problems, and will be followed by other articles describing solutions for some of these cases.

Definitions

In case you have not yet heard of these terms, use this section as a short introduction, so that we are talking about the same things.

Tag

We can see a tag as a string of arbitrary characters assigned to any kind of object in your application. The characters allowed in tags differ a lot between applications, but they are not relevant for this discussion.

Taxonomy

A taxonomy is basically a hierarchical structure for an unspecified number of objects.

Folksonomy

Folksonomy is a portmanteau consisting of the words "folk" and "taxonomy", which indicates a taxonomy built by users. As this is often used as an equivalent to tagging, you may just see this as a description of the concept of users structuring contents using tags.

Problems with tagging

From the definitions above, tagging sounds like the perfect tool to solve the really complex job of structuring any data and adding semantic meaning to binary data.

Different users - opposing associations

Some projects, like the ESP game, use collaborative tagging and take the intersection of the tags from different users to find tags which are meaningful and can be considered common sense. But what does "common sense" mean in this case? The internet grows, but it still stays quite close to its Anglo-American base - remember that, from its original roots at CERN, it was primarily established across Europe and the USA.

So, if we start to find "common sense" tags for images, or other binary data, we get a generalized idea of the "folk" visiting the website, which mostly consists of people from one language / cultural background.
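The intersection idea can be illustrated with a minimal sketch. This is not the ESP game's actual implementation; the function name `agreed_tags`, the `min_users` threshold and the sample data are assumptions made up for illustration. It simply keeps every tag that enough distinct users assigned to the same object:

```python
from collections import Counter


def agreed_tags(taggings, min_users=2):
    """Return the tags that at least `min_users` distinct users assigned.

    `taggings` maps a (hypothetical) user name to the list of tags that
    user gave to one and the same object.
    """
    counts = Counter()
    for user_tags in taggings.values():
        # Normalise and de-duplicate, so each user counts once per tag.
        counts.update({tag.strip().lower() for tag in user_tags})
    return {tag for tag, n in counts.items() if n >= min_users}


# Made-up sample data for a single image.
taggings = {
    "user_a": ["man", "president", "suit"],
    "user_b": ["Man", "bush"],
    "user_c": ["man", "president"],
}

print(sorted(agreed_tags(taggings)))  # ['man', 'president']
```

Note that the result only reflects agreement among the users who happened to take part, which is exactly the bias described above.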

An extreme example

An example which should prove this point, because it is quite extreme, is the tagging of images which cause emotional reactions in most people. Using an example from the Anglo-American background is difficult: because of the general availability of TV with shows / series from Europe and the USA, nearly everybody in the world is at least a bit influenced by this culture and will probably rate this stuff similarly.

So let's simply consider an image of the president of the USA, George W. Bush, and think about how it would be tagged by people from different nations - even if everybody tagged it not in their native language, but in English.

The ESP game gives us a nice idea of which tags many people agree on for such images, like the tags "dump" and "yuck" for an image of George W. Bush, as mentioned in this talk. This could of course only happen for people from an Anglo-American background. There are people in the world who could only tag such an image with "man"; to them, the tags "bush" or "president" would simply provide more information. From some cultural backgrounds, like the destitute countries with an Islamic background, the tags "dump" and "yuck" could even be replaced by tags like "enemy" or "devil". These would also provide information only relevant for one cultural background, and they strongly conflict with the opinions of other users, like the ones who actually elected Bush president.

Tag i18n

The assumption in the previous section, that every user tags in English, is of course wrong. At least in Europe it is absolutely usual to provide websites in more than one language, and in this case not only the template strings need to be translated, but also all content "objects", which include tags.

Consider Wikipedia as an example, where a lot of articles are translated, exist in different languages and are also cross-linked. Currently the only real structure on Wikipedia is given by the document cross references, which can be described as a graph, a common data structure in computer science. There are also manually maintained categories, but it is really a lot of work to maintain them - and they are maintained separately for each language.

Without going into detail on categories here, let's consider using tags to structure this content in a multilingual environment and to provide a more accessible structure (list or tree) for those contents. On a website where contents and strings are in some non-English language you of course can't expect users to use English for tagging. There may be a lot of users who don't know English at all, or just don't feel confident reading / writing in English.

So we end up with tags in different languages, which really causes problems when we want to use them for content structuring. Translations are not bijective, as you may have noticed if you have ever learned a foreign language. A nice example of this is a translation path like:

friends <-> Freunde <-> company <-> Unternehmen <-> operation

Here "operation" is obviously not a synonym for "friends" in English. You may find thousands of similar examples.
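To make the drift visible, the path above can be modelled as a small undirected graph in which each edge is one acceptable dictionary translation. The following is only a sketch: the word pairs are taken from the example, while `build_graph` and `translation_path` are illustrative helper names, not part of any existing library. A breadth-first search then finds the chain that connects two words:

```python
from collections import deque

# Each pair is one acceptable translation, taken from the example path.
TRANSLATIONS = [
    ("friends", "Freunde"),
    ("Freunde", "company"),
    ("company", "Unternehmen"),
    ("Unternehmen", "operation"),
]


def build_graph(pairs):
    """Build an undirected graph: translations work in both directions."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def translation_path(graph, start, goal):
    """Breadth-first search for the shortest chain of translations."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for word in sorted(graph.get(path[-1], ())):
            if word not in seen:
                seen.add(word)
                queue.append(path + [word])
    return None  # the two words are not connected at all


graph = build_graph(TRANSLATIONS)
print(" <-> ".join(translation_path(graph, "friends", "operation")))
```

Every single step is a valid translation, yet the endpoints have nothing in common. Merging tags along such chains would therefore lump unrelated content together.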

Ambiguous meanings

The same tag may mean something completely different depending on its context. In the example above, an image of George W. Bush has been tagged with "bush" by the ESP game. The term "bush" is also used for a category of plants. If a user now clicks on the tag "bush", they would probably expect only images matching one of those two meanings of the tag.

There are people who work on this, as you can see in this Google talk, but this is also something you need to respect when building hierarchical structures from tags. A short summary: You may get an idea of the real meaning of a tag from the other tags which are also used for the same content object. This aggregated "metadata" may be used to put the object at the right place in the hierarchy. Or you know in which context the user acts, so you know which meaning of the tag (s)he is looking for. Of course there are several possible sources for this context, like prior queries of the user, or you let the user define the context somehow, for example by additional tags.
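The first idea, guessing the meaning of a tag from the other tags on the same object, could look roughly like the following sketch. The sense names and their context tags in `SENSES` are invented for illustration; a real system would have to learn them from data, and this is not the method presented in the talk:

```python
# Hypothetical context tags for the two meanings of the tag "bush".
SENSES = {
    "bush (person)": {"president", "man", "usa", "politics"},
    "bush (plant)": {"plant", "garden", "green", "leaves"},
}


def guess_sense(co_tags, senses=SENSES):
    """Pick the sense whose context tags overlap most with `co_tags`.

    Returns None if no sense shares any tag with the object. On a tie
    the sense listed first wins, which is an arbitrary simplification.
    """
    co_tags = {tag.lower() for tag in co_tags}
    scores = {sense: len(co_tags & context) for sense, context in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense(["man", "president", "suit"]))  # bush (person)
print(guess_sense(["garden", "green"]))           # bush (plant)
print(guess_sense(["car"]))                       # None
```

The same function could be fed with the user's prior queries instead of the object's other tags, to cover the second idea of knowing the context the user acts in.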

Tag dependencies

Basically - as described in the previous sections - you have a graph of content objects, each associated with a list of tags. Once you have enough data and a lot of tags, you will want to provide hierarchical structures to browse the tags. It may not be obvious, but this is possible - even though it gets a lot harder if you try to respect the problems described in the last two sections.

There are two ways to do this, and an algorithm already exists for each of them. I will try to add links to the corresponding publications in a dedicated section in the near future.

  • Reuse user defined hierarchies.

    If some users define personal hierarchies for some tags, you can calculate a global hierarchy from this - with more users this will get more and more complete and accurate.

  • Consider often used tags as categories

    You may use tags which are used very often as categories and group the less often used tags under them. This is not easy, and you may need lots of tags to get valid results. From the example above: "man" will occur far more often than "president", which in turn will be used for more different images than "george w. bush" - so you might get this kind of hierarchical structure.
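The second approach can be sketched in a few lines. This is a deliberately naive illustration and not one of the published algorithms mentioned above: the function name `frequency_hierarchy` and the sample data are assumptions. A tag gets as its parent the least frequent tag that is used more often overall and appears on every object carrying the tag:

```python
from collections import Counter


def frequency_hierarchy(objects):
    """Map each tag to a parent tag, or None for top-level tags.

    `objects` maps an object id to the list of tags assigned to it.
    """
    counts = Counter(tag for tags in objects.values() for tag in set(tags))
    parents = {}
    for tag in counts:
        carrying = [set(tags) for tags in objects.values() if tag in tags]
        # Tags that appear on every object which carries `tag` ...
        common = set.intersection(*carrying) - {tag}
        # ... and are used strictly more often are candidate categories.
        candidates = [t for t in common if counts[t] > counts[tag]]
        # The least frequent candidate is the most specific category.
        parents[tag] = min(candidates, key=lambda t: (counts[t], t), default=None)
    return parents


# Made-up sample data following the example above.
objects = {
    "img1": ["man", "president", "george w. bush"],
    "img2": ["man", "president", "george w. bush"],
    "img3": ["man", "president"],
    "img4": ["man", "president"],
    "img5": ["man"],
    "img6": ["man"],
}

for tag, parent in frequency_hierarchy(objects).items():
    print(f"{tag} -> {parent}")
```

With this data "george w. bush" ends up below "president", which ends up below "man". Real tag data is much noisier, so requiring the parent on every single object is too strict in practice; a threshold on the share of co-occurrences would be a natural relaxation.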

Summary

OK, this was just a list of problems - but I do not only want to rant here. I will post follow-ups to this article about possible solutions, or with more details on the solutions already mentioned.

There will also be another blog post about machine tags and other existing similar predefined ontologies.

And to finish this short introduction - if you find any good material on tagging and related topics, I would be glad if you could send me a link or pointer to tagging@kore-nordmann.de.

Note

I already deleted comments which started to discuss George W. Bush, his election or politics. It is just a random example, which works well because of his worldwide popularity. I do not want to state any opinion on this in this particular blog post - and I really do not want any discussion about it, and will continue to delete all such comments.

Comments

Rick at Friday, 16.11. 2007

Looking forward to getting those suggestions soon. I'm currently working on a social networking site, making some attempt to use user submitted "tags" in our search solution.

Unfortunately it has started to get a bit more complicated, in that the weight of a given tag now has to be calculated partially on how many people submit that tag... and then there's stemming of the terms, or not to stem... track locale information, or don't track locale information... and the list goes on. It's requiring constant tweaking to get things to essentially be evenly generic and incorrect.

fa at Friday, 16.11. 2007

These points are valid, but already a bit too high-level. If you want to find flaws, it's much easier.

What annoys me most in the first place are names. Let's say I attended some "Foo University" - then ~20% of all web apps won't allow the whitespace. Of the remaining 80% I bet that people will choose "University of Foo" or "Foo State University" or even "FU Mycity" about half of the time. Without a manual /semi-manual sorting you'll end up with quite useless tags, because the tagged items are split into 2 categories with the exact same meaning.

But it's not only names, just take a website about the upcoming video game "Especially Boring Soccer 2" or EBS2 for short. Possible tags now include: ebs, ebs2, ebsii, "especially boring soccer", "especially boring soccer2", "especially boring soccer 2", game, games, videogame, videogames. Yes, not even the name, but already the genre can lead to widespread mis-tagging, and until now I have seen quite a lot of websites that won't offer some sort of Google Suggest to tell you what similar tags are already known to the system.

kore at Friday, 16.11. 2007

@fa: You're right, but:

1) The whitespace issue is just an issue with broken applications - no structural / general problem.

2) This is fixable by user defined or predefined synonyms, which are just replaced - until you get to abbreviations. I first thought about including this in this text, but it is really similar to the translation thingy - especially when you remember, that the same abbreviation may mean completely different things in different languages / cultures.

kore at Friday, 16.11. 2007

@fa: One more note: the game / videogame stuff is just what I said about categories. You may want to get both (perhaps after some time, with other users fixing tags, etc.) to build up categories and use game as a category for videogames and other games.

Daemon at Saturday, 17.11. 2007

Tags used by a wide audience and generated by a wide audience often fail horribly. However, not all technologies are meant to be used by everyone. Tagging is one of them.

When making, for example, a news site (daily newspapers online), tags are used to quickly "describe" a news article a journalist is writing. A professional is doing it (and some of the words inside the article can be used as keywords and tags as well). Those tags are later used to link other articles, based on tag relevancy.