Charset vs. Encoding - Kore Nordmann

Charset vs. Encoding

One of the most common errors when dealing with strings is to confuse charset with encoding, and I can understand very well why. So how could that happen?

The terms themselves are quite clear: a "charset" is a set of characters, which does not imply any encoding. "Encoding", on the other hand, is a commonly used term for the internal representation of data. As Wikipedia says:

Encoding is the process of transforming information from one format into another.

Wikipedia

Of course this is also discussed in my Charset/Encoding FAQ.

History

Some time ago we only had single-byte encodings, where each encoding also defined a charset. The ISO-8859-* charsets (or encodings), for example, each contained a specific set of characters together with a mapping to bytes. The only characters common to all members of the ISO-8859-* group were those contained in ASCII. So there was no real need to differentiate between the two terms. Sadly, this confusion did not only affect everyday texts, it also made it into several standards.

As early as 1991, Unicode 1.0.0 was published, defining a far bigger set of characters (a charset). With more than 256 characters it was no longer possible to map each character to a single byte. Different byte representations (encodings) evolved for different applications, for example UTF-8 and UTF-16.
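To make the distinction concrete, here is a minimal Python sketch. Python and the sample character are my own choice for illustration, not something taken from the post. It picks one character from the Unicode charset and shows that its byte representation depends entirely on the encoding:

```python
# One character from the Unicode character set: EURO SIGN, code point U+20AC.
char = "\u20ac"

# The code point identifies the character within the charset.
code_point = ord(char)

# Each encoding maps that same code point to a different byte sequence.
encoded = {
    name: char.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

print(f"U+{code_point:04X}")
for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
```

The charset answers "which character is this?", while the encoding answers "which bytes represent it?". The three encodings above need three, two and four bytes respectively for the same character.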

Unicode was developed to simplify the exchange of data across different languages, which was quite hard with the many different character sets (ISO-8859-*, for example, plus a lot of others), because they all mapped the same byte values to completely different characters.
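The problem with the older character sets can be sketched in a few lines of Python. The byte value and the three ISO-8859 parts are arbitrary picks for illustration:

```python
# A single byte above the ASCII range.
raw = bytes([0xE4])

# The same byte decodes to a different character in each ISO-8859 part.
decoded = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}

# Bytes in the ASCII range mean the same thing in all of them.
ascii_decoded = {name: b"A".decode(name) for name in decoded}

for name, char in decoded.items():
    print(name, repr(char), f"U+{ord(char):04X}")
```

Without knowing which of these was used, a receiver cannot tell whether the byte 0xE4 is a Latin, a Cyrillic or a Greek letter.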

Confusion in standards

Some days ago I blogged about the accept-charset attribute in HTML. Once you know the difference between charset and encoding, there is actually a hilarious paragraph in the specification:

This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received.

HTML 4.01 Specification

The attribute itself is called "accept-charset", but it accepts a list of "character encodings" - not a list of charsets. Don't bother trying to pass something like "Unicode" to that attribute. :)

The XML declaration is one of the first places where it has been done right. It reads:

<?xml version="1.0" encoding="UTF-8"?>
<doc/>
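As a rough illustration of why "encoding" is the right word here, the following Python sketch serializes the same document twice with different declared encodings. The helper name `serialize` and the sample content are hypothetical. It uses the standard library's `xml.etree.ElementTree`, which decodes the bytes according to the declaration:

```python
import xml.etree.ElementTree as ET


def serialize(encoding: str) -> bytes:
    """Build a tiny document and encode it as declared (hypothetical helper)."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><doc>\u00e4</doc>'
    return doc.encode(encoding)


latin1_bytes = serialize("ISO-8859-1")
utf8_bytes = serialize("UTF-8")

# The bytes on the wire differ, but the parsed characters are the same.
latin1_text = ET.fromstring(latin1_bytes).text
utf8_text = ET.fromstring(utf8_bytes).text
print(latin1_bytes != utf8_bytes, latin1_text == utf8_text)
```

The declaration names the byte mapping, not the set of characters. In both cases the character set of the document is Unicode.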

On the other hand, several other technologies we work with daily also confuse these terms. Let's take HTTP, for example.

In HTTP (1.0 and 1.1) there are two headers, "Accept-Charset" and "Accept-Encoding". The "Accept-Encoding" header does not specify the character encoding used during the transfer, but actual encodings of the whole content, such as "gzip".
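The two layers can be told apart in a short Python sketch. The sample text is made up, and the standard `gzip` module stands in for whatever a real server would use:

```python
import gzip

text = "Gr\u00fc\u00dfe"  # sample text containing non-ASCII characters

# Step 1: the character encoding turns characters into bytes.
body = text.encode("utf-8")

# Step 2: the content coding (what Accept-Encoding negotiates) transforms
# those bytes again, without knowing anything about characters.
compressed = gzip.compress(body)

# The receiver undoes both steps in reverse order.
restored = gzip.decompress(compressed).decode("utf-8")
print(restored == text)
```

Both steps are "encodings" in the general sense of the Wikipedia definition, which is probably part of the reason the header names are so confusing.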

"Accept-Charset" was really intended to receive lists of character sets, as shown by this example from the RFC:

Accept-Charset: iso-8859-5, unicode-1-1;q=0.8

This really does specify "Unicode", which is a character set. But since many different encodings are used for Unicode, this does not help the server at all: it is impossible to really decide which encoding the client uses.
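One way to see the problem is to ask a codec registry for a byte mapping under each name. The sketch below uses Python's `codecs.lookup` purely as an example of such a registry, and the helper name `is_encoding` is hypothetical. Other platforms may know different names:

```python
import codecs


def is_encoding(name: str) -> bool:
    """Return True if Python knows a codec (a byte mapping) under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for name in ("iso-8859-5", "utf-8", "utf-16", "unicode-1-1"):
    print(f"{name:12} {is_encoding(name)}")
```

In this registry at least, a name that only identifies the character set has no codec behind it, so there is nothing to turn characters into bytes with.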

Some time ago I logged the HTTP Accept-* headers on various sites to get test data for a parser, so that I could test the parser not only against the specification, but also against real data. Evidently browsers send encoding names like "UTF-8" in the Accept-Charset header, not character set names. From a list of unique Accept-Charset header values, here is the popularity of the charset / encoding names used:

14 utf-8
13 *
4 iso-8859-1
3 utf-16
3 ISO-8859-1
2 windows-1252
2 windows-1251
1 windows-1250
1 UTF-8
1 koi8-r
1 ISO-8859-2
1 ISO-8859-15
1 cp1251
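For reference, a parser for such header values could look roughly like the following Python sketch. This is not the parser I tested back then, just a minimal illustration. The function name `parse_accept_charset` and the second sample header are made up, and malformed q-values are simply treated as 0:

```python
def parse_accept_charset(value: str) -> list:
    """Parse an Accept-Charset value into (name, q) pairs, best first."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, *params = [part.strip() for part in item.split(";")]
        quality = 1.0  # q defaults to 1 when omitted
        for param in params:
            key, _, raw = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(raw)
                except ValueError:
                    quality = 0.0  # assumption: malformed q means "not acceptable"
        # Names are case-insensitive, so normalise them for comparison.
        entries.append((name.lower(), quality))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


print(parse_accept_charset("iso-8859-5, unicode-1-1;q=0.8"))
print(parse_accept_charset("utf-8;q=0.7, ISO-8859-1, *;q=0.3"))
```

Lowercasing the names also merges entries such as "utf-8" and "UTF-8", which show up separately in the raw list above.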

An interesting side note: a lot of clients claim to accept any ("*") charset / encoding. Obviously they aren't able to decode every one of them properly. IMHO not a really sane default.

Exactly the same is true for the metadata header element "Accept-Charset" in HTML.
