Character sets vs. encodings - Kore Nordmann

Character sets vs. encodings

Today's PHP Advent article revisits the topic of character sets and encodings once more. While I welcome this in general, because every developer should have at least basic knowledge of the subject, I would like to clarify some points.

Character set vs. encoding

The article does not differentiate between the terms character set and encoding, and even calls "UTF-8" a character set. This is wrong. Unicode is a character set, and UTF-8 is just one encoding for Unicode, like UTF-16, UTF-32 or UCS-4.
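
To make the distinction concrete, here is a minimal sketch that takes a single character from the Unicode character set (U+00E4) and prints the bytes that three different encodings use for it. It assumes PHP 7 or later (for the "\u{...}" escape) and an iconv implementation that knows the UTF-16BE and UTF-32BE names; the variable names are only illustrative.

```php
<?php
// One character from the Unicode character set: U+00E4.
// The "\u{...}" escape yields that character encoded as UTF-8.
$char = "\u{00E4}";

// The same code point, serialized with three different encodings.
$utf8  = bin2hex($char);
$utf16 = bin2hex(iconv('UTF-8', 'UTF-16BE', $char));
$utf32 = bin2hex(iconv('UTF-8', 'UTF-32BE', $char));

echo "UTF-8:  ", $utf8, "\n";  // c3a4
echo "UTF-16: ", $utf16, "\n"; // 00e4
echo "UTF-32: ", $utf32, "\n"; // 000000e4
```

The character is the same in all three cases; only its byte representation changes with the encoding.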

I understand the reasons for the confusion here, because even popular standards get this wrong, as I blogged earlier. My PHP charset & encoding FAQ describes this in further detail.

Basically, it is not sufficient to specify the "character set" for functions like htmlentities(); you need to specify the _encoding_, since otherwise it would be impossible for PHP or any other tool to interpret the byte stream / array properly.
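
As a rough illustration of why the encoding matters, the following sketch passes the same two bytes to htmlentities() twice, once declared as UTF-8 and once as ISO-8859-1. The flags are passed explicitly because the defaults differ between PHP versions; the variable names are made up for this example.

```php
<?php
// The two bytes 0xC3 0xA4: one character if read as UTF-8,
// two separate characters if read as ISO-8859-1.
$bytes = "\xC3\xA4";

$asUtf8   = htmlentities($bytes, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$asLatin1 = htmlentities($bytes, ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');

echo $asUtf8, "\n";   // &auml;
echo $asLatin1, "\n"; // &Atilde;&curren;
```

Identical input bytes produce different entities, because the declared encoding decides how the byte stream is split into characters.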

mbstring

Besides the mbstring extension, PHP has the iconv() function, which is enabled by default. It can not only recode your strings, but also implements basic transliteration, although the latter depends on your libc and the locales installed on your machine. There is also iconv_strlen(), which returns the length of a string in characters (as opposed to strlen(), which counts bytes).
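
A short sketch of these functions, assuming the iconv extension is available. The transliteration result is deliberately not shown, since it varies with the libc and locale in use.

```php
<?php
// "äbc" encoded as UTF-8: 4 bytes, 3 characters.
$utf8 = "\xC3\xA4bc";

$byteLength = strlen($utf8);                // counts bytes
$charLength = iconv_strlen($utf8, 'UTF-8'); // counts characters

// Recoding: same characters, different byte representation.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

// Transliteration to ASCII: the output depends on libc and locale.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

echo $byteLength, " bytes, ", $charLength, " characters\n"; // 4 bytes, 3 characters
echo bin2hex($latin1), "\n";                                // e46263
var_dump($ascii);
```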

FAQ

For further questions regarding character sets and encodings, there is the already mentioned PHP charset & encoding FAQ, which I wrote with the help of others and which should answer most of your questions. If you have feedback or additions, please don't hesitate to get in contact with me.
