Today's PHP Advent article revisits the topic of character sets and encodings. While I welcome this in general, because every developer should have at least a basic knowledge of the subject, I would like to clarify some points.
The article does not differentiate between the terms character set and encoding, and even calls "UTF-8" a character set. This is wrong. Unicode is a character set, and UTF-8 is just one encoding of Unicode, like UTF-16, UTF-32 or UCS-4.
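To make the distinction concrete, here is a minimal sketch that takes one Unicode character and prints its bytes in three different encodings. It assumes PHP 7 or later (for the "\u{...}" escape) and the iconv extension; the helper name encodedHex() is made up for this illustration.

```php
<?php
// Hypothetical helper: converts a UTF-8 input string to the target
// encoding and returns the resulting bytes as a hex string.
function encodedHex(string $utf8, string $encoding): string
{
    $bytes = iconv('UTF-8', $encoding, $utf8);
    if ($bytes === false) {
        throw new RuntimeException("Cannot convert to $encoding");
    }
    return bin2hex($bytes);
}

// U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) is one Unicode character.
$char = "\u{00E4}";

foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    printf("%-8s %s\n", $encoding, encodedHex($char, $encoding));
}
// UTF-8    c3a4
// UTF-16BE 00e4
// UTF-32BE 000000e4
```

The character is the same in all three cases; only the byte representation changes with the encoding.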
It is therefore not sufficient to specify a "character set" for functions like htmlentities(); you need to specify the _encoding_, since otherwise it would be impossible for PHP or any other tool to interpret the byte stream (or byte array) properly.
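As a rough illustration of why the encoding matters, the following sketch passes the same character to htmlentities() in two different byte representations. The variable names are mine, and the flags are chosen only to keep the example explicit.

```php
<?php
$utf8   = "\xC3\xA4"; // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" encoded as ISO-8859-1 (one byte)

// With the matching encoding, both byte sequences yield the same entity.
$fromUtf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
$fromLatin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');

// With the wrong encoding, the two UTF-8 bytes are read as two
// separate ISO-8859-1 characters.
$misread = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

echo $fromUtf8, "\n";   // &auml;
echo $fromLatin1, "\n"; // &auml;
echo $misread, "\n";    // &Atilde;&curren;
```

The function has no way to tell from the bytes alone which interpretation is intended, so the third argument has to name the encoding actually used.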
Besides the mbstring extension, PHP has the iconv() function enabled by default. It can not only recode your strings but also implements basic transliteration, although this depends on your libc and the locales installed on your machine. There is also iconv_strlen() to check the length of a string in characters rather than bytes.
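A short sketch of these iconv functions, assuming PHP 7 or later and the iconv extension. The transliteration result is deliberately not shown, because it depends on the libc and locale; on some systems the call may fail and return false.

```php
<?php
$utf8 = "Gr\u{00FC}\u{00DF}e"; // "Grüße" as UTF-8

// Recode from UTF-8 to ISO-8859-1.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

// Transliterate to plain ASCII. The output varies between systems,
// so it is only dumped here, not relied upon.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
var_dump($ascii);

// Length in bytes vs. length in characters.
echo strlen($utf8), "\n";                // 7
echo iconv_strlen($utf8, 'UTF-8'), "\n"; // 5
echo strlen($latin1), "\n";              // 5
```

Note that iconv_strlen() also needs to be told the encoding; if the second argument is omitted, it falls back to a configured default, which may not match your data.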
For further questions regarding character sets and encodings, there is the already mentioned PHP charset & encoding FAQ, which I wrote with the help of others and which should answer most of your questions. If you have feedback or additions, please don't hesitate to get in contact with me.