Today's PHP Advent article revisits the topic of character sets and encodings. While I welcome this in general, because every developer should have at least basic knowledge of the subject, I would like to clarify some points.
The article does not differentiate between the terms "character set" and "encoding", and even calls UTF-8 a character set. This is wrong: Unicode is a character set, and UTF-8 is just one encoding of Unicode, like UTF-16, UTF-32 or UCS-4.
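To make the distinction concrete, here is a minimal sketch that takes a single Unicode character (U+00E4, "ä") and serializes it with three different encodings. The variable names are only illustrative, and it assumes the iconv extension is available with support for these encoding names.

```php
<?php
// One character of the Unicode character set: U+00E4 ("ä").
// It is written as explicit UTF-8 bytes so the example does not
// depend on the encoding of this source file.
$utf8 = "\xC3\xA4";

// The same character, serialized with three different encodings.
$encodings = ['UTF-8', 'UTF-16BE', 'UTF-32BE'];
$bytes = [];
foreach ($encodings as $encoding) {
    $bytes[$encoding] = bin2hex(iconv('UTF-8', $encoding, $utf8));
}

print_r($bytes);
// UTF-8    => c3a4
// UTF-16BE => 00e4
// UTF-32BE => 000000e4
```

The character is the same in all three cases; only the byte sequence that represents it differs.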
I understand the reasons for the confusion, because even popular standards get this wrong, as I blogged earlier. My PHP charset & encoding FAQ describes this in further detail.
Basically, it is not sufficient to specify a "character set" for functions like htmlentities(); you need to specify the _encoding_, since otherwise it would be impossible for PHP or any other tool to interpret the byte stream / array properly.
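As a rough illustration of why the encoding matters here, the following sketch passes the same character to htmlentities() once with the matching encoding and once with the wrong one. The byte strings and variable names are made up for the example.

```php
<?php
$latin1 = "\xE4";      // "ä" encoded as ISO-8859-1 (one byte)
$utf8   = "\xC3\xA4";  // "ä" encoded as UTF-8 (two bytes)

// Matching encoding: both byte sequences are recognized as "ä".
$fromLatin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1'); // "&auml;"
$fromUtf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');        // "&auml;"

// Wrong encoding: the two UTF-8 bytes are read as two ISO-8859-1
// characters ("Ã" and "¤"), so the output is garbage.
$misread = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1'); // "&Atilde;&curren;"

echo $fromLatin1, "\n", $fromUtf8, "\n", $misread, "\n";
```

The bytes alone do not say which interpretation is correct; only the encoding parameter does.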
Besides the mbstring extension, PHP has the iconv() function enabled by default. It can not only recode your strings but also implements basic transliteration, although the latter depends on your libc and the locales installed on your machine. There is also iconv_strlen() to get the length of a string in characters (as opposed to bytes with strlen()).
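A short sketch of these functions, assuming the iconv extension is loaded; the sample string is just an example. No output is shown for the transliteration call because, as noted above, the result depends on the libc and the active locale.

```php
<?php
// "Grüße" written as explicit UTF-8 bytes.
$utf8 = "Gr\xC3\xBC\xC3\x9Fe";

$byteLength = strlen($utf8);                 // 7 bytes
$charLength = iconv_strlen($utf8, 'UTF-8');  // 5 characters

// Recoding: UTF-8 to ISO-8859-1, where each of these characters is one byte.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

// Transliteration: the result varies with libc and locale,
// so treat it as best effort and check it on your own system.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

printf("%d bytes, %d characters, %d bytes as ISO-8859-1\n",
    $byteLength, $charLength, strlen($latin1));
```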
For further questions regarding character sets and encodings, there is the already mentioned PHP charset & encoding FAQ I wrote with the help of others, which should answer most of your questions. If you have feedback or extensions, please don't hesitate to get in contact with me.
Artur Ejsmont at Mon, 14 Dec 2009 10:44:17 +0100
To be honest, I did not like the article myself either. I had the feeling that it is too loosely written, with no real attempt to explain or clarify the whole thing.
"The mysql_set_charset() function can be used to set the character set of the connection, and mysql_client_encoding() can be used to determine the character set of the current connection"
Seriously? :- )
So far, I'm disappointed with the 'Advent' episodes ... they seem more like a "social promotion point" than a real information source.
Sorry for the hard words, I really appreciate that people contribute knowledge. Maybe I was just expecting too much.
Art