Character sets vs. encodings
First published at Friday 11 December 2009
Warning: This blog post is more than 15 years old – read and use with care.
Today's PHP Advent article revisits the topic of character sets and encodings. While I welcome this in general, because every developer should have at least basic knowledge of the subject, I would like to clarify some points.
Character set vs. encoding
The article does not differentiate between the terms character set and encoding, and even calls "UTF-8" a character set. This is wrong. Unicode is a character set, and UTF-8 is just one encoding for Unicode, like UTF-16, UTF-32 or UCS-4.
I understand the reasons for the confusion here, because even popular standards get this wrong, as I blogged earlier. My PHP charset & encoding FAQ describes this in further detail.
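To make the distinction concrete, here is a minimal sketch that recodes one and the same Unicode character into three encodings and prints the resulting bytes. The helper name `encodedBytes()` is made up for this illustration, and the example assumes the iconv extension is available and knows the `UTF-16BE` and `UTF-32BE` encoding names.

```php
<?php
// The Euro sign is a single character in the Unicode character set: U+20AC.
// It is written as its UTF-8 byte sequence here, so the example does not
// depend on the encoding of this source file.
$euroUtf8 = "\xE2\x82\xAC";

// Hypothetical helper: recode a UTF-8 string and return its bytes as hex.
function encodedBytes(string $utf8, string $targetEncoding): string
{
    return bin2hex(iconv('UTF-8', $targetEncoding, $utf8));
}

// Same character, three encodings, three different byte sequences.
echo bin2hex($euroUtf8), "\n";                  // e282ac   (3 bytes)
echo encodedBytes($euroUtf8, 'UTF-16BE'), "\n"; // 20ac     (2 bytes)
echo encodedBytes($euroUtf8, 'UTF-32BE'), "\n"; // 000020ac (4 bytes)
```

The character set only says which characters exist and which number each one has; the encoding decides which bytes end up in your string.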
Basically, it is not sufficient to specify the "character set" for functions like htmlentities(); you need to specify the _encoding_, since otherwise it would be impossible for PHP or any other tool to interpret the byte stream / array properly.
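A small sketch of why the encoding matters for htmlentities(): the same character "ä" is passed in as raw bytes in two different encodings, once with the matching encoding name and once with a wrong one. The variable names are only for illustration.

```php
<?php
// "ä" (U+00E4) as raw bytes in two different encodings.
$latin1 = "\xE4";      // ISO-8859-1: one byte
$utf8   = "\xC3\xA4";  // UTF-8: two bytes

// Correct: the third argument names the encoding the bytes really are in.
echo htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1'), "\n"; // &auml;
echo htmlentities($utf8, ENT_QUOTES, 'UTF-8'), "\n";        // &auml;

// Wrong: UTF-8 bytes declared as ISO-8859-1. Each byte is now treated
// as a character of its own, and the output is garbage.
echo htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1'), "\n";   // &Atilde;&curren;
```

Nothing in the byte string itself tells PHP which of the two interpretations is the intended one, which is why the encoding has to be passed explicitly.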
mbstring
Besides the mbstring extension, PHP has the iconv() function enabled by default, which can not only recode your strings but also implements basic transliteration, although this depends on your libc and the locales installed on your machine. There is also iconv_strlen() to get the length of a string in characters (vs. bytes with strlen()).
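The iconv functions mentioned above can be tried out with a short sketch like the following. The sample word is arbitrary, and the transliteration result is deliberately not shown as expected output, because it depends on the libc and locale in use (and may fail or yield placeholder characters on some systems).

```php
<?php
// "Käse" as UTF-8 bytes: the "ä" takes two bytes.
$word = "K\xC3\xA4se";

// Byte length vs. character length.
echo strlen($word), "\n";                // 5 (bytes)
echo iconv_strlen($word, 'UTF-8'), "\n"; // 4 (characters)

// Recoding: in ISO-8859-1 every character of this word is one byte.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $word);
echo strlen($latin1), "\n";              // 4 (bytes)

// Transliteration to ASCII: the outcome depends on libc and locale,
// so it is only dumped here instead of being compared to a fixed value.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $word);
var_dump($ascii);
```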
FAQ
For further questions regarding character sets and encodings, there is the already mentioned PHP charset & encoding FAQ I wrote with the help of others, which should answer most of your questions. If you have feedback or extensions, please don't hesitate to get in contact with me.
Comments
Artur Ejsmont at Monday, 14.12. 2009
To be honest i did not like the article myself either. Had feeling that its too loosely written with no real attempt to explain nor clarify the whole thing.
"The mysql_set_charset() function can be used to set the character set of the connection, and mysql_client_encoding() can be used to determine the character set of the current connection"
Seriously? :- )
So far, im disappointed with 'Advent' episodes ... seems more like "social promotion point" not real information source.
Sorry for hard words, i really appreciate that people contribute knowledge. Maybe i was just expecting too much.
Art