Character sets vs. encodings

First published on Friday, 11 December 2009

Warning: This blog post is more than 15 years old; read and use with care.

Today's PHP Advent article revisits the topic of character sets and encodings once more. While I welcome this in general, because every developer should have at least basic knowledge of the subject, I would like to clarify some points.

Character set vs. encoding

The article does not differentiate between the terms character set and encoding, and even calls "UTF-8" a character set. This is wrong. Unicode is a character set, and UTF-8 is just one encoding for Unicode, like UTF-16, UTF-32 or UCS-4.
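To make the distinction concrete, here is a minimal sketch that takes a single Unicode character (the euro sign, U+20AC) and serializes it with three different encodings. It assumes PHP 7 or later for the "\u{...}" escape and an iconv implementation that knows UTF-16BE and UTF-32BE, as glibc does; the variable names $euro and $encoded are only illustrative.

```php
<?php
// One character from the Unicode character set: U+20AC (EURO SIGN).
// The "\u{...}" escape (PHP 7+) yields its UTF-8 byte sequence.
$euro = "\u{20AC}";

// The same character, serialized with three different encodings.
$encoded = [
    'UTF-8'    => $euro,
    'UTF-16BE' => iconv('UTF-8', 'UTF-16BE', $euro),
    'UTF-32BE' => iconv('UTF-8', 'UTF-32BE', $euro),
];

// Print the byte length and the raw bytes (as hex) for each encoding.
foreach ($encoded as $encoding => $bytes) {
    printf("%-8s %d bytes: %s\n", $encoding, strlen($bytes), bin2hex($bytes));
}
```

The character is the same in all three cases; only the bytes that represent it differ, which is exactly the difference between a character set and an encoding.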

I understand the reasons for the confusion here, because even popular standards get this wrong, as I blogged earlier. My PHP charset & encoding FAQ describes this in further detail.

Basically, it is not sufficient to specify the "character set" for functions like htmlentities(); you need to specify the _encoding_, since otherwise it would be impossible for PHP or any other tool to interpret the byte stream / array properly.
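A small sketch of why the encoding matters for htmlentities(): the same bytes produce different output depending on which encoding the function is told to assume. The input string and variable names are made up for illustration, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// "é" (U+00E9) stored as UTF-8 occupies two bytes: 0xC3 0xA9.
$input = "caf\u{00E9}";

// Correct: tell htmlentities() how the bytes are actually encoded.
$asUtf8 = htmlentities($input, ENT_QUOTES, 'UTF-8');

// Wrong on purpose: read as ISO-8859-1, the same two bytes are
// interpreted as two separate characters.
$asLatin1 = htmlentities($input, ENT_QUOTES, 'ISO-8859-1');

echo $asUtf8, "\n";   // caf&eacute;
echo $asLatin1, "\n"; // caf&Atilde;&copy;
```

The function has no way to guess which interpretation is meant; it can only trust the encoding it is given.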

mbstring

Besides the mbstring extension, PHP has the iconv() function enabled by default. It can not only recode your strings, but also implements basic transliteration, although the result depends on your libc and the locales installed on your machine. There is also iconv_strlen() to check the length of a string in characters (vs. bytes in strlen()).
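As a rough sketch of both functions: the example below compares byte length with character length and then attempts a transliteration to ASCII. The sample string and the locale name en_US.UTF-8 are assumptions; the transliteration result varies with libc and the installed locales, so no particular output is claimed for it.

```php
<?php
// "Grüße" written with escapes so the example does not depend on the
// encoding of this source file (requires PHP 7+).
$text = "Gr\u{00FC}\u{00DF}e";

echo strlen($text), "\n";                // 7: length in bytes
echo iconv_strlen($text, 'UTF-8'), "\n"; // 5: length in characters

// Transliteration: the result depends on libc and the current locale,
// so no specific output is assumed here. iconv() returns false on failure.
setlocale(LC_CTYPE, 'en_US.UTF-8'); // assumption: this locale is installed
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
var_dump($ascii);
```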

FAQ

For further questions regarding character sets and encodings, there is the already mentioned PHP charset & encoding FAQ, which I wrote with the help of others and which should answer most of your questions. If you have feedback or additions, please don't hesitate to get in contact with me.


Comments

Artur Ejsmont at Monday, 14.12. 2009

To be honest, I did not like the article either. I had the feeling that it was too loosely written, with no real attempt to explain or clarify the whole thing.

"The mysql_set_charset() function can be used to set the character set of the connection, and mysql_client_encoding() can be used to determine the character set of the current connection"

Seriously? :- )

So far, I'm disappointed with the 'Advent' episodes ... they seem more like a "social promotion point" than a real source of information.

Sorry for the harsh words; I really appreciate that people contribute knowledge. Maybe I was just expecting too much.

Art
