Kore Nordmann - PHP / Projects / Politics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Kore Nordmann
:Date: Fri, 11 Dec 2009 10:33:47 +0100
:Revision: 2
:Copyright: CC by-sa

============================
Character sets vs. encodings
============================

:Description:
    Today's PHP Advent article revisits the topic of character sets and
    encodings. While I welcome this in general, because every developer
    should have at least basic knowledge about this, I would like to clarify
    some points.

Today's `PHP Advent article`__ revisits the topic of character sets and
encodings. While I welcome this in general, because every developer should
have at least basic knowledge about this, I would like to clarify some points.

Character set vs. encoding
==========================

The article does not differentiate between the terms character set and
encoding, and even calls "UTF-8" a character set. This is wrong. Unicode is a
character set and UTF-8 is just an encoding for Unicode, like UTF-16, UTF-32
or UCS-4.

I understand the reasons for the confusion here, because even popular
standards get this wrong, as I `blogged earlier`__. My `PHP charset & encoding
FAQ`__ describes this in further detail.

Basically, it is not sufficient to specify the "character set" for functions
like ``htmlentities()``. You need to specify the *encoding*, since otherwise
it would be impossible for PHP or any other tool to interpret the byte stream
/ array properly.

mbstring
========

Besides the mbstring extension, PHP has the ``iconv()`` function *enabled by
default*. It can not only recode your strings, but also implements basic
transliteration, although the latter depends on your libc and the locales
installed on your machine. There is also ``iconv_strlen()`` to check the
length of a string in characters (vs. bytes in ``strlen()``).

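
The difference between bytes and characters, and the reason why functions
need to know the encoding, can be sketched in a few lines. This is only a
minimal illustration and not taken from the Advent article: the sample string
and all variable names are assumptions made up for this example, and the
result of the transliteration in particular depends on your libc and locales.

.. code-block:: php

   <?php
   // Illustrative sample string (not from the article): "Käse" spelled with
   // explicit UTF-8 bytes (ä = 0xC3 0xA4), so the example does not depend on
   // the encoding this file is saved in.
   $utf8 = "K\xC3\xA4se";

   // strlen() counts bytes; iconv_strlen() counts characters, but only
   // because it is told which encoding the bytes are in.
   $bytes = strlen($utf8);                // 5
   $chars = iconv_strlen($utf8, 'UTF-8'); // 4

   // The same Unicode characters in another encoding are different bytes.
   $utf16      = iconv('UTF-8', 'UTF-16BE', $utf8);
   $utf16Bytes = strlen($utf16);          // 8

   // htmlentities() has to know the encoding to interpret the byte stream.
   $right = htmlentities($utf8, ENT_QUOTES, 'UTF-8');      // K&auml;se
   $wrong = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1'); // K&Atilde;&curren;se

   // Transliteration to ASCII: the result depends on libc and the installed
   // locales and may even be false, so no expected output is given here.
   $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

   var_dump($bytes, $chars, $utf16Bytes, $right, $wrong, $ascii);

Passing the wrong encoding to ``htmlentities()`` does not raise an error here;
it silently produces two wrong entities for the single character, which is
exactly why the encoding has to be specified correctly.
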
FAQ
===

For further questions regarding character sets and encodings, there is the
already mentioned `PHP charset & encoding FAQ`__, which I wrote with the help
of others and which should answer most of your questions. If you have feedback
or additions, please don't hesitate to get in contact with me.

__ http://phpadvent.org/2009/character-sets-by-paul-reinheimer
__ http://kore-nordmann.de/blog/0082_charset_versus_encoding.html
__ http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#what-is-the-difference-between-a-charset-and-an-encoding
__ http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html

..
   Local Variables:
   mode: rst
   fill-column: 79
   End:
   vim: et syn=rst tw=79

Comments
========

- Artur Ejsmont at Mon, 14 Dec 2009 10:44:17 +0100

  To be honest, I did not like the article either. I had the feeling that it
  was too loosely written, with no real attempt to explain or clarify the
  whole thing.

  "The mysql_set_charset() function can be used to set the character set of
  the connection, and mysql_client_encoding() can be used to determine the
  character set of the current connection"

  Seriously? :-)

  So far, I'm disappointed with the 'Advent' episodes ... they seem more like
  a "social promotion point" than a real information source. Sorry for the
  harsh words, I really appreciate that people contribute knowledge. Maybe I
  was just expecting too much.

  Art