Kore Nordmann - PHP / Projects / Politics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Kore Nordmann
:Date: Tue, 13 Oct 2009 17:51:32 +0200
:Revision: 29
:Copyright: CC by-sa

===============
PHP Charset FAQ
===============

:authors:
    - Kore Nordmann
:license:
    CC-by-sa
:Description:
    This is a list of frequently asked questions about charsets and encodings
    within the PHP ecosystem. It also answers some general charset and
    encoding questions, as well as HTTP related questions. If you know about
    other common related questions, feel free to send them to one of the
    authors.

.. contents:: List of questions
    :depth: 2

.. note::
    If the FAQ was helpful to you, you can order me a thank you here:
    http://wishlist.kore-nordmann.de/

General
=======

What is the difference between unicode and UTF-8/UTF-16/...?
------------------------------------------------------------

Unicode is a *charset*, which means just a set of characters. It says nothing
about how the characters are actually stored (mapped to bytes).

`UTF-8`__ / `UTF-16`__ / ... are *encodings*, which define how a character is
mapped to bytes in a string or byte array.

UTF-8, UTF-16 and UTF-32 differ mainly in the number of bytes used to encode a
character. UTF-8 uses one byte for the characters defined in ASCII__, and a
variable width of two to four bytes for other characters. UTF-32 always uses
four bytes for each character, which makes iterating over the characters in a
string trivial, but consumes much more space for common strings.

Choosing the correct default encoding for your application is not trivial. It
depends on which strings are common in your application and how they are used.

See also:

- `What is the difference between a charset and an encoding?`_

__ http://en.wikipedia.org/wiki/UTF-8
__ http://en.wikipedia.org/wiki/UTF-16
__ http://en.wikipedia.org/wiki/ASCII

What is the difference between a charset and an encoding?
---------------------------------------------------------

A charset is a set of characters which can be represented in a certain
encoding. The encoding actually defines which bytes are used for a certain
character.

An example: The character '☯' is available in the Unicode *charset*, and
probably other charsets, too. But there are different *encodings* for this
character, even for the same charset, like the following::

    Unicode character:  ☯
    UTF-8 encoded:      0xE2 0x98 0xAF
    UTF-16 encoded:     0x26 0x2F

What is the difference between a character and a byte?
------------------------------------------------------

Generally a byte, or a sequence of bytes, is just an internal representation
of a character, depending on the encoding used. The encoding maps characters
to bytes or byte sequences.

In singlebyte encodings each character in the charset maps to exactly one
byte, so bytes and characters are easily confused, because they always
represent the same thing.

Multibyte encodings like UTF-8 or UTF-16, on the other hand, are more common
nowadays and map characters to multiple bytes, so that the representations of
different characters may contain the same bytes::

    ⅱ, UTF-8 encoded: 0xE2 0x85 0xB1
    ⅲ, UTF-8 encoded: 0xE2 0x85 0xB2

As you can see, the first two bytes used for both characters are the same,
while only the third byte differs.

See also:

- `What is the difference between a charset and an encoding?`_
- `What does "multibyte charset/encoding" mean?`_

How do I determine the charset/encoding of a string?
----------------------------------------------------

There is no reliable way to do this. For example, all ISO 8859-* encodings
accept every combination of bytes, so there is no way to tell from the bytes
alone which encoding was used.

You may guess the encoding if you know something about the contents of a
string, for example by detecting multiple expected occurrences of some
uncommon characters.

UTF-8 multibyte character sequences do have some characteristics you may check
for, but each UTF-8 string may also be an ISO 8859-* string.
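The ambiguity described above can be made concrete with a small sketch. It is
only an illustration, not a detection method: the sample bytes are chosen
arbitrarily, and it assumes the PCRE and iconv extensions are available. The
same two bytes pass a UTF-8 validity check and can just as well be read as two
ISO 8859-1 characters:

```php
<?php
// The two bytes 0xC3 0xA4 are the single character "ä" in UTF-8.
$bytes = "\xC3\xA4";

// preg_match() with the /u modifier fails on invalid UTF-8 input,
// so a return value of 1 means "this is a valid UTF-8 sequence".
$isValidUtf8 = preg_match('//u', $bytes) === 1;

// Reading the very same bytes as ISO 8859-1 works as well: every byte
// is one character there, which gives the two characters "Ã" and "¤".
$readAsLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);

var_dump($isValidUtf8);
echo bin2hex($readAsLatin1), "\n";
```

Both interpretations succeed, so the check alone cannot tell you which one the
author of the string intended.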
To check if a string is a valid sequence of UTF-8 encoded characters you could
use the following regular expression. This won't actually tell you whether the
string *is* UTF-8, though; it still might be in nearly any other encoding::

    (^(?:
        [\x00-\x7f] |
        [\xc0-\xdf][\x80-\xbf] |
        [\xe0-\xef][\x80-\xbf]{2} |
        [\xf0-\xf7][\x80-\xbf]{3}
    )*$)x

Continuation bytes always lie in the range 0x80 to 0xBF. The expression only
checks this byte structure, so it still accepts some sequences that strict
UTF-8 forbids, such as overlong forms. Don't use it on big strings though, it
may crash PCRE.

What does "multibyte charset/encoding" mean?
--------------------------------------------

A multibyte encoding uses not only one but multiple bytes for one character.
The number of bytes used for one character may be variable, like in UTF-8 and
UTF-16, or fixed, like in UTF-32.

In a singlebyte encoding only 256 different characters can be represented, as
this is the number of different values one byte can have (2^8). This number of
characters is not sufficient for lots of languages, and especially not when
you try to fit the characters of multiple languages into one charset and
encoding (Unicode and UTF-\*).

See also:

- `What is the difference between unicode and UTF-8/UTF-16/...?`_

What does string transliteration mean?
--------------------------------------

When converting between different charsets it may happen that not all
characters of the source string are available in the destination charset. In
this case transliteration aims to provide another character, or a sequence of
characters, which sufficiently replaces the source character in the
destination charset. Common transliterations for the German umlaut ä may be::

    ä => ae
    ä => a

Transliteration in PHP
^^^^^^^^^^^^^^^^^^^^^^

In PHP you can transliterate strings using different functions.

1) The iconv() function supports very basic transliteration, depending on the
   installed locales. Unknown characters are transliterated when you append
   the string //TRANSLIT to the destination encoding, as shown in the
   conversion example: `How do I change the encoding of a string?`_.
2) The extension `pecl/translit`__ offers transliterations between several
   charsets, independent of the locales installed on your system. Check out
   its documentation__ for details.

__ http://pecl.php.net/package/translit
__ http://derickrethans.nl/translit.php

HTTP related
============

Why do I have such strange characters on my website?
----------------------------------------------------

The content you send is encoded in a different encoding than the one specified
for the client, or than the one the client detects.

When speaking of websites we talk about browsers in most cases, which
determine the encoding of a website based on two factors:

- The Content-Type header sent by the webserver
- The content-type meta tag in the (X)HTML header

To ensure that the browser detects the correct charset and encoding of your
website you should set both to the same value.

The content type header sent by your website can either be configured in the
web server configuration, or sent explicitly by PHP, using something like::

    header( 'Content-type: text/html; charset=utf-8' );

See also:

- `How do I send the correct charset/encoding for $client?`_
- `How do I determine the charset/encoding of a string?`_
- `Why shouln't I use htmlentities?`_

Which charset does $client send?
--------------------------------

In most cases the client sends the encoding specified for the website. This
works only if the browser could determine the encoding unambiguously.

If the browser can't detect the specified charset, you cannot really know the
charset of the input strings. This is one reason you should respect the HTTP
header "Accept-Charset", even though most browsers nowadays understand UTF-8.

In any case there may be misbehaving browsers, or clients which just try to
feed your application with invalid data. That's why you should try to handle
strings which do not match your expected encoding gracefully.

How do I send the correct charset/encoding for $client?
-------------------------------------------------------

Most HTTP clients send a header with a list of encodings/charsets they can
understand and prefer. The header is called `Accept-Charset`__ and is
available in PHP in ``$_SERVER['HTTP_ACCEPT_CHARSET']``. Most clients actually
send lists of encodings they understand instead of lists of charsets. A
typical header may look like::

    utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5

The header says that the client likes UTF-8 most (which is an encoding), and
thinks it can also handle all kinds of charsets / encodings, with a slight
preference for windows-1251, cp1251 and koi8-r.

Nowadays most clients can handle UTF-8 encoded content - for other clients you
should either use plain `UTF-7`__, which should work in most cases, or
transform the contents to the requested encoding before sending.

There is also an HTTP header Accept-Encoding, which actually does not have
anything to do with the encodings we talk about in this FAQ, but contains a
list of usable compression formats.

__ http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.2
__ http://en.wikipedia.org/wiki/UTF-7

How do I ensure the correct overall charset/encoding in my web application?
---------------------------------------------------------------------------

You need to ensure that you know the correct charset at every point of your
application.

You should recode all input strings directly when receiving them - ideally in
a central input handler, where you convert everything to the charset you use
consistently in your application. For this read: `Which charset does $client
send?`_.

Once all content follows a defined encoding, make sure that the backend uses
the same encoding and also returns data in it. Check the `databases`_ section
for this, depending on the type of backend you use.

With all content in your defined encoding you should also ensure that you
declare this encoding in the output.
Check `How do I send the correct charset/encoding for $client?`_ for details.

Does accept-charset help in HTML forms?
---------------------------------------

Short answer: No.

HTML forms may define the attribute accept-charset, which tells the browser
which encoding to use when sending data to the server. The funny thing is that
it breaks horribly and is not handled properly by any browser. A form using
this might look like::

    Test form using $encoding
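Only the title of the test form is preserved here. As a rough sketch of what
such a form could look like (not the form used for the tests below), the
following hypothetical helper renders a form with a hidden field and a text
field. The function name ``renderTestForm()`` and the field names
``contained`` and ``pasted`` are assumptions made for this illustration:

```php
<?php
// Hypothetical helper: renders a test form for one accept-charset value.
// This file is assumed to be saved as UTF-8, like the page it produces.
function renderTestForm(string $encoding): string
{
    // Escape the attribute value before embedding it into the markup.
    $attribute = htmlspecialchars($encoding, ENT_QUOTES, 'UTF-8');

    return '<form method="post" action="" accept-charset="' . $attribute . '">' . "\n"
        // The hidden field carries a string defined by the page itself ...
        . '  <input type="hidden" name="contained" value="öäü" />' . "\n"
        // ... while the text field receives whatever the user pastes.
        . '  <input type="text" name="pasted" value="" />' . "\n"
        . '  <input type="submit" value="Send" />' . "\n"
        . '</form>' . "\n";
}

echo renderTestForm('ISO-8859-15');
```

On the receiving side you would then inspect the raw bytes of both fields, for
example with bin2hex(), to see which encoding the browser really used.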
Testing this with different browsers on a website which is encoded using UTF-8
causes the following results. The "accept-charset" column contains the value
specified in the accept-charset attribute of the form. The "Contained string"
column shows the string passed in by the browser for the hidden field, while
the "Pasted string" is something the user entered.

============= ============== ================ ============= ===========
Browser       accept-charset Contained string Pasted string Received
============= ============== ================ ============= ===========
Firefox/3.0.3 ISO 8859-15    ���              ����☠         ISO, entity
Firefox/3.0.3 UTF-16         öäü              öäüß☠         UTF-8
Firefox/3.0.3 UTF-8          öäü              öäüß☠         UTF-8
Opera/9.61    ISO 8859-15    ���              ����☠         ISO, entity
Opera/9.61    UTF-16         öäü              öäüß☠         UTF-8
Opera/9.61    UTF-8          öäü              öäüß☠         UTF-8
Konqueror/4.1 ISO 8859-15    ���              ����☠         ISO, entity
Konqueror/4.1 UTF-16         ∅                ∅             Nothing
Konqueror/4.1 UTF-8          öäü              öäüß☠         UTF-8
MSIE 7.0      ISO 8859-15    öäü              öäüß☠         UTF-8
MSIE 7.0      UTF-16         öäü              öäüß☠         UTF-8
MSIE 7.0      UTF-8          öäü              öäüß☠         UTF-8
============= ============== ================ ============= ===========

Firefox and Opera at least send the correct values for the ISO 8859-15
encoding, while characters which are not contained in the associated charset
are represented by their decimal XML entities - which is OK.

When requesting the value as UTF-16, no browser does it right. Konqueror 4.1
does not even send anything at all, all other browsers just keep sending UTF-8
strings.

Internet Explorer completely ignores the accept-charset="" attribute and
always sends the values in the defined document encoding.

Generally speaking: It will not hurt if you define the accept-charset with the
same encoding you use on the site anyway, but it will not buy you anything
either. Defining an encoding which differs from the site encoding does not
work consistently across different browsers.

PHP related
===========

Which charset/encoding do strings have in PHP?
----------------------------------------------

PHP 5
^^^^^

Short answer: None.

Longer answer: PHP does not maintain any charset information for strings.
Strings are just arrays of bytes in PHP. This is especially problematic for
multibyte encodings, because there is a strict difference between character
and byte context, while a character always equals a byte in singlebyte
encodings.

You should try to maintain the information about the encoding of a string
yourself. The easiest way is to ensure that only one single encoding is used
in your complete application / framework / backend.

See also:

- `How do I iterate characterwise over a string?`_
- `Which charset does $client send?`_
- `How do I ensure the correct overall charset/encoding in my web application?`_
- `How do I determine the charset/encoding of a string?`_
- `What does "multibyte charset/encoding" mean?`_

How do I change the encoding of a string?
-----------------------------------------

PHP 5
^^^^^

There are several functions which offer easy encoding conversions, but there
are also some stumbling blocks you should remember.

The basic conversion may, for example, be done by the functions `iconv()`__ or
`mb_convert_encoding()`__, but the following problems may occur:

- The destination encoding only covers a subset of the charset of the source
  encoding
- The source string contains invalid data. Not all sequences of bytes are
  valid in each encoding.

The iconv() function knows the two suffixes //TRANSLIT and //IGNORE, which are
appended to the destination encoding string and define how to handle
characters missing in the destination encoding. An example could look like::

Invalid characters lead to warnings during conversions.

__ http://docs.php.net/manual/en/function.iconv.php
__ http://docs.php.net/manual/en/function.mb-convert-encoding.php

utf8_(de|en)code
~~~~~~~~~~~~~~~~

There are special handlers for ISO 8859-1 <-> UTF-8 en- and decoding in PHP,
the two functions `utf8_decode()`__ and `utf8_encode()`__.
utf8_decode() converts UTF-8 encoded strings to ISO 8859-1, while
utf8_encode() does the opposite::

Characters which have no equivalent in the charset which can be encoded by
ISO 8859-1 are replaced by question marks.

These functions are therefore quite useless, since they only know about two
encodings, and ISO 8859-1 can only encode 256 characters, which is a *really*
small subset of Unicode. You should always use iconv() instead, since you have
far more control over the conversion this way.

See also:

- `What does string transliteration mean?`_

__ http://docs.php.net/manual/en/function.utf8-decode.php
__ http://docs.php.net/manual/en/function.utf8-encode.php

How do I iterate characterwise over a string?
---------------------------------------------

PHP 5
^^^^^

This is a bit harder than you might think at first glance. First: You need to
know the actual encoding of the string you want to iterate over.

If it is a singlebyte encoding like ISO 8859-* you can just iterate bytewise
over the string - see the next question for details on this.

If it is a multibyte encoding there are basically three options for you:

1) Convert to an encoding which uses the same number of bytes for each
   character.

   You should choose the encoding you transform to depending on the charset of
   your strings. Safe for most cases are the four byte encodings UCS4 or
   UTF-32, which can encode the full Unicode charset and always use four
   bytes per character, which makes iterating over the string trivial.

   You might be able to convert your string to a singlebyte encoding like
   ISO 8859-1, if this covers the complete charset used. Beware of possible
   information loss here.

2) You can use methods of some multibyte extensions to extract string parts,
   for example `mb_substr()`__.

3) Respect the encoding specific multibyte characteristics while scanning the
   string. This depends a lot on the encoding you use and involves lots of
   checks. So this might be the slowest solution.
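The first option in the list above can be sketched roughly as follows. This is
only an illustration under a few assumptions: the sample string is my own, the
iconv extension is available, and ``UTF-32BE`` is chosen as the target
encoding to avoid the byte order mark that some iconv implementations prepend
for plain ``UTF-32``:

```php
<?php
// "aä☯" written as explicit UTF-8 bytes, so the file encoding does not matter.
$utf8 = "a\xC3\xA4\xE2\x98\xAF";

// After the conversion every character occupies exactly four bytes.
$utf32 = iconv('UTF-8', 'UTF-32BE', $utf8);

$chars = array();
foreach (str_split($utf32, 4) as $fourBytes) {
    // Convert each four byte chunk back to UTF-8 for output.
    $chars[] = iconv('UTF-32BE', 'UTF-8', $fourBytes);
}

foreach ($chars as $char) {
    // Prints each character together with its UTF-8 bytes.
    echo $char, ' => ', bin2hex($char), "\n";
}
```

The loop runs three times, once per character, although the original string
consists of six bytes.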
An example using iconv for conversion for the first method could look like::

As you can see from the output we iterated characterwise over the string and
shouldn't have any loss of information, as both encodings are defined for the
same charset.

See also:

- `Which charset/encoding do strings have in PHP?`_
- `How do i iterate bytewise over a string?`_
- `How do I change the encoding of a string?`_
- `How do I determine the charset/encoding of a string?`_
- `What is the difference between a charset and an encoding?`_

__ http://docs.php.net/manual/en/function.mb-substr.php

How do i iterate bytewise over a string?
----------------------------------------

PHP 5
^^^^^

That's easy. As strings do not have any associated charset or encoding in PHP,
but are only arrays of bytes, the usual way of iterating over a string is the
correct way, as this example shows::

Which echoes 14 hexadecimal numbers, including the three numbers for the last
character, which is a UTF-8 multibyte character.

See also:

- `How do I iterate characterwise over a string?`_
- `What does "multibyte charset/encoding" mean?`_

How do I determine the length of a string?
------------------------------------------

PHP 5
^^^^^

Since PHP does not have any charset or encoding associated with strings, you
can only easily check for the number of bytes a string consists of. This is
what strlen() does for you.

If you know that the encoding used for the string is a singlebyte encoding,
like ISO 8859-\*, you can also just use strlen(), as each byte maps to exactly
one character.

For multibyte encodings this approach does not work well. Here basically the
same applies as for iterating over the characters of a string:

1) You might want to convert to (or use only) encodings with a fixed number of
   bytes per character in your application. Then you can just divide the value
   returned by strlen() by the mapping factor - in case of UTF-32 this would
   be 4, for example.
2) In the charset handling extensions there are special functions which help
   you determine the number of characters in a string depending on the
   encoding. These are:

   - int `iconv_strlen`__ ( string str [, string charset] )
   - int `mb_strlen`__ ( string str [, string encoding] )

See also:

- `Which charset/encoding do strings have in PHP?`_
- `How do I iterate characterwise over a string?`_
- `How do i iterate bytewise over a string?`_
- `How do I change the encoding of a string?`_
- `How do I determine the charset/encoding of a string?`_

__ http://docs.php.net/manual/en/function.iconv-strlen.php
__ http://docs.php.net/manual/en/function.mb-strlen.php

Why shouln't I use htmlentities?
--------------------------------

PHP 5
^^^^^

Even though you ensure the correct encoding throughout your complete
application, you still end up with strange characters on your website? This
normally means that the function used for escaping your output was not aware
of the provided charset.

In case of XML / (X)HTML this probably means that you ignored the third
parameter of the `htmlspecialchars()`__ function, which allows you to specify
the encoding of the string. This is especially important for multibyte
encodings like UTF-8, as you can see in the following example::

For UTF-8 this should not happen with plain htmlspecialchars() - because the
characters which are escaped by htmlspecialchars() are not part of multibyte
characters in UTF-8 - but it may happen with other multibyte encodings.

When the characters are available in the character set / encoding reported to
the client, you don't actually need to convert special characters to entities.

Summary:

1) Use htmlspecialchars(), because it is sufficient if you declare the charset
   to the client.

2) Specify the encoding you use for htmlspecialchars(), otherwise it might
   mess up your characters.
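The two points of the summary can be illustrated with a short sketch. The
sample string is an assumption made for this example and is written as
explicit UTF-8 bytes, so the result does not depend on how this file is saved:

```php
<?php
// Sample input as explicit UTF-8 bytes: <b>ä & ☯</b>
$text = "<b>\xC3\xA4 & \xE2\x98\xAF</b>";

// Only the markup characters are escaped. The characters ä and ☯ pass
// through unchanged, because the encoding is stated explicitly.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() additionally replaces characters which have a named
// entity, which is not needed once the client knows the charset.
$entity = htmlentities("\xC3\xA4", ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // &lt;b&gt;ä &amp; ☯&lt;/b&gt;
echo $entity, "\n";  // &auml;
```

Passing the encoding as the third argument keeps the behaviour independent of
the default, which has changed between PHP versions.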
See also:

- `What does "multibyte charset/encoding" mean?`_
- `Why do I have such strange characters on my website?`_
- `How do I send the correct charset/encoding for $client?`_
- `How do I ensure the correct overall charset/encoding in my web application?`_

__ http://docs.php.net/manual/en/function.htmlspecialchars.php

Which encoding should I use for my source files?
------------------------------------------------

PHP 5
^^^^^

Each ASCII__ compatible encoding will do the job. All language constructs in
PHP only use the characters defined in ASCII, which are contained in, and
mapped to the same bytes by, all ISO 8859-* charsets/encodings and UTF-8.

For your own (variable, class & function) names you may also use bytes with a
value greater than 0x7f, which means that you can embed special characters,
and even use the full Unicode charset with UTF-8 for your function names.
Since PHP function names are handled binary-safe, this means that you need to
ensure all your files are edited with the same encoding configured in your
editor.

You can specify the encoding of your PHP files explicitly, but this has
different issues in different versions of PHP. See the manual__ for details.

__ http://en.wikipedia.org/wiki/ASCII
__ http://php.net/manual/en/control-structures.declare.php#control-structures.declare.encoding

Databases
=========

How to ensure using the right encoding in my MySQL database?
------------------------------------------------------------

For MySQL there is only one thing relevant to maintain the correct encoding of
your content:

The client / connection encoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can set this globally in your client configuration file, which should
reside somewhere like ``/etc/mysql/my.cnf``. There you should add the
following lines::

    [mysql]
    default-character-set=utf8

Where "utf8" is your desired encoding.
You can also set the connection charset only for one connection, by sending
the following query before you send data to the database server::

    SET NAMES utf8

Database / table encoding
^^^^^^^^^^^^^^^^^^^^^^^^^

The database and table encoding only defines how the data is actually stored
in MySQL and does not relate to the encoding returned to you when querying the
data. Everything should be fine if you specified `The client / connection
encoding`_ correctly and the table and database encoding can handle the same
charset as your application uses.

For columns / tables where you don't need the full Unicode charset you should
use a minimal charset to keep the amount of used memory and the key sizes low.

See also:

- `What is the difference between a charset and an encoding?`_
- `Which charset does $client send?`_
- `MySQL 5.0 Reference Manual :: 9.1 Character Set Support`__

__ http://dev.mysql.com/doc/refman/5.0/en/charset.html

How to ensure using the right encoding in my PostgreSQL database?
-----------------------------------------------------------------

Author: Malte Schirmacher

You can set this globally in your server configuration file, which should
reside somewhere like /etc/postgresql//main/postgresql.conf. There you can add
the following line::

    client_encoding = utf8

If this option is not set, the encoding defaults to the database encoding.

You can also specify the client encoding per connection using one of the
following SQL commands::

    SET CLIENT_ENCODING TO 'UTF-8';
    SET NAMES 'UTF-8';

Or, when you use psql, with the command::

    \encoding UTF-8;

While the following command resets the client encoding to the default
encoding::

    RESET client_encoding;

Since PostgreSQL stores all string types ((VAR)CHAR, TEXT) characterwise (and
not bytewise) according to the client encoding set (and that is independent of
the database encoding!), you can store anything losslessly if you always set
the correct string encoding consistently.
A nice feature of PostgreSQL is that it will convert from one encoding to
another if you set a different client encoding when receiving a string from
the database than you did when storing it. Although not all encoding
combinations are available for transcoding, it can transcode any encoding to
utf8, so you are able to receive only utf8 encoded data from the database,
regardless of the encoding used to save the string data.

For more information on which encoding combinations are available for
transcoding, and on how to manage the charset handling in general, see the
following links:

- `PostgreSQL 8.3.3 Documentation - 22.2. Character Set Support`__

__ http://www.postgresql.org/docs/current/static/multibyte.html

Trackbacks
==========

- Extracting data from HTML on Sun, 24 May 2009 12:55:58 +0200 in Kore
  Nordmann - PHP / Projects / Politics

  A lot of people try to scrape content from HTML - the first approach always
  seems to be regular expressions, which are incapable of parsing HTML - which
  I proved earlier, already. So, how to do it properly with PHP?

- Published PHP charset/encoding FAQ on Sun, 24 May 2009 12:57:44 +0200 in
  Kore Nordmann - PHP / Projects / Politics

  After lots of questions recently on IRC about charsets and encodings, I
  decided to write up a FAQ about this. The FAQ can now be found in the
  article section of my website. If you know a better location for this, got
  additional questions and / or answers feel free to send me a mail about it.

- Character sets vs. encodings on Fri, 11 Dec 2009 10:37:23 +0100 in Kore
  Nordmann - PHP / Projects / Politics

  Today's PHP Advent article iterates the topic of character sets and
  encodings once more. While I welcome this in general, because each developer
  should have at least basic knowledge about this, I would like to clarify
  some points.
Comments
========

- Balázs Bárány at Tue, 03 Jun 2008 10:49:13 +0200

  To set the encoding in PostgreSQL similarly, you can do the following:

  - Create your database using WITH ENCODING 'yourencoding'::

      create database enctest with encoding='utf-8';

  - You can display or set the encoding in your psql prompt with
    ``\encoding`` or "set client_encoding to 'encoding'"::

      template1=# \encoding
      UTF8
      template1=# set client_encoding to 'iso-8859-15';
      SET
      template1=# \encoding
      LATIN9
      # This is normal; LATIN9 is PostgreSQL's name for "West European with Euro".
      template1=# set client_encoding to 'iso-8859-15';
      SET
      template1=# \encoding
      LATIN9

    You can use the "set client_encoding" form in your scripts to choose an
    encoding. (PostgreSQL can convert between encodings, so you can have a
    future-proof UTF-8 database but for the moment work with PHP scripts and
    webpages in ISO-8859-15 or other single-byte character sets.)

  - You can also set the default client encoding for a database::

      ALTER DATABASE enctest SET client_encoding='iso-8859-15';

- linuxamp at Thu, 02 Oct 2008 05:33:42 +0200

  Interesting writeup. I have always been using the terms without really
  understanding the differences. Thanks

- Sajal at Thu, 15 Jan 2009 09:00:54 +0100

  Awesome !!!!

- Nicolas Grekas at Sat, 14 Feb 2009 10:45:30 +0100

  About "accept-charset", have you tried the reverse approach ? As UTF-8 is
  not handled like any other charset by browsers, it may be interesting : what
  I mean is :

  - build an UTF-8 page, with accept-charset="UTF-8"
  - then manually change the page encoding in the browser GUI (menu "Display"
    > "Encoding" in Firefox)
  - then fill form fields in the page
  - what is the encoding received on the server ?

- Martin at Mon, 23 Mar 2009 16:25:00 +0100

  Thank you Kore, great compendium!

- Leon at Mon, 31 Aug 2009 21:40:31 +0200

  Premium quality content!! Vielen Dank! Much kudos 2U from Bavaria!
- Alfonso at Mon, 09 Nov 2009 14:30:32 +0100

  Always been confused on this matter reading other tutorials, but yours is
  very very clear! Thank you!

- Thijs Feryn at Thu, 28 Jan 2010 12:40:13 +0100

  Hi Kore

  Thanks for this lovely FAQ document. I've run into tons of charset issues
  over the years and more recently I managed to store UTF8 & ISO data in one
  MySQL table (by accident I must admit). Fixing this caused me such a
  headache that I written a blog article about it. It deals with a lot of
  stuff you have listed nicely in this FAQ. Check it out:
  http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/

  I will surely talk to you about this topic during the speakers dinner at
  phpbnl10 tomorrow night.

  Much respect

  Thijs Feryn