PHP Charset FAQ

First published on Friday, 30 May 2008

Warning: This blog post is more than 16 years old; read and use with care.

If the FAQ was helpful to you, you can order me a thank you here: http://wishlist.kore-nordmann.de/

General

What is the difference between Unicode and UTF-8/UTF-16/...?

Unicode is a charset, which means it is just a set of characters. It says nothing about how the characters are actually stored (mapped to bytes).

UTF-8 / UTF-16 / ... are encodings, which define how a character is mapped to bytes in a string or byte array.

UTF-8, UTF-16 and UTF-32 differ mainly in the number of bytes used to encode a character. UTF-8 uses one byte for the characters defined in ASCII and two to four bytes for all other characters. UTF-16 uses two bytes for most common characters and four bytes for the rest. UTF-32 always uses four bytes per character, which makes iterating over the characters in a string trivial, but consumes much more space for common strings. Choosing the right default encoding for your application is not trivial; it depends on which strings are common in your application and how they are used.
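
To make the size difference concrete, here is a minimal sketch that counts the bytes a few characters need in each encoding. It assumes the mbstring extension is loaded and that the file is saved as UTF-8; encodedLengths() is a hypothetical helper name, not something from PHP itself.

```php
<?php
// Count the bytes one character needs in UTF-8, UTF-16 and UTF-32.
// Assumption: mbstring is available and the input string is UTF-8.
function encodedLengths(string $utf8Char): array
{
    return [
        'UTF-8'  => strlen($utf8Char),
        'UTF-16' => strlen(mb_convert_encoding($utf8Char, 'UTF-16BE', 'UTF-8')),
        'UTF-32' => strlen(mb_convert_encoding($utf8Char, 'UTF-32BE', 'UTF-8')),
    ];
}

// An ASCII letter, a Latin umlaut, a symbol and an emoji.
foreach (['a', 'ä', '☯', '😀'] as $char) {
    $lengths = encodedLengths($char);
    printf(
        "%s: UTF-8=%d UTF-16=%d UTF-32=%d\n",
        $char,
        $lengths['UTF-8'],
        $lengths['UTF-16'],
        $lengths['UTF-32']
    );
}
```

The ASCII letter is the cheapest in UTF-8, while the symbol is smaller in UTF-16 than in UTF-8, which is why the best choice depends on the text you expect.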

What is the difference between a charset and an encoding?

A charset is a set of characters which can be represented in a certain encoding. The encoding actually defines which bytes are used for a certain character.

An example: The character '☯' is available in the Unicode charset, and probably in other charsets, too. There are different encodings for this character, though, even within the same charset:

Unicode character: ☯
UTF-8 encoded: 0xE2 0x98 0xAF
UTF-16 encoded: 0x26 0x2F
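
The byte values above can be reproduced with a short sketch like the following. It assumes the mbstring extension is loaded; hexBytes() is a hypothetical helper written only for this illustration, and UTF-16BE (big endian, no byte order mark) is chosen to match the byte order shown above.

```php
<?php
// Format raw bytes in the "0xE2 0x98 0xAF" notation used in this FAQ.
function hexBytes(string $bytes): string
{
    $pairs = str_split(strtoupper(bin2hex($bytes)), 2);

    return implode(' ', array_map(fn ($pair) => '0x' . $pair, $pairs));
}

$yinYang = "\u{262F}"; // the character ☯, stored as UTF-8 by PHP

echo hexBytes($yinYang), "\n";
// prints: 0xE2 0x98 0xAF

echo hexBytes(mb_convert_encoding($yinYang, 'UTF-16BE', 'UTF-8')), "\n";
// prints: 0x26 0x2F
```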

What is the difference between a character and a byte?

Generally a byte, or a sequence of bytes, is just the internal representation of a character in a given encoding. The encoding maps characters to bytes or byte sequences.

In single-byte encodings each character in the charset maps to exactly one byte, so bytes and characters are easily confused, because they always correspond one to one.

Multibyte encodings like UTF-8 or UTF-16, on the other hand, are more common nowadays. They map a character to multiple bytes, so that the representations of different characters can share some of the same bytes:

ⅱ, UTF-8 encoded: 0xE2 0x85 0xB1
ⅲ, UTF-8 encoded: 0xE2 0x85 0xB2

As you can see, the first two bytes used for both characters are the same, while only the third byte differs.
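
In PHP this distinction shows up as the difference between the byte-oriented string functions and their mbstring counterparts. The following is a small sketch, assuming the mbstring extension is loaded and the string is UTF-8:

```php
<?php
$text = "\u{2171}\u{2172}"; // "ⅱⅲ", three bytes per character in UTF-8

// strlen() and substr() work on bytes.
$byteCount = strlen($text);       // 6
$firstByte = substr($text, 0, 1); // "\xE2", only a fragment of a character

// mb_strlen() and mb_substr() work on characters in the given encoding.
$charCount = mb_strlen($text, 'UTF-8');       // 2
$firstChar = mb_substr($text, 0, 1, 'UTF-8'); // "ⅱ"

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("first byte: 0x%s, first character: %s\n", strtoupper(bin2hex($firstByte)), $firstChar);
```

Cutting a string with the byte-oriented functions can split a character in the middle, which leaves an invalid byte sequence behind.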

How do I determine the charset/encoding of a string?

There is no reliable way to do this.

For example, practically every byte sequence is a valid string in each of the ISO 8859-* encodings, so there is no way to tell from the bytes alone which encoding was used. If you know something about the contents of a string you may guess the encoding, for example by checking for multiple occurrences of uncommon characters you expect to find.

UTF-8 multibyte sequences do have some characteristics you can check for, but each UTF-8 string may also be read as an ISO 8859-* string. To check whether a string is a valid sequence of UTF-8 encoded characters you could use the following regular expression. This won't actually tell you that the string is UTF-8, though; it might still be in nearly any other encoding:

(^(?:
    [\x00-\x7f] |
    [\xc0-\xdf][\x80-\xbf] |
    [\xe0-\xef][\x80-\xbf]{2} |
    [\xf0-\xf7][\x80-\xbf]{3}
)*$)x

Don't use it on big strings though, it may crash PCRE.
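
As an alternative to the regular expression, the mbstring function mb_check_encoding() can do the validity check. This is a sketch that assumes the mbstring extension is loaded; isValidUtf8() is a hypothetical wrapper name. It also shows why a successful check proves little: the same bytes pass as ISO 8859-1, too.

```php
<?php
// Returns true if the bytes form a valid UTF-8 sequence.
// This does NOT prove that the string was meant to be UTF-8.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8   = "\xE2\x98\xAF"; // the character ☯ encoded as UTF-8
$latin1 = "\xE4";         // the character ä in ISO 8859-1, an incomplete sequence in UTF-8

var_dump(isValidUtf8($utf8));   // bool(true)
var_dump(isValidUtf8($latin1)); // bool(false)

// The UTF-8 bytes are also acceptable as ISO 8859-1; they just mean
// three different characters there.
var_dump(mb_check_encoding($utf8, 'ISO-8859-1')); // bool(true)
```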

What does "multibyte charset/encoding" mean?

A multibyte encoding can use more than one byte for a single character. The number of bytes used for one character may vary, like in UTF-8 and UTF-16, or be fixed, like in UTF-32.

In a single-byte encoding only 256 different characters can be represented, as this is the number of different values one byte can have (2^8). This number of characters is not sufficient for lots of languages, and especially not when you try to fit the characters of multiple languages into one charset and encoding (Unicode and UTF-*).

What does string transliteration mean?

When converting between different charsets it may happen that not all characters of the source string are available in the destination charset. In this case transliteration aims to provide another character or sequence of characters which sufficiently replaces the source character in the destination charset. Common transliterations for the German umlaut ä may be:

ä => ae
ä => a

Transliteration in PHP

In PHP you can transliterate strings using different functions.

  1. The iconv() function supports very basic transliteration depending on the installed locales. Characters that are missing from the destination charset are transliterated when you append the string //TRANSLIT to the destination encoding, as shown in the conversion example: How do I change the encoding of a string?

  2. The extension pecl/translit offers transliterations between several charsets, not depending on installed locales on your system. Check out its documentation for details.
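
A minimal sketch of the iconv() variant could look like the following. The helper name toAscii() and its fallback behaviour are assumptions made for this example. What //TRANSLIT produces for a given character depends on the iconv implementation and the active locale, so no expected output is given.

```php
<?php
// Convert a UTF-8 string to plain ASCII, transliterating where possible.
// toAscii() is a hypothetical helper, not part of PHP.
function toAscii(string $utf8): string
{
    // Some iconv implementations do not support //TRANSLIT and fail;
    // the @ hides the resulting notice so we can fall back below.
    $converted = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8)
        : false;

    if ($converted === false) {
        // Fallback (an assumption for this sketch): replace each run of
        // non-ASCII bytes with a question mark.
        $converted = (string) preg_replace('/[^\x00-\x7F]+/', '?', $utf8);
    }

    return $converted;
}

// The locale only influences the result if it is installed on the system.
setlocale(LC_CTYPE, 'de_DE.UTF-8');

echo toAscii("K\u{00E4}se"), "\n"; // "Käse"; may become "Kaese", "Kase" or "K?se"
```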

Databases

How to ensure using the right encoding in my MySQL database?

For MySQL only one thing is relevant for maintaining the correct encoding of your content:

The client / connection encoding

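You can set this globally in your client configuration file, which should reside somewhere like /etc/mysql/my.cnf. There you should add the following lines:

[mysql]
default-character-set=utf8

Where "utf8" is your desired encoding. Note that MySQL's "utf8" stores at most three bytes per character; if you need the full Unicode range, use "utf8mb4" instead.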

You can also set the connection charset only for one connection, by sending the following query before you send data to the database server:

SET NAMES utf8

Database / table encoding

The database and table encoding only defines how the data is actually stored in MySQL. It does not relate to the encoding returned to you when querying the data. Everything should be fine if you specified the client / connection encoding correctly and the table and database encoding can represent the charset your application uses.

For columns / tables where you don't need the full Unicode charset, you should use a minimal charset to keep memory usage and key sizes low.
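
From PHP, the connection encoding can also be requested when the connection is opened. The following sketch uses PDO, which the text above does not mention; the host, database name and credentials are placeholders, and buildMysqlDsn() and connectMysql() are hypothetical helper names. With the mysqli extension, mysqli::set_charset() serves the same purpose.

```php
<?php
// Build a PDO DSN that requests a connection encoding.
// All parameter values used below are placeholders for this sketch.
function buildMysqlDsn(string $host, string $database, string $charset = 'utf8mb4'): string
{
    // The "charset" DSN option sets the connection encoding, similar to
    // sending SET NAMES, and tells the client library which encoding is used.
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Open the connection; errors are reported as exceptions.
function connectMysql(string $dsn, string $user, string $password): PDO
{
    return new PDO($dsn, $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

$dsn = buildMysqlDsn('localhost', 'example_db');
echo $dsn, "\n";
// prints: mysql:host=localhost;dbname=example_db;charset=utf8mb4

// Needs a running MySQL server, so it is left commented out here:
// $pdo = connectMysql($dsn, 'example_user', 'secret');
```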

How to ensure using the right encoding in my PostgreSQL database?

Author
Malte Schirmacher

You can set the client encoding globally in your server configuration file, which should reside somewhere like /etc/postgresql/<pg-version>/main/postgresql.conf.

There you can add the following line:

client_encoding = utf8

If this option is not set, the client encoding defaults to the database encoding.

You can also specify the client encoding per connection using one of the following SQL commands:

SET CLIENT_ENCODING TO 'UTF-8';
SET NAMES 'UTF-8';

Or, when you use psql, with the command:

\encoding UTF-8

While the following command resets the client encoding to the default encoding:

RESET client_encoding;

Since PostgreSQL handles all string types ((VAR)CHAR, TEXT) character-wise (and not byte-wise), interpreting them according to the client encoding you set (and that is independent of the database encoding!), you can store anything losslessly as long as you always set the correct client encoding consistently.

A nice feature of PostgreSQL is that it converts between encodings if the client encoding you set when reading a string from the database differs from the one you used when storing it. Although not all combinations of encodings are available for transcoding, it can transcode any encoding to utf8, so you can receive only utf8 encoded data from the database regardless of the encoding used to save the string data.
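
The per-connection variant could be wrapped in PHP roughly like this. It is only a sketch: PDO is an assumption (the text above does not say which extension to use), the connection details are placeholders, and clientEncodingSql() and setClientEncoding() are hypothetical helper names.

```php
<?php
// Build the SQL statement that sets the client encoding.
function clientEncodingSql(string $encoding): string
{
    // The name is inserted into the statement as text, so only accept
    // characters that can occur in an encoding name.
    if (preg_match('/\A[A-Za-z0-9_-]+\z/', $encoding) !== 1) {
        throw new InvalidArgumentException('Unexpected encoding name: ' . $encoding);
    }

    return "SET client_encoding TO '" . $encoding . "'";
}

// Send the statement on an open connection.
function setClientEncoding(PDO $connection, string $encoding): void
{
    $connection->exec(clientEncodingSql($encoding));
}

echo clientEncodingSql('UTF8'), "\n";
// prints: SET client_encoding TO 'UTF8'

// Needs a running PostgreSQL server, so it is left commented out here:
// $pdo = new PDO('pgsql:host=localhost;dbname=example_db', 'example_user', 'secret');
// setClientEncoding($pdo, 'UTF8');
```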


Comments

Balázs Bárány at Tuesday, 3.6. 2008

To set the encoding in PostgreSQL similarly, you can do the following:

  • Create your database using WITH ENCODING 'yourencoding':

create database enctest with encoding='utf-8';

  • You can display or set the encoding in your psql prompt with \encoding or "set client_encoding to 'encoding'"

template1=# \encoding
UTF8
template1=# set client_encoding to 'iso-8859-15';
SET
template1=# \encoding
LATIN9
# This is normal; LATIN9 is PostgreSQL's name for "West European with Euro".
template1=# set client_encoding to 'iso-8859-15';
SET
template1=# \encoding
LATIN9

You can use the "set client_encoding" form in your scripts to choose an encoding. (PostgreSQL can convert between encodings, so you can have a future-proof UTF-8 database but for the moment work with PHP scripts and webpages in ISO-8859-15 or other single-byte character sets.)

  • You can also set the default client encoding for a database:

    ALTER DATABASE enctest SET client_encoding='iso-8859-15';

linuxamp at Thursday, 2.10. 2008

Interesting writeup. I have always been using the terms without really understanding the differences.

Thanks

Sajal at Thursday, 15.1. 2009

Awesome !!!!

Nicolas Grekas at Saturday, 14.2. 2009

About "accept-charset", have you tried the reverse approach ? As UTF-8 is not handled like any other charset by browsers, it may be interesting :

what I mean is :

  • build an UTF-8 page, with accept-charset="UTF-8"

  • then manually change the page encoding in the browser GUI (menu "Display" > "Encoding" in Firefox)

  • then fill form fields in the page

  • what is the encoding received on the server ?

Martin at Monday, 23.3. 2009

Thank you Kore, great compendium!

Leon at Monday, 31.8. 2009

Premium quality content!! Vielen Dank! Much kudos 2U from Bavaria!

Alfonso at Monday, 9.11. 2009

Always been confused on this matter reading other tutorials, but yours is very very clear! Thank you!

Thijs Feryn at Thursday, 28.1. 2010

Hi Kore

Thanks for this lovely FAQ document. I've run into tons of charset issues over the years and more recently I managed to store UTF8 & ISO data in one MySQL table (by accident I must admit).

Fixing this caused me such a headache that I wrote a blog article about it. It deals with a lot of stuff you have listed nicely in this FAQ.

Check it out: http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/

I will surely talk to you about this topic during the speakers dinner at phpbnl10 tomorrow night.

Much respect Thijs Feryn

ivan at Sunday, 28.3. 2010

the best article about PHP over the internet Thanks a lot

spaze at Sunday, 6.6. 2010

Preferred way of setting the connection character set is using the mysql_set_charset() (mysqli_set_charset etc.) function http://php.net/mysql_set_charset. This way the escaping functions know the used character set and are escaping properly even the characters which would normally allow for SQL injection in some character sets like GBK.

In other words, SET NAMES does not set the character set for mysql_real_escape_string() and so not all required characters are escaped making a space for SQLi. Ilia has proof of concept over there http://ilia.ws/archives/103-mysql_real_escape_string-versus-Prepared-Statements.html

shaffy at Tuesday, 10.8. 2010

Awesome

Iain Cambridge at Wednesday, 10.11. 2010

Nice, shame your flattr isn't working otherwise I would have clicked :(

Georgi at Thursday, 16.8. 2012

You have no idea how much I love you right now for mentioning htmlentities. Bookmarked for future reference.

David Spector at Monday, 29.2. 2016

I think the most efficient way to process each character in a UTF-8 (or similarly encoded) string would be to work through the string using mb_substr. In each iteration of the processing loop, mb_substr would be called twice (to find the next character and the remaining string). It would pass only the remaining string to the next iteration. This way, the main overhead in each iteration would be finding the next character (done twice), which takes only one to five or so operations, depending on the byte length of the character.

If this description is not clear, let me know and I'll provide a working PHP function.
