PHP Charset FAQ

First published on Friday, 30 May 2008

Warning: This blog post is more than 17 years old - read and use with care.

List of questions

General
HTTP related
PHP related
Databases
- How to ensure using the right encoding in my MySQL database?
- How to ensure using the right encoding in my PostgreSQL database?
Comments

If the FAQ was helpful to you, you can order me a thank you here: http://wishlist.kore-nordmann.de/

General

What is the difference between unicode and UTF-8/UTF-16/...?

Unicode is a charset, which means just a set of characters. It says nothing about how the characters are actually stored (mapped to bytes).

UTF-8 / UTF-16 / ... are encodings which define how a character is mapped to bytes in a string or byte array.

UTF-8, UTF-16 and UTF-32 basically differ in the number of bytes used to encode a character. UTF-8 uses one byte for the characters defined in ASCII, and a variable width of two to four bytes for other characters. UTF-16 uses two bytes for most common characters and four bytes for the rest. UTF-32 always uses four bytes for each character, which makes iterating over the characters in a string trivial, but consumes much more space for common strings. Choosing the correct default encoding for your application is not trivial and depends on the strings your application typically handles and how it uses them.
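
To make the size difference concrete, here is a small sketch which counts the bytes a character needs in each encoding. It assumes that the mbstring extension is loaded and that the source file is saved as UTF-8; the function name bytesPerEncoding() is made up for this example.

```php
<?php
// Hypothetical helper: bytes needed for $char (given as UTF-8) per encoding.
function bytesPerEncoding( $char )
{
    $sizes = array();
    foreach ( array( 'UTF-8', 'UTF-16BE', 'UTF-32BE' ) as $encoding )
    {
        // The explicit byte order (BE) avoids a byte order mark in the result.
        $sizes[$encoding] = strlen( mb_convert_encoding( $char, $encoding, 'UTF-8' ) );
    }
    return $sizes;
}

foreach ( array( 'a', 'ä', '☯' ) as $char )
{
    echo $char, ': ', json_encode( bytesPerEncoding( $char ) ), "\n";
}

// Output:
// a: {"UTF-8":1,"UTF-16BE":2,"UTF-32BE":4}
// ä: {"UTF-8":2,"UTF-16BE":2,"UTF-32BE":4}
// ☯: {"UTF-8":3,"UTF-16BE":2,"UTF-32BE":4}
```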

See also:

What is the difference between a charset and an encoding?

What is the difference between a charset and an encoding?

A charset is a set of characters which can be represented in a certain encoding. The encoding actually defines which bytes are used for a certain character.

An example: The character '☯' is available in the Unicode charset, and probably other charsets, too. But there are different encodings for this character, even for the same charset, like the following:

Unicode character: ☯

UTF-8 encoded: 0xE2 0x98 0xAF
UTF-16 encoded (big endian): 0x26 0x2F
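
If you want to reproduce such byte listings yourself, a sketch like the following may help. It assumes the mbstring extension and a source file saved as UTF-8; hexBytes() is a hypothetical helper, not a PHP function. The UTF-16LE line shows that the byte order is part of the encoding, too.

```php
<?php
// Hypothetical helper: format the bytes of a string like "0xE2 0x98 0xAF".
function hexBytes( $bytes )
{
    $hex = array();
    foreach ( str_split( $bytes ) as $byte )
    {
        $hex[] = sprintf( '0x%02X', ord( $byte ) );
    }
    return implode( ' ', $hex );
}

$char = '☯'; // U+262F, stored as UTF-8 in this source file

echo 'UTF-8:    ', hexBytes( $char ), "\n";
echo 'UTF-16BE: ', hexBytes( mb_convert_encoding( $char, 'UTF-16BE', 'UTF-8' ) ), "\n";
echo 'UTF-16LE: ', hexBytes( mb_convert_encoding( $char, 'UTF-16LE', 'UTF-8' ) ), "\n";

// Output:
// UTF-8:    0xE2 0x98 0xAF
// UTF-16BE: 0x26 0x2F
// UTF-16LE: 0x2F 0x26
```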

What is the difference between a character and a byte?

Generally a byte, or a sequence of bytes, is just an internal representation of a character depending on the used encoding. The encoding maps characters to bytes or byte sequences.

In singlebyte encodings each character in the charset maps to exactly one byte, so bytes and characters can be used interchangeably, because one byte always represents exactly one character.

On the other hand multibyte encodings like UTF-8 or UTF-16 are nowadays more common and map characters to multiple bytes, so that different character representations contain the same bytes:

ⅱ, UTF-8 encoded: 0xE2 0x85 0xB1
ⅲ, UTF-8 encoded: 0xE2 0x85 0xB2

As you can see, the first two bytes used for both characters are the same, while only the third byte differs.

See also:

What is the difference between a charset and an encoding?
What does "multibyte charset/encoding" mean?

How do I determine the charset/encoding of a string?

There is no way to do this right.

For example all ISO 8859-* encodings accept all combinations of bytes, so there is no way to tell which of them was used. You may guess the encoding if you know something about the contents of a string, for example by checking for multiple expected occurrences of some uncommon characters.

UTF-8 multibyte character sequences do have some characteristics you may check for, but each UTF-8 string may also be an ISO 8859-* string. To check if a string is a valid sequence of UTF-8 encoded characters you could use the following regular expression. This won't actually tell you whether the string is UTF-8, though - it still might be in nearly any other encoding:

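(^(?:
    [\x00-\x7f] |                          # ASCII
    [\xc2-\xdf][\x80-\xbf] |               # two bytes, no overlong forms
    \xe0[\xa0-\xbf][\x80-\xbf] |           # three bytes, no overlong forms
    [\xe1-\xec\xee\xef][\x80-\xbf]{2} |    # three bytes
    \xed[\x80-\x9f][\x80-\xbf] |           # three bytes, no surrogates
    \xf0[\x90-\xbf][\x80-\xbf]{2} |        # four bytes, no overlong forms
    [\xf1-\xf3][\x80-\xbf]{3} |            # four bytes
    \xf4[\x80-\x8f][\x80-\xbf]{2}          # four bytes, up to U+10FFFF
)*$)x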

Don't use it on big strings though, it may crash PCRE.
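
As an alternative to a regular expression, the mbstring extension offers mb_check_encoding(). The following sketch assumes that mbstring is loaded; the wrapper isValidUtf8() is a made-up name. As said above, a positive result only tells you that the bytes could be UTF-8, not that they are.

```php
<?php
// Hypothetical helper: true if $string is a well-formed UTF-8 byte sequence.
function isValidUtf8( $string )
{
    return mb_check_encoding( $string, 'UTF-8' );
}

// Ends with a valid three byte sequence.
var_dump( isValidUtf8( "Some string \xE2\x80\xBD" ) ); // bool(true)

// "ä" in ISO 8859-1 is a single high byte, which is invalid in UTF-8.
var_dump( isValidUtf8( "\xE4" ) );                     // bool(false)

// Valid UTF-8 ("ä"), but could just as well be two ISO 8859-1 characters.
var_dump( isValidUtf8( "\xC3\xA4" ) );                 // bool(true)
```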

What does "multibyte charset/encoding" mean?

A multibyte encoding uses not only one but multiple bytes for one character. The number of bytes used for one character may be variable, like in UTF-8 and UTF-16, or fixed like in UTF-32.

In a singlebyte encoding only 256 different characters can be represented, as this is the number of different values one byte can have (2^8). This number of characters is not sufficient for lots of languages, and especially not when you try to fit characters of multiple languages into one charset and encoding (Unicode and UTF-*).

See also:

What is the difference between unicode and UTF-8/UTF-16/...?

What does string transliteration mean?

When converting between different charsets it may happen that not all characters of the source string are available in the destination charset. In this case transliteration aims to provide another character or sequence of characters which sufficiently replaces the source character in the destination charset. Common transliterations for the German umlaut ä may be:

ä => ae
ä => a

Transliteration in PHP

In PHP you can transliterate strings using different functions.

The iconv() function supports very basic transliteration depending on the installed locales. Unknown characters are transliterated when you append the string //TRANSLIT to the destination encoding, as shown in the conversion example: How do I change the encoding of a string?.
The extension pecl/translit offers transliterations between several charsets, not depending on installed locales on your system. Check out its documentation for details.
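
If you only need a handful of known replacements and do not want to depend on locales or extensions, a plain replacement table may be enough. This is a minimal sketch, not a complete transliteration; it assumes UTF-8 encoded input and source file, and the function name and the table contents are made up for this example.

```php
<?php
// Hypothetical helper: replace German umlauts by common ASCII transliterations.
function transliterateGerman( $string )
{
    $map = array(
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'ß' => 'ss',
    );

    // strtr() works bytewise, which is safe here because in UTF-8 the byte
    // sequence of one character never occurs inside another character.
    return strtr( $string, $map );
}

echo transliterateGerman( 'Größe, Übermut' ), "\n";
// Output: Groesse, Uebermut
```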

HTTP related

Why do I have such strange characters on my website?

The content you send is encoded in a different encoding than specified for the client, or than the client detects.

When speaking of websites we talk about browsers in most cases, which determine the encoding of a website based on two factors:

The Content-Type header sent by the webserver
The content-type meta tag in the (X)HTML header

To ensure that the browser detects the correct charset and encoding of your website you should set both to the same value. The content type header sent by your website could either be configured in the web server configuration, or sent explicitly by PHP, using something like:

header( 'Content-type: text/html; charset=utf-8' );
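
A minimal sketch of keeping both values in sync could look like the following. The variable names and the page content are made up for this example, and it assumes that the page itself is really encoded in the declared charset.

```php
<?php
// Assumption: the whole application uses this one encoding.
$charset = 'utf-8';

// The HTTP header must be sent before any output.
if ( !headers_sent() )
{
    header( 'Content-Type: text/html; charset=' . $charset );
}

// Matching meta tag for the (X)HTML head, built from the same value.
$metaTag = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '" />';

echo "<html>\n<head>\n    ", $metaTag, "\n    <title>Charset test</title>\n</head>\n";
echo "<body>öäü ☯</body>\n</html>\n";
```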

See also:

How do I send the correct charset/encoding for $client?
How do I determine the charset/encoding of a string?
Why shouldn't I use htmlentities?

Which charset does $client send?

In most cases the client sends the encoding specified for the website. This works only if the browser could determine the encoding unambiguously.

If the browser can't detect the specified charset, you cannot really know the charset of the input strings. This is one reason you should respect the HTTP header "Accept-Charset", even though most browsers nowadays know about UTF-8.

In every case there may be misbehaving browsers, or clients which just try to feed your application with invalid data. That's why you should try to gracefully handle strings which do not match your expected encoding.

How do I send the correct charset/encoding for $client?

Most HTTP clients send a header with a list of encodings/charsets they can understand and prefer. The header is called Accept-Charset and is available in PHP in $_SERVER['HTTP_ACCEPT_CHARSET']. Most clients actually send lists of encodings they understand instead of lists of charsets. A typical header may look like:

utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5

The header tells that the client likes UTF-8 most (which is an encoding), and thinks it can also handle all kinds of charsets / encodings, with a slight preference for windows-1251, cp1251 and koi8-r. Nowadays most clients can handle UTF-8 encoded content - for other clients you should either use plain ASCII, which should work in most cases, or transform the contents to the requested encoding before sending.

There is also an HTTP header Accept-Encoding, which actually does not have anything to do with the encodings we talk about in this FAQ, but contains a list of usable compression formats.
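
To pick an encoding from an Accept-Charset header you first need to parse it. The following is a simplified sketch which ignores some corner cases of the HTTP specification; parseAcceptCharset() is a hypothetical helper, not a PHP function.

```php
<?php
// Hypothetical helper: parse an Accept-Charset header value into an array
// which maps lowercased charset names to their quality, best match first.
function parseAcceptCharset( $header )
{
    $charsets = array();
    foreach ( explode( ',', $header ) as $entry )
    {
        $parts = explode( ';', trim( $entry ) );
        $name  = strtolower( trim( $parts[0] ) );
        if ( $name === '' )
        {
            continue;
        }

        // A missing q parameter means a quality of 1.0.
        $quality = 1.0;
        foreach ( array_slice( $parts, 1 ) as $parameter )
        {
            if ( preg_match( '/^\s*q\s*=\s*([0-9.]+)\s*$/i', $parameter, $match ) )
            {
                $quality = (float) $match[1];
            }
        }
        $charsets[$name] = $quality;
    }

    // Sort by quality, highest first, keeping the names as keys.
    arsort( $charsets );
    return $charsets;
}

$header   = 'utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5';
$charsets = parseAcceptCharset( $header );
$names    = array_keys( $charsets );

echo $names[0], "\n";
// Output: utf-8
```

In a web request you would pass $_SERVER['HTTP_ACCEPT_CHARSET'], if it is set, instead of the fixed string.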

How do I ensure the correct overall charset/encoding in my web application?

You need to ensure that you know the correct charset at every point of your application. Recode all input strings directly when receiving them - maybe in a central input handler where you convert everything to the charset you use consistently in your application. For this read: Which charset does $client send?.

Once all content follows a defined encoding, pay attention that the backend uses the same encoding and also returns it. Check the databases section for this, depending on the type of backend you use.

Finally, ensure that you declare this encoding in the output. Check How do I send the correct charset/encoding for $client? for details.
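
A central input handler could be sketched like this. It assumes that the application uses UTF-8 internally and that the mbstring extension is loaded. The function name and the fallback encoding are assumptions for this example; the fallback is only a guess, since the encoding of a string cannot be detected reliably.

```php
<?php
// Hypothetical input handler: returns $value as valid UTF-8, or null if it
// is neither valid UTF-8 nor valid in the assumed fallback encoding.
function normalizeInput( $value, $fallbackEncoding = 'ISO-8859-1' )
{
    if ( mb_check_encoding( $value, 'UTF-8' ) )
    {
        return $value;
    }

    if ( mb_check_encoding( $value, $fallbackEncoding ) )
    {
        return mb_convert_encoding( $value, 'UTF-8', $fallbackEncoding );
    }

    return null;
}

// "ä" sent as ISO 8859-1 (one byte) is recoded to UTF-8 (two bytes).
var_dump( normalizeInput( "\xE4" ) === "\xC3\xA4" ); // bool(true)

// Input which already is valid UTF-8 is passed through unchanged.
var_dump( normalizeInput( "\xC3\xA4" ) === "\xC3\xA4" ); // bool(true)
```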

Does accept-charset help in HTML forms?

Short answer: No.

HTML forms may define the attribute accept-charset, which tells the browser which encoding to use when sending data to the server. The funny thing is, that it breaks horribly and is not handled properly by any browser. A form using this might look like:

<form action="..." accept-charset="$encoding">
   <legend>Test form using $encoding</legend>

   <input type="hidden" name="hidden_input" value="..." />
   <label>Input string
     <input type="text" name="user_input" />
   </label>

   <label>
     Submit <input type="submit" value="Submit" />
   </label>
</form>

Testing this with different browsers on a website which is encoded using UTF-8 gives the following results. The "accept-charset" column contains the value specified in the accept-charset attribute of the form. The "Contained string" column shows the string passed in by the browser for the hidden field, while the "Pasted string" is something the user entered.

Browser	accept-charset	Contained string	Pasted string	Received
Firefox/3.0.3	ISO 8859-15	��	��☠	ISO, entity
Firefox/3.0.3	UTF-16	öäü	öäüß☠	UTF-8
Firefox/3.0.3	UTF-8	öäü	öäüß☠	UTF-8
Opera/9.61	ISO 8859-15	��	��☠	ISO, entity
Opera/9.61	UTF-16	öäü	öäüß☠	UTF-8
Opera/9.61	UTF-8	öäü	öäüß☠	UTF-8
Konqueror/4.1	ISO 8859-15	��	��☠	ISO, entity
Konqueror/4.1	UTF-16	∅	∅	Nothing
Konqueror/4.1	UTF-8	öäü	öäüß☠	UTF-8
MSIE 7.0	ISO 8859-15	öäü	öäüß☠	UTF-8
MSIE 7.0	UTF-16	öäü	öäüß☠	UTF-8
MSIE 7.0	UTF-8	öäü	öäüß☠	UTF-8

Firefox and Opera at least send the correct values for the ISO 8859-15 encoding, while characters which are not contained in the associated charset are represented by their decimal XML entities - which is OK.

When requesting the value as UTF-16, no browser does it right. Konqueror 4.1 does not send anything at all, and all other browsers just keep sending UTF-8 strings.

Internet Explorer completely ignores the accept-charset attribute and always sends the values in the defined document encoding.

Generally speaking: It will not hurt if you define accept-charset with the same encoding you use on the site anyway, but it will not buy you anything either. Defining an encoding which differs from the site encoding does not work consistently across different browsers.

PHP related

Which charset/encoding do strings have in PHP?

PHP 5

Short answer: None.

Longer answer: PHP does not maintain any charset information for strings. Strings are just arrays of bytes in PHP. This is especially problematic for multibyte encodings, because there is a strict difference between character and byte context, while a character always equals a byte in singlebyte encodings.

You should try to maintain the information about the encoding of a string yourself. The easiest way is to ensure that only one single encoding is used in your complete application / framework / backend.
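
The difference between byte context and character context is easy to demonstrate. The following sketch assumes a UTF-8 encoded source file and the mbstring extension:

```php
<?php
$string = 'ⅱ'; // one character, three bytes in UTF-8 (0xE2 0x85 0xB1)

// strlen() counts bytes, not characters.
var_dump( strlen( $string ) ); // int(3)

// Byte based functions can cut right through a character ...
$cut = substr( $string, 0, 2 );

// ... which leaves a byte sequence that is no longer valid UTF-8.
var_dump( mb_check_encoding( $cut, 'UTF-8' ) ); // bool(false)

// Encoding aware functions work on whole characters.
var_dump( mb_substr( $string, 0, 1, 'UTF-8' ) === $string ); // bool(true)
```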

See also:

How do I iterate characterwise over a string?
Which charset does $client send?
How do I ensure the correct overall charset/encoding in my web application?
How do I determine the charset/encoding of a string?
What does "multibyte charset/encoding" mean?

How do I change the encoding of a string?

PHP 5

There are several functions which offer easy encoding conversions, but there are also some stumbling blocks you should remember.

The basic conversion for example may be done by the functions iconv() or mb_convert_encoding(), but the following problems may occur:

The destination encoding only implements a charset subset of the source encoding
The source string contains invalid data. Not all sequences of bytes are valid in each encoding.

The iconv() function knows the two suffixes //TRANSLIT and //IGNORE, which are appended to the destination encoding string and define how to handle characters missing in the destination encoding: //TRANSLIT replaces them with similar characters, //IGNORE silently drops them. An example could look like:

<?php
$string = "Some string‽";
echo iconv( "UTF-8", "ISO-8859-1//TRANSLIT", $string );

// Output: Some string?
?>

Invalid characters lead to warnings during conversions.
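
For comparison, the same conversion using mb_convert_encoding() could look like the following sketch. It assumes that the mbstring extension is loaded and that its substitute character is left at the default, a question mark. Note the different argument order compared to iconv().

```php
<?php
$string = 'Some string‽';

// Argument order: string, destination encoding, source encoding.
$latin1 = mb_convert_encoding( $string, 'ISO-8859-1', 'UTF-8' );
echo $latin1, "\n";
// Output: Some string?

// Converting back cannot restore the character which was lost.
var_dump( mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' ) === $string );
// bool(false)
```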

utf8_(de|en)code

There is a special handler for ISO 8859-1 <-> UTF-8 en- and decoding in PHP: the two functions utf8_decode() and utf8_encode(). utf8_decode() converts UTF-8 encoded strings to ISO 8859-1, while utf8_encode() does the opposite:

<?php
$string = "Some string‽";
echo utf8_decode( $string );

// Output: Some string?
?>

Characters which have no equivalent in ISO 8859-1 are replaced by question marks. These functions are therefore quite useless: they only know about two encodings, and ISO 8859-1 can only encode 256 characters, which is a really small subset of Unicode. You should always use iconv() instead, since you have far more control over the conversion this way.

See also:

What does string transliteration mean?

How do I iterate characterwise over a string?

PHP 5

This is a bit harder than you might think at first glance. First: You need to know the actual encoding of the string you want to iterate over. If it is a singlebyte encoding like ISO 8859-* you can just iterate bytewise over the string - see the next question for details on this.

If it is a multibyte encoding there are basically three options for you:

Convert to an encoding which uses the same number of bytes for each character.
- Choose the encoding you convert to depending on the charset of your strings. Safe for most cases are the four byte encodings UCS-4 or UTF-32, which can encode the full Unicode charset and always use four bytes per character, which makes iterating over the string trivial.
- You might be able to convert your string to a singlebyte encoding like ISO 8859-1, if this covers all characters you use. Beware of possible information loss here.
Use the functions of a multibyte extension to extract string parts, for example mb_substr().
Respect the encoding specific multibyte characteristics while scanning the string. This depends a lot on the encoding you use and involves lots of checks, so this might be the slowest solution.

An example using iconv for conversion for the first method could look like:

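<?php
$string = 'Some string‽';
// Use an explicit byte order: plain 'UTF-32' may prepend a byte order mark,
// which would show up as an extra four byte "character" in the loop.
$fixedWidthString = iconv( 'UTF-8', 'UTF-32BE', $string );
$bytes = strlen( $fixedWidthString );
for ( $byte = 0; $byte < $bytes; $byte += 4 )
{
    // Each group of four bytes is exactly one character.
    echo iconv( 'UTF-32BE', 'UTF-8', substr( $fixedWidthString, $byte, 4 ) ), ' ';
}

// Output: S o m e   s t r i n g ‽
?>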

As you can see from the output we iterated characterwise over the string and shouldn't have any loss of information, as both encodings are defined for the same charset.
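
The second option, using the functions of a multibyte extension, could be sketched as follows. It assumes that the mbstring extension is loaded; splitCharacters() is a made-up helper name. Since mb_substr() has to scan the string from the start on each call, this is probably not the fastest solution for long strings.

```php
<?php
// Hypothetical helper: split a string into an array of characters.
function splitCharacters( $string, $encoding = 'UTF-8' )
{
    $characters = array();
    $length     = mb_strlen( $string, $encoding );
    for ( $position = 0; $position < $length; ++$position )
    {
        $characters[] = mb_substr( $string, $position, 1, $encoding );
    }
    return $characters;
}

echo implode( ' ', splitCharacters( 'Some string‽' ) ), "\n";
// Output: S o m e   s t r i n g ‽
```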

See also:

Which charset/encoding do strings have in PHP?
How do I iterate bytewise over a string?
How do I change the encoding of a string?
How do I determine the charset/encoding of a string?
What is the difference between a charset and an encoding?

How do I iterate bytewise over a string?

PHP 5

That's easy. As strings do not have any associated charset or encoding in PHP, but are only arrays of bytes, the common way of iterating over a string is the correct way, as this example shows:

<?php
$string = 'Some string‽';
$bytes  = strlen( $string );
for ( $byte = 0; $byte < $bytes; ++$byte )
{
    echo '0x', dechex( ord( $string[$byte] ) ), ' ';
}

// Output: 0x53 0x6f 0x6d 0x65 0x20 0x73 0x74 0x72 0x69 0x6e 0x67 0xe2 0x80 0xbd
?>

This echoes 14 hexadecimal numbers, including the three numbers for the last character, which is a UTF-8 multibyte character.

See also:

How do I iterate characterwise over a string?
What does "multibyte charset/encoding" mean?

How do I determine the length of a string?

PHP 5

Since PHP does not have any charset or encoding associated with strings, you can only easily check for the number of bytes a string consists of. This is what strlen() does for you.

If you know that the encoding used for the string is a singlebyte encoding, like ISO 8859-*, you can also just use strlen(), as each byte maps to exactly one character.

For multibyte encodings this approach does not work well. Here basically the same applies as for iterating over the characters of a string:

You might want to convert to (or use only) encodings with a fixed number of bytes per character in your application. With these you can just divide the value returned by strlen() by the number of bytes per character - in case of UTF-32 this would be 4, for example.
In the charset handling extensions there are special functions which help you determine the number of characters in a string depending on the encoding. These are:
- int iconv_strlen ( string str [, string charset] )
- int mb_strlen ( string str [, string encoding] )
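
The different approaches side by side, as a short sketch which assumes a UTF-8 encoded source file and the iconv and mbstring extensions:

```php
<?php
$string = 'Some string‽'; // UTF-8 encoded

var_dump( strlen( $string ) );                // int(14), number of bytes
var_dump( mb_strlen( $string, 'UTF-8' ) );    // int(12), number of characters
var_dump( iconv_strlen( $string, 'UTF-8' ) ); // int(12), number of characters

// Fixed width approach: UTF-32 uses four bytes per character. The explicit
// byte order (BE) avoids a byte order mark, which would be counted, too.
$utf32 = iconv( 'UTF-8', 'UTF-32BE', $string );
var_dump( strlen( $utf32 ) / 4 );             // int(12)
```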

See also:

Which charset/encoding do strings have in PHP?
How do I iterate characterwise over a string?
How do I iterate bytewise over a string?
How do I change the encoding of a string?
How do I determine the charset/encoding of a string?

Why shouldn't I use htmlentities?

PHP 5

Even though you ensure the correct encoding throughout your complete application, you still end up with strange characters on your website? This normally means that the function used for escaping your output was not aware of the charset of the string. In case of XML / (X)HTML this probably means that you ignored the third parameter of the htmlentities() and htmlspecialchars() functions, which allows you to specify the encoding of the string. This is especially important for multibyte encodings like UTF-8, as you can see in the following example:

<?php
var_dump( htmlentities( 'Ü' ) );
// string(9) "&Atilde;�"
var_dump( htmlspecialchars( 'Ü', ENT_QUOTES, 'UTF-8' ) );
// string(2) "Ü"
?>

For UTF-8 this should not happen with plain htmlspecialchars(), because the characters which are escaped by htmlspecialchars() are never part of multibyte characters in UTF-8, but it may happen with other multibyte encodings.

When the characters are available in the charset / encoding reported to the client, you don't actually need to convert special characters to entities. Summary:

Use htmlspecialchars(), because it is enough if you set the charset for the client.
Specify your used encoding for htmlspecialchars(), otherwise it might mess up your characters.
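
One way to make sure the encoding is never forgotten is a small wrapper which is used for all output. This is a sketch which assumes that the application uses UTF-8 throughout; escapeHtml() is a made-up name.

```php
<?php
// Hypothetical output helper: escapes a UTF-8 string for use in (X)HTML.
function escapeHtml( $string )
{
    return htmlspecialchars( $string, ENT_QUOTES, 'UTF-8' );
}

echo escapeHtml( '<b>Ü & "Ö"</b>' ), "\n";
// Output: &lt;b&gt;Ü &amp; &quot;Ö&quot;&lt;/b&gt;
```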

See also:

What does "multibyte charset/encoding" mean?
Why do I have such strange characters on my website?
How do I send the correct charset/encoding for $client?
How do I ensure the correct overall charset/encoding in my web application?

Which encoding should I use for my source files?

PHP 5

Each ASCII compatible encoding will do the job. All language constructs in PHP only use characters defined in ASCII, which are contained in, and mapped to the same bytes by, all ISO 8859-* charsets/encodings and UTF-8.

For your own (variable, class & function) names you may also use bytes with a value greater than 0x7f, which means that you can embed special characters, and even use the full Unicode charset with UTF-8 for your function names. Since PHP handles function names binary-safe, you need to ensure that all your files are edited with the same encoding configured in your editor.

You can specify the encoding of your PHP files explicitly, but this has different issues in different versions of PHP. See the manual for details.

Databases

How to ensure using the right encoding in my MySQL database?

For MySQL there are two things relevant to maintain the correct encoding of your content:

The client / connection encoding

You can set this either globally in your client configuration file, which should reside somewhere like /etc/mysql/my.cnf. There you should add the following lines:

[mysql]
default-character-set=utf8

Where "utf8" is your desired encoding.

You can also set the connection charset only for one connection, by sending the following query before you send data to the database server:

SET NAMES utf8
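
When you connect from PHP using the mysqli extension, you can also set the connection charset with its set_charset() method instead of sending the query yourself. The following is only a sketch: the function name is made up, and the credentials in the usage comment are placeholders.

```php
<?php
// Sketch only: $connection is assumed to be an open mysqli connection.
function useUtf8Connection( mysqli $connection )
{
    // set_charset() also informs the client library about the charset, so
    // that real_escape_string() escapes according to the connection charset.
    if ( !$connection->set_charset( 'utf8' ) )
    {
        throw new RuntimeException(
            'Could not set connection charset: ' . $connection->error
        );
    }
}

// Usage, with placeholder credentials:
// $connection = new mysqli( 'localhost', 'user', 'password', 'database' );
// useUtf8Connection( $connection );
```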

Database / table encoding

The database and table encoding only defines how the data is actually stored in MySQL and does not relate to the encoding returned to you when querying the data. Everything should be fine if you specified the client / connection encoding correctly and the table and database encoding can handle the same charset as your application uses.

For columns / tables, where you don't need the full unicode charset you should use a minimal charset to keep the amount of used memory and keysizes low.

See also:

What is the difference between a charset and an encoding?
Which charset does $client send?
MySQL 5.0 Reference Manual :: 9.1 Character Set Support

How to ensure using the right encoding in my PostgreSQL database?

Author: Malte Schirmacher

You can set this globally in your server configuration file, which should reside somewhere like /etc/postgresql/<pg-version>/main/postgresql.conf.

There you can add the following line:

client_encoding = utf8

If this option is not set, the encoding defaults to the database encoding.

You can also specify the client encoding per connection using one of the following SQL commands:

SET CLIENT_ENCODING TO 'UTF-8';
SET NAMES 'UTF-8';

Or, when you use psql, with the command:

\encoding UTF-8

While the following command resets the client encoding to the default encoding:

RESET client_encoding;

Since PostgreSQL handles all string types ((VAR)CHAR, TEXT) characterwise (and not bytewise), interpreting them according to the client encoding you set (and that is independent of the database encoding!), you can store anything without loss if you always set the correct client encoding consistently.

A nice feature of PostgreSQL is that it will convert from one encoding to another if you set a different client encoding when receiving a string from the database than you used when storing it. Although not all encoding combinations are available for transcoding, it can transcode any encoding to utf8, so you can receive only utf8 encoded data from the database regardless of the encoding used to save the string data.

For more information on which encodings-combinations for transcoding are available and how to manage the charset handling in general see the following links:

PostgreSQL 8.3.3 Documentation - 22.2. Character Set Support

Comments

Balázs Bárány at Tuesday, 3.6. 2008

To set the encoding in PostgreSQL similarly, you can do the following:

Create your database using WITH ENCODING 'yourencoding':

create database enctest with encoding='utf-8';

You can display or set the encoding in your psql prompt with \encoding or "set client_encoding to 'encoding'"

template1=# \encoding
UTF8
template1=# set client_encoding to 'iso-8859-15';
SET
template1=# \encoding
LATIN9

# This is normal; LATIN9 is PostgreSQL's name for "West European with Euro".

You can use the "set client_encoding" form in your scripts to choose an encoding. (PostgreSQL can convert between encodings, so you can have a future-proof UTF-8 database but for the moment work with PHP scripts and webpages in ISO-8859-15 or other single-byte character sets.)

You can also set the default client encoding for a database:
ALTER DATABASE enctest SET client_encoding='iso-8859-15';

linuxamp at Thursday, 2.10. 2008

Interesting writeup. I have always been using the terms without really understanding the differences.

Thanks

Sajal at Thursday, 15.1. 2009

Awesome !!!!

Nicolas Grekas at Saturday, 14.2. 2009

About "accept-charset", have you tried the reverse approach ? As UTF-8 is not handled like any other charset by browsers, it may be interesting :

what I mean is :

build an UTF-8 page, with accept-charset="UTF-8"
then manually change the page encoding in the browser GUI (menu "Display" > "Encoding" in Firefox)
then fill form fields in the page
what is the encoding received on the server ?

Martin at Monday, 23.3. 2009

Thank you Kore, great compendium!

Leon at Monday, 31.8. 2009

Premium quality content!! Vielen Dank! Much kudos 2U from Bavaria!

Alfonso at Monday, 9.11. 2009

Always been confused on this matter reading other tutorials, but yours is very very clear! Thank you!

Thijs Feryn at Thursday, 28.1. 2010

Hi Kore

Thanks for this lovely FAQ document. I've run into tons of charset issues over the years and more recently I managed to store UTF8 & ISO data in one MySQL table (by accident I must admit).

Fixing this caused me such a headache that I written a blog article about it. It deals with a lot of stuff you have listed nicely in this FAQ.

Check it out: http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/

I will surely talk to you about this topic during the speakers dinner at phpbnl10 tomorrow night.

Much respect Thijs Feryn

ivan at Sunday, 28.3. 2010

the best article about PHP over the internet Thanks a lot

spaze at Sunday, 6.6. 2010

Preferred way of setting the connection character set is using the mysql_set_charset() (mysqli_set_charset etc.) function http://php.net/mysql_set_charset. This way the escaping functions know the used character set and are escaping properly even the characters which would normally allow for SQL injection in some character sets like GBK.

In other words, SET NAMES does not set the character set for mysql_real_escape_string() and so not all required characters are escaped making a space for SQLi. Ilia has proof of concept over there http://ilia.ws/archives/103-mysql_real_escape_string-versus-Prepared-Statements.html

shaffy at Tuesday, 10.8. 2010

Awesome

Iain Cambridge at Wednesday, 10.11. 2010

Nice, shame your flattr isn't working otherwise I would have clicked :(

Georgi at Thursday, 16.8. 2012

You have no idea how much I love you right now for mentioning htmlentities. Bookmarked for future reference.

David Spector at Monday, 29.2. 2016

I think the most efficient way to process each character in a UTF-8 (or similarly encoded) string would be to work through the string using mb_substr. In each iteration of the processing loop, mb_substr would be called twice (to find the next character and the remaining string). It would pass only the remaining string to the next iteration. This way, the main overhead in each iteration would be finding the next character (done twice), which takes only one to five or so operations, depending on the byte length of the character.

If this description is not clear, let me know and I'll provide a working PHP function.
