PHP charset/encoding FAQ - Kore Nordmann - PHP / Projects / Politics

By Kore Nordmann, first published at Fri, 30 May 2008 09:10:32 +0200

PHP Charset FAQ

If the FAQ was helpful to you, you can order me a thank you here: http://wishlist.kore-nordmann.de/

General

What is the difference between Unicode and UTF-8/UTF-16/...?

Unicode is a charset, which means just a set of characters, which says nothing about how the characters are actually stored (mapped to bytes).

UTF-8 / UTF-16 / ... are encodings which define how a character is mapped to bytes in a string or byte array.

UTF-8, UTF-16 and UTF-32 differ mainly in the number of bytes used to encode a character. UTF-8 uses one byte for the characters defined in ASCII and two to four bytes for all other characters. UTF-16 uses two bytes for the most common characters and four bytes for the rest. UTF-32 always uses four bytes per character, which makes iterating over the characters in a string trivial, but consumes much more space for common strings. Choosing the right default encoding for your application is not trivial; it depends on which strings are common in your application and how they are used.

What is the difference between a charset and an encoding?

A charset is a set of characters which can be represented in a certain encoding. The encoding actually defines which bytes are used for a certain character.

An example: The character '☯' is available in the Unicode charset, and probably in other charsets, too. But there are different encodings for this character, even for the same charset, like the following:

Unicode character: ☯
UTF-8 encoded:     0xE2 0x98 0xAF
UTF-16 encoded:    0x26 0x2F
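The byte sequences above can be reproduced with a few lines of PHP. This is only a sketch: it assumes the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escape), and it uses the big-endian variants UTF-16BE and UTF-32BE so that no byte order mark is emitted.

```php
<?php
// U+262F YIN YANG, written as an escape so the file encoding does not matter.
$char = "\u{262F}";

foreach ( array( 'UTF-8', 'UTF-16BE', 'UTF-32BE' ) as $encoding )
{
    // Convert from UTF-8 (the encoding of $char) to the target encoding.
    $bytes = mb_convert_encoding( $char, $encoding, 'UTF-8' );
    printf( "%s: %d bytes, hex %s\n", $encoding, strlen( $bytes ), bin2hex( $bytes ) );
}
// UTF-8: 3 bytes, hex e298af
// UTF-16BE: 2 bytes, hex 262f
// UTF-32BE: 4 bytes, hex 0000262f
```

The same character from the same charset ends up as three different byte sequences, which is exactly the difference between a charset and an encoding.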

What is the difference between a character and a byte?

Generally a byte, or a sequence of bytes, is just the internal representation of a character, depending on the encoding used. The encoding maps characters to bytes or byte sequences.

In singlebyte encodings each character in the charset maps to exactly one byte, so bytes and characters are easily confused, because they always represent the same thing.

Multibyte encodings like UTF-8 or UTF-16, on the other hand, are more common nowadays and map characters to multiple bytes, so that the representations of different characters may contain the same bytes:

ⅱ, UTF-8 encoded: 0xE2 0x85 0xB1
ⅲ, UTF-8 encoded: 0xE2 0x85 0xB2

As you can see, the first two bytes used for both characters are the same, while only the third byte differs.

How do I determine the charset/encoding of a string?

There is no way to do this right.

For example, every combination of bytes is a valid string in all ISO 8859-* encodings, so there is no way to tell which of them was used. You may guess the encoding if you know something about the contents of a string, for example by detecting multiple expected occurrences of some uncommon characters.

UTF-8 multibyte character sequences do have some characteristics you can check for, but each UTF-8 string may also be an ISO 8859-* string. To check if a string is a valid sequence of UTF-8 encoded characters you could use the following regular expression. This won't actually tell you whether the string is UTF-8, though; it still might be in nearly any other encoding:

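(^(?:
    [\x00-\x7f]
  | [\xc2-\xdf][\x80-\xbf]
  | [\xe0-\xef][\x80-\xbf]{2}
  | [\xf0-\xf4][\x80-\xbf]{3}
)*$)x

Continuation bytes in UTF-8 are always in the range 0x80 to 0xBF, and the lead bytes 0xC0, 0xC1 and 0xF5 and above never occur. The expression is still only an approximation, since it accepts some overlong sequences.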

Don't use it on big strings though, it may crash PCRE.
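If the mbstring extension is available, mb_check_encoding() performs the same kind of validity check without a hand-written regular expression. The following is a minimal sketch; the helper name isValidUtf8() is made up for this example, and the same caveat applies: a positive result only means the bytes could be UTF-8.

```php
<?php
// Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
function isValidUtf8( string $bytes ): bool
{
    return mb_check_encoding( $bytes, 'UTF-8' );
}

var_dump( isValidUtf8( "\xE2\x98\xAF" ) ); // bool(true), the UTF-8 bytes of ☯
var_dump( isValidUtf8( "\xE4bc" ) );       // bool(false), ISO 8859-1 "äbc"

// Valid UTF-8 for "ä", but just as well the two ISO 8859-1 characters "Ã¤".
var_dump( isValidUtf8( "\xC3\xA4" ) );     // bool(true)
```

The last call shows why validation is not detection: the same two bytes are meaningful in both encodings.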

What does "multibyte charset/encoding" mean?

A multibyte encoding uses not only one but multiple bytes for one character. The number of bytes used for one character may be variable, like in UTF-8 and UTF-16, or fixed, like in UTF-32.

In a singlebyte encoding only 256 different characters can be represented, as this is the number of different values one byte can have (2^8). This number of characters is not sufficient for lots of languages, and especially not when you try to fit the characters of multiple languages into one charset and encoding (Unicode and UTF-*).

What does string transliteration mean?

When converting between different charsets it may happen that not all characters of the source string are available in the destination charset. In this case transliteration aims to provide another character or sequence of characters which sufficiently replaces the source character in the destination charset. Common transliterations for the German umlaut ä may be:

ä => ae
ä => a

Transliteration in PHP

In PHP you can transliterate strings using different functions.

  1. The iconv() function supports very basic transliteration depending on the installed locales. Unknown characters are transliterated when you append the string //TRANSLIT to the destination encoding, as shown in the conversion example: How do I change the encoding of a string?.

  2. The extension pecl/translit offers transliterations between several charsets, not depending on installed locales on your system. Check out its documentation for details.

HTTP related

Why do I have such strange characters on my website?

The content you send is encoded in a different encoding than the one specified for the client, or than the one the client detects.

When speaking of websites we talk about browsers in most cases, which determine the encoding of a website based on two factors:

  • The Content-Type header sent by the webserver

  • The content-type meta tag in the (X)HTML header

To ensure that the browser detects the correct charset and encoding of your website you should set both to the same value. The content type header sent by your website could either be configured in the web server configuration, or sent explicitly by PHP, using something like:

header( 'Content-type: text/html; charset=utf-8' );

Which charset does $client send?

In most cases the client sends the encoding specified for the website. This works only if the browser could determine the encoding unambiguously.

If the browser can't detect the specified charset, you cannot really know the charset of the input strings. This is one reason why you should respect the HTTP header "Accept-Charset", even though most browsers nowadays know about UTF-8.

In every case there may be misbehaving browsers, or clients which just try to feed your application with invalid data. That's why you should try to gracefully handle strings, which do not match your expected encoding.

How do I send the correct charset/encoding for $client?

Most HTTP clients send a header with a list of encodings/charsets they can understand and prefer. The header is called Accept-Charset and is available in PHP in $_SERVER['HTTP_ACCEPT_CHARSET']. Most clients actually send lists of encodings they understand instead of lists of charsets. A typical header may look like:

utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5

The header tells you that the client likes UTF-8 most (which is an encoding), and thinks it can also handle all other kinds of charsets / encodings, with a slight preference for windows-1251, cp1251 and koi8-r. Nowadays most clients can handle UTF-8 encoded content. For other clients you should either use plain ASCII, which should work in most cases, or transform the contents to the requested encoding before sending.

There is also a HTTP header Accept-Encoding, which actually does not have anything to do with the encodings we talk about in this FAQ, but contains a list of usable compression formats.
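Reading the Accept-Charset header could be sketched roughly as follows. The function name parseAcceptCharset() is hypothetical, the parsing is deliberately simplified (it only looks at the q parameter), and entries with equal quality are not guaranteed to keep their original order on PHP versions before 8.0.

```php
<?php
// Hypothetical helper: turn an Accept-Charset value into an array of
// lowercased charset name => quality, sorted by descending quality.
function parseAcceptCharset( string $header ): array
{
    $charsets = array();
    foreach ( explode( ',', $header ) as $item )
    {
        $parts = explode( ';', trim( $item ) );
        $name  = strtolower( trim( $parts[0] ) );
        if ( $name === '' )
        {
            continue;
        }

        // A missing q parameter means full preference.
        $quality = 1.0;
        foreach ( array_slice( $parts, 1 ) as $parameter )
        {
            if ( preg_match( '/^\s*q\s*=\s*([0-9.]+)\s*$/i', $parameter, $match ) )
            {
                $quality = (float) $match[1];
            }
        }
        $charsets[$name] = $quality;
    }

    arsort( $charsets );
    return $charsets;
}

$accepted = parseAcceptCharset(
    'utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5'
);
echo key( $accepted ), "\n"; // utf-8
```

In a real request the input would come from $_SERVER['HTTP_ACCEPT_CHARSET'], which may be missing entirely, so the caller should fall back to its default encoding in that case.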

How do I ensure the correct overall charset/encoding in my web application?

You need to ensure that you know the correct charset at every point of your application. You should recode all input strings directly when receiving them, ideally in a central input handler which converts everything to the one charset you use consistently in your application. For this read: Which charset does $client send?.

Once all content follows a defined encoding, make sure that the backend uses the same encoding and also returns it. Check the databases section for this, depending on the type of backend you use.

Finally, make sure that you declare this encoding in the output. Check How do I send the correct charset/encoding for $client? for details.
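A central input handler as described above could be sketched like this. The function name normalizeInput() is made up, and the sketch rests on two loudly stated assumptions: the application works in UTF-8 internally, and input that is not valid UTF-8 was most likely sent as ISO 8859-1. Adjust both to your setup.

```php
<?php
// Hypothetical input filter: returns the value as UTF-8.
// Assumption: anything that is not valid UTF-8 is treated as $fallback.
function normalizeInput( string $value, string $fallback = 'ISO-8859-1' ): string
{
    if ( mb_check_encoding( $value, 'UTF-8' ) )
    {
        // Already well-formed UTF-8, pass through unchanged.
        return $value;
    }

    return mb_convert_encoding( $value, 'UTF-8', $fallback );
}

echo bin2hex( normalizeInput( "\xC3\xA4" ) ), "\n"; // c3a4, was UTF-8 already
echo bin2hex( normalizeInput( "\xE4" ) ), "\n";     // c3a4, recoded from ISO 8859-1
```

Because the fallback encoding is a guess, this handles misbehaving clients gracefully but cannot guarantee that the recoded text is what the user meant.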

Does accept-charset help in HTML forms?

Short answer: No.

HTML forms may define the attribute accept-charset, which tells the browser which encoding to use when sending data to the server. The funny thing is, that it breaks horribly and is not handled properly by any browser. A form using this might look like:

<form action="..." accept-charset="$encoding">
    <legend>Test form using $encoding</legend>
    <input type="hidden" name="hidden_input" value="..." />
    <label>
        Input string
        <input type="text" name="user_input" />
    </label>
    <label>
        Submit
        <input type="submit" value="Submit" />
    </label>
</form>

Testing this with different browsers on a website which is encoded using UTF-8 gives the following results. The "accept-charset" column contains the value specified in the accept-charset attribute of the form. The "Contained string" column shows the string passed in by the browser for the hidden field, while the "Pasted string" is something the user entered. The "Received" column names the encoding in which the strings actually arrived.

Browser        accept-charset  Contained string  Pasted string  Received
Firefox/3.0.3  ISO 8859-15     ���               ����&#9760;    ISO, entity
Firefox/3.0.3  UTF-16          öäü               öäüß☠          UTF-8
Firefox/3.0.3  UTF-8           öäü               öäüß☠          UTF-8
Opera/9.61     ISO 8859-15     ���               ����&#9760;    ISO, entity
Opera/9.61     UTF-16          öäü               öäüß☠          UTF-8
Opera/9.61     UTF-8           öäü               öäüß☠          UTF-8
Konqueror/4.1  ISO 8859-15     ���               ����&#9760;    ISO, entity
Konqueror/4.1  UTF-16          Nothing
Konqueror/4.1  UTF-8           öäü               öäüß☠          UTF-8
MSIE 7.0       ISO 8859-15     öäü               öäüß☠          UTF-8
MSIE 7.0       UTF-16          öäü               öäüß☠          UTF-8
MSIE 7.0       UTF-8           öäü               öäüß☠          UTF-8

Firefox and Opera at least send the correct values for the ISO 8859-15 encoding, while characters which are not contained in the associated charset are represented by their decimal XML entities - which is OK.

When requesting the value as UTF-16, no browser does it right. Konqueror 4.1 does not send anything at all, and all other browsers just keep sending UTF-8 strings.

Internet Explorer completely ignores the accept-charset="" attribute and always sends the values in the defined document encoding.

Generally speaking: It will not hurt if you define accept-charset with the same encoding you use on the site anyway, but it will not buy you anything either. Defining an encoding which differs from the site encoding does not work consistently across different browsers.

PHP related

Which charset/encoding do strings have in PHP?

PHP 5

Short answer: None.

Longer answer: PHP does not maintain any charset information for strings. Strings are just arrays of bytes in PHP. This is especially problematic for multibyte encodings, because there is a strict difference between character and byte context, while a character always equals a byte in singlebyte encodings.

You should try to maintain the information about the encoding of a string yourself. The easiest way is to ensure that only one single encoding is used in your complete application / framework / backend.

How do I change the encoding of a string?

PHP 5

There are several functions which offer easy encoding conversions, but there are also some stumbling blocks you should remember.

The basic conversion for example may be done by the functions iconv() or mb_convert_encoding(), but the following problems may occur:

  • The destination encoding only implements a charset subset of the source encoding

  • The source string contains invalid data. Not all sequences of bytes are valid in each encoding.

The iconv() function knows the two suffixes //TRANSLIT and //IGNORE, which are appended to the destination encoding string and define how to handle characters which are missing in the destination encoding: //TRANSLIT tries to replace them with similar characters, while //IGNORE drops them. An example could look like:

<?php
$string = "Some string‽";
echo iconv( "UTF-8", "ISO-8859-1//TRANSLIT", $string );
// Output: Some string?
?>

Invalid characters lead to warnings during conversions.

utf8_(de|en)code

There is a special handler for ISO 8859-1 <-> UTF-8 en- and decoding in PHP, the two functions utf8_decode() and utf8_encode(). utf8_decode() converts UTF-8 encoded strings to ISO 8859-1, while utf8_encode() does the opposite:

<?php
$string = "Some string‽";
echo utf8_decode( $string );
// Output: Some string?
?>

Characters which have no equivalent in the charset encoded by ISO 8859-1 are replaced by question marks. These functions are therefore quite useless: they only know about two encodings, and ISO 8859-1 can only encode 256 characters, which is a really small subset of Unicode. You should always use iconv() instead, since you have far more control over the conversion this way. Both functions are also deprecated as of PHP 8.2.

How do I iterate characterwise over a string?

PHP 5

This is a bit harder than you might think at first glance. First: You need to know the actual encoding of the string you want to iterate over. If it is a singlebyte encoding like ISO 8859-* you can just iterate bytewise over the string - see the next question for details on this.

If it is a multibyte encoding there are basically three options for you:

  1. Convert to an encoding which uses the same number of bytes for each character.

    You should choose the encoding you transform to depending on the charset of your strings. The four byte encodings UCS4 and UTF-32 are safe for most cases: they can encode the full Unicode charset and always use four bytes per character, which makes iterating over the string trivial.

    You might be able to convert your string to a singlebyte encoding like ISO 8859-1, if this covers the complete used charset. Beware of possible information loss here.

  2. You can use methods of some multibyte extensions to extract string parts, for example mb_substr().

  3. Respect the encoding specific multibyte characteristics while scanning the string. This depends a lot on the encoding you use and involves lots of checks. So this might be the slowest solution.

An example using iconv for conversion for the first method could look like:

<?php
$string = 'Some string‽';

// UTF-32BE has a fixed byte order and no byte order mark, so each
// four byte chunk is exactly one character.
$fixedWidthString = iconv( 'UTF-8', 'UTF-32BE', $string );
$chars = strlen( $fixedWidthString );

for ( $char = 0; $char < $chars; $char += 4 )
{
    echo iconv( 'UTF-32BE', 'UTF-8', substr( $fixedWidthString, $char, 4 ) ), ' ';
}
// Output: S o m e   s t r i n g ‽
?>

As you can see from the output we iterated characterwise over the string and shouldn't have any loss of information, as both encodings are defined for the same charset.
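The second option from the list, using functions of a multibyte extension, could be sketched as follows. The helper name splitCharacters() is invented for this example; it assumes the mbstring extension and that you know the encoding of the string.

```php
<?php
// Hypothetical helper: split a string into an array of characters
// using mb_strlen() and mb_substr().
function splitCharacters( string $string, string $encoding = 'UTF-8' ): array
{
    $characters = array();
    $length     = mb_strlen( $string, $encoding );

    for ( $position = 0; $position < $length; ++$position )
    {
        // Offsets and lengths are counted in characters, not bytes.
        $characters[] = mb_substr( $string, $position, 1, $encoding );
    }

    return $characters;
}

foreach ( splitCharacters( "Some string\u{203D}" ) as $character )
{
    echo $character, ' ';
}
```

For a variable width encoding like UTF-8, mb_substr() may have to scan from the start of the string on every call, so this can become slow for long strings; the fixed width conversion shown above avoids that.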

How do I iterate bytewise over a string?

PHP 5

That's easy. As strings do not have any associated charset or encoding in PHP, but are only arrays of bytes, the common way of iterating over a string is the correct way, as this example shows:

<?php
$string = 'Some string‽';
$bytes = strlen( $string );

for ( $byte = 0; $byte < $bytes; ++$byte )
{
    echo '0x', dechex( ord( $string[$byte] ) ), ' ';
}
// Output: 0x53 0x6f 0x6d 0x65 0x20 0x73 0x74 0x72 0x69 0x6e 0x67 0xe2 0x80 0xbd
?>

This echoes 14 hexadecimal numbers, including the three numbers for the last character, which is a UTF-8 multibyte character.

How do I determine the length of a string?

PHP 5

Since PHP does not have any charset or encoding associated with strings, you can only easily check for the number of bytes a string consists of. This is what strlen() does for you.

If you know that the encoding used for the string is a singlebyte encoding, like ISO 8859-*, you can also just use strlen(), as each byte maps to exactly one character.

For multibyte encodings this approach does not work well. Here basically the same applies as for iterating over the characters of a string:

  1. You might want to convert to (or use only) encodings with a fixed number of bytes per character in your application. Then you can just divide the value returned by strlen() by the number of bytes per character - in case of UTF-32 this would be 4, for example.

  2. In the charset handling extensions there are special functions which help you determine the number of characters in a string depending on the encoding. These are mb_strlen() and iconv_strlen().
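The difference between counting bytes and counting characters can be seen in a short sketch. It assumes the mbstring and iconv extensions are loaded and uses the "\u{...}" escape of PHP 7 so that the result does not depend on the encoding of the source file.

```php
<?php
// "Some string" followed by U+203D INTERROBANG, encoded as UTF-8.
$string = "Some string\u{203D}";

echo strlen( $string ), "\n";                // 14, the number of bytes
echo mb_strlen( $string, 'UTF-8' ), "\n";    // 12, the number of characters
echo iconv_strlen( $string, 'UTF-8' ), "\n"; // 12, the number of characters

// Fixed width variant: four bytes per character in UTF-32.
$fixedWidth = mb_convert_encoding( $string, 'UTF-32BE', 'UTF-8' );
echo strlen( $fixedWidth ) / 4, "\n";        // 12
```

Passing the encoding explicitly avoids relying on the configured internal encoding of the extension.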

Why shouldn't I use htmlentities?

PHP 5

Even though you ensure the correct encoding throughout your complete application, you end up with strange characters on your website? This normally means that the function used for escaping your output was not aware of the charset of the string. In the case of XML / (X)HTML this probably means that you ignored the third parameter of the htmlentities() and htmlspecialchars() functions, which allows you to specify the encoding of the string. This is especially important for multibyte encodings like UTF-8, as you can see in the following example:

<?php
var_dump( htmlentities( 'Ü' ) );
// string(9) "&Atilde;�"

var_dump( htmlspecialchars( 'Ü', ENT_QUOTES, 'UTF-8' ) );
// string(2) "Ü"
?>

For UTF-8 this should not happen with plain htmlspecialchars() - because the characters, which are escaped by htmlspecialchars() are not part of multibyte characters in UTF-8, but it may happen with other multibyte encodings.

When the characters are available in the character set / encoding reported to the client, you don't actually need to convert special characters to entities. In summary:

  1. Use htmlspecialchars(), because it is enough if you declare the charset to the client.

  2. Specify the encoding you use when calling htmlspecialchars(), otherwise it might mess up your characters.
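Both points can be combined in a small wrapper, so the encoding cannot be forgotten at individual call sites. This is only a sketch: the function name escapeHtml() is hypothetical, and it assumes the application uses UTF-8 throughout.

```php
<?php
// Hypothetical wrapper: escape a string for (X)HTML output.
// Assumption: all strings in the application are UTF-8 encoded.
function escapeHtml( string $string ): string
{
    return htmlspecialchars( $string, ENT_QUOTES, 'UTF-8' );
}

// Only the markup characters are escaped, the umlaut stays as it is.
echo escapeHtml( "<b>\u{DC}ber & \"co\"</b>" ), "\n";
// &lt;b&gt;Über &amp; &quot;co&quot;&lt;/b&gt;
```

Note that htmlspecialchars() returns an empty string for input that is not valid in the given encoding unless a flag such as ENT_SUBSTITUTE is added, which is one more reason to validate input early.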

Which encoding should I use for my source files?

PHP 5

Each ASCII compatible encoding will do the job. All language constructs in PHP only use the characters defined in ASCII, which are contained in all ISO 8859-* charsets/encodings and in UTF-8, and are mapped to the same bytes there.

For your own (variable, class & function) names you may also use bytes with a value greater than 0x7f, which means that you can embed special characters, and even use the full Unicode charset with UTF-8 for your function names. Since PHP handles function names binary-safe, this means that you need to ensure that all your files are edited with the same encoding configured in your editor.

You can specify the encoding of your PHP files explicitly, but this has different issues in different versions of PHP. See the manual for details.

Databases

How to ensure using the right encoding in my MySQL database?

For MySQL there is only one thing relevant to maintain the correct encoding of your content:

The client / connection encoding

You can set this either globally in your client configuration file, which should reside somewhere like /etc/mysql/my.cnf. There you should add the following lines:

[mysql]
default-character-set=utf8

Where "utf8" is your desired encoding.

You can also set the connection charset only for one connection, by sending the following query before you send data to the database server:

SET NAMES utf8

Database / table encoding

The database and table encoding only defines how the data is actually stored in MySQL and does not relate to the encoding returned to you when querying the data. Everything should be fine if you specified the client / connection encoding correctly, and if the table and database encoding can handle the same charset your application uses.

For columns / tables, where you don't need the full unicode charset you should use a minimal charset to keep the amount of used memory and keysizes low.

How to ensure using the right encoding in my PostgreSQL database?

Author: Malte Schirmacher

You can set this globally in your server configuration file, which should reside somewhere like /etc/postgresql/<pg-version>/main/postgresql.conf.

There you can add the following line:

client_encoding = utf8

If this option is not set, the encoding defaults to the database encoding.

You can also specify the client encoding per connection using one of the following SQL commands:

SET CLIENT_ENCODING TO 'UTF-8';
SET NAMES 'UTF-8';

Or, when you use psql, with the command:

\encoding UTF-8

While the following command resets the client encoding to the default encoding:

RESET client_encoding;

Since PostgreSQL stores all string types ((VAR)CHAR, TEXT) characterwise (and not bytewise) according to the client encoding set (and that is independent of the database encoding!), you can store anything without loss if you always set the correct encoding consistently.

A nice feature of PostgreSQL is that it converts from one encoding to another if you set a different client encoding when receiving a string from the database than you did when storing it. Although not all encoding combinations are available for transcoding, it can transcode any encoding to utf8, so you can receive only utf8 encoded data from the database, regardless of the encoding used to save the string data.

For more information on which encoding combinations are available for transcoding, and on how to manage charset handling in general, see the chapter on character set support in the PostgreSQL documentation.
