Sunday, October 11, 2009

UTF8 in LAMP applications: overview and how to solve the common issues

The problem
In a LAMP application, text is frequently saved to and retrieved from a database and files. We must take into account all the different encodings (the mapping between characters and byte values) that may be involved: latin1 (ISO-8859-1), latin9 (ISO-8859-15), UTF-8 (utf8), etc.
Many applications use an ISO-8859 encoding and rely on PHP functions such as htmlentities and htmlspecialchars to convert special characters.


The solution
Converting all the text from an encoding to another using PHP functions is unsafe, difficult and annoying.
Ad example the character "é" (encoded as iso8859) will printed as "é" if it's supposed to be encoded as utf8.
The solution is to use only one charset for files, Content-Type of the pages and the db. UTF-8 [wiki] is the best choice: a variable-lenght char encoding for the standard Unicode. If you use this charset in the HTML file, it won't need to convert the characters to the respective entities.
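The mismatch between the two encodings can be reproduced at the byte level. The following is only a minimal sketch, assuming the mbstring extension is enabled; the variable names are made up for the illustration.

```php
<?php
// "é" as a single ISO-8859-1 byte and as a two-byte UTF-8 sequence.
$latin1 = "\xE9";
$utf8   = "\xC3\xA9";

// The lone byte 0xE9 is not valid UTF-8, so a page declared as utf8
// cannot display it correctly.
$latin1IsValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

// The opposite mistake: UTF-8 bytes treated as ISO-8859-1 turn into
// two separate characters ("Ã" and "©").
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// An explicit conversion works, but it has to be repeated everywhere
// text enters or leaves the application.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($latin1IsValidUtf8);   // bool(false)
echo bin2hex($converted), "\n"; // c3a9
echo bin2hex($misread), "\n";   // c383c2a9
```

Using one charset everywhere removes the need for conversions like the last one.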

How to use UTF-8
  • set your IDE to open and save source files using utf8 encoding.
  • set the charset in the Content-Type of your application to utf-8 (preferably with an Apache/.htaccess rule rather than the meta tag).
  • set the database server to use utf8 encoding (existing tables must also be converted). If the db is utf8 but the client encoding is latin1, run the query "SET NAMES utf8" before any other query.
  • if the application was using latin1 and PHP conversion functions, remove the existing calls that encode/decode accented characters to/from entities, since they are no longer needed.
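On the PHP side, the Content-Type and database steps above could look roughly like this. It is only a sketch: the host, credentials, schema name and the "articles" table are hypothetical placeholders, and it assumes the mysqli extension and a MySQL server.

```php
<?php
// Declare the charset in the HTTP header. An Apache rule such as
// "AddDefaultCharset UTF-8" in .htaccess achieves the same without code.
header('Content-Type: text/html; charset=utf-8');

// Placeholder connection details (assumptions, not from the article).
$db = new mysqli('localhost', 'app_user', 'secret', 'app_db');

// Make the client connection use utf8 before any other query.
// This has the same effect as running "SET NAMES utf8".
$db->set_charset('utf8');

// One-off conversion of an existing table (hypothetical table name):
// $db->query('ALTER TABLE articles CONVERT TO CHARACTER SET utf8');
```

The conversion query is left commented out because it should be run once, as a migration, and not on every request.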
