I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.
Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this — is there some standard checklist I can follow, or perhaps troubleshoot where the mismatches occur?
This is for a new Linux server, running MySQL 5, PHP 5, and Apache 2.
Data Storage:
Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).
In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters. I wish I were kidding.
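For example, on MySQL ≥ 5.5.3 a table definition might look like this (a sketch; the table and column names are illustrative):

CREATE TABLE posts (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

The table-level character set becomes the default for its text columns, so body is stored as utf8mb4 here.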
Data Access:
In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.
Some drivers provide their own mechanism for configuring the connection character set, which both updates its own internal state and informs MySQL of the encoding to be used on the connection—this is usually the preferred approach. In PHP:
If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN:
$dbh = new PDO('mysql:host=localhost;dbname=example_db;charset=utf8mb4', $user, $pass); // host and dbname are placeholders
If you're using mysqli, you can call set_charset():
$mysqli->set_charset('utf8mb4'); // object oriented style
mysqli_set_charset($link, 'utf8mb4'); // procedural style
If you're stuck with the plain mysql extension but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset().
If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.
The same consideration regarding utf8mb4/utf8 applies as above.
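For example, right after connecting (a sketch, assuming a PDO handle $dbh on a PHP version before 5.3.6):

$dbh->exec("SET NAMES 'utf8mb4'");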
Output:
UTF-8 should be set in the HTTP header, such as Content-Type: text/html; charset=utf-8. You can achieve that either by setting default_charset in php.ini (preferred), or manually using the header() function.
If your application transmits text to other systems, they will also need to be informed of the character encoding. With web applications, the browser must be informed of the encoding in which data is sent (through HTTP response headers or HTML metadata).
When encoding the output using json_encode(), add JSON_UNESCAPED_UNICODE as a second parameter.
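For instance (a minimal sketch; $data stands in for whatever you're returning, and JSON_UNESCAPED_UNICODE requires PHP ≥ 5.4):

header('Content-Type: application/json; charset=utf-8');
echo json_encode($data, JSON_UNESCAPED_UNICODE); // keeps "π" as-is instead of "\u03c0"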
Input:
Browsers will submit data in the character set specified for the document, hence nothing particular has to be done on the input.
In case you have doubts about request encoding (in case it could be tampered with), you may verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
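A minimal sketch of that kind of check (adapt it to however you read input; the 400 response is just one way to reject bad data):

foreach ($_POST as $value) {
    if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
        header('HTTP/1.1 400 Bad Request');
        exit('Request contained invalid UTF-8');
    }
}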
Other Code Considerations:
Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.
You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.
PHP's built-in string operations are not by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
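A quick illustration of the difference (a sketch, assuming the mbstring extension is installed):

$s = 'héllo';
echo strlen($s);                  // 6: counts bytes; "é" is two bytes in UTF-8
echo mb_strlen($s, 'UTF-8');      // 5: counts characters
echo strtoupper($s);              // "HéLLO": only the ASCII letters are uppercased
echo mb_strtoupper($s, 'UTF-8');  // "HÉLLO": correct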
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
I'd like to add one thing to chazomaticus' excellent answer:
Don't forget the META tag either (like this, or the HTML4 or XHTML version of it):
<meta charset="utf-8">
That seems trivial, but IE7 has given me problems with that before.
I was doing everything right; the database, database connection and Content-Type HTTP header were all set to UTF-8, and it worked fine in all other browsers, but Internet Explorer still insisted on using the "Western European" encoding.
It turned out the page was missing the META tag. Adding that solved the problem.
Edit:
The W3C actually has a rather large section dedicated to I18N. They have a number of articles related to this issue – describing the HTTP, (X)HTML and CSS side of things:
FAQ: Changing (X)HTML page encoding to UTF-8
Declaring character encodings in HTML
Tutorial: Character sets & encodings in XHTML, HTML and CSS
Setting the HTTP charset parameter
They recommend using both the HTTP header and HTML meta tag (or XML declaration in case of XHTML served as XML).
In addition to setting default_charset in php.ini, you can send the correct charset using header() from within your code, before any output:
header('Content-Type: text/html; charset=utf-8');
Working with Unicode in PHP is easy as long as you realize that most of the string functions don't work with Unicode, and some might mangle strings completely. PHP considers "characters" to be 1 byte long. Sometimes this is okay (for example, explode() only looks for a byte sequence and uses it as a separator -- so it doesn't matter what actual characters you look for). But other times, when the function is actually designed to work on characters, PHP has no idea that your text contains multi-byte Unicode characters.
A good library to check into is phputf8. This rewrites all of the "bad" functions so you can safely work on UTF-8 strings. There are extensions like the mbstring extension that try to do this for you, too, but I prefer using the library because it's more portable (but I write mass-market products, so that's important for me). But phputf8 can use mbstring behind the scenes, anyway, to increase performance.
Warning: This answer applies to PHP 5.3.5 and lower. Do not use it for PHP version 5.3.6 (released in March 2011) or later.
Compare with Palec's answer to PDO + MySQL and broken UTF-8 encoding.
I found an issue with someone using PDO and the answer was to use this for the PDO connection string:
$pdo = new PDO(
    'mysql:host=mysql.example.com;dbname=example_db',
    'username',
    'password',
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8")
);
In my case, I was using mb_split, which uses regular expressions. Therefore I also had to manually make sure the regular expression encoding was UTF-8 by doing mb_regex_encoding('UTF-8');
As a side note, I also discovered by running mb_internal_encoding() that the internal encoding wasn't UTF-8, and I changed that by running mb_internal_encoding("UTF-8");.
First of all, if you are on a PHP version before 5.3, then no. You've got a ton of problems to tackle.
I am surprised that no one has mentioned the intl extension, which has good support for Unicode, graphemes, string operations, localisation and much more; see the overview below and the short sketch after it.
I will quote some information about Unicode support in PHP from Elizabeth Smith's slides at PHPBenelux '14:
INTL
Good:
Wrapper around ICU library
Standardised locales, set locale per script
Number formatting
Currency formatting
Message formatting (replaces gettext)
Calendars, dates, time zone and time
Transliterator
Spoofchecker
Resource bundles
Convertors
IDN support
Graphemes
Collation
Iterators
Bad:
Does not support zend_multibyte
Does not support HTTP input output conversion
Does not support function overloading
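A tiny sketch of what intl gives you (requires the intl extension; the locale and strings are illustrative):

$collator = new Collator('de_DE');
$words = array('Müller', 'Mueller', 'Muller');
$collator->sort($words);                    // locale-aware ordering

echo grapheme_strlen('déjà vu');            // counts user-perceived characters

$fmt = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
echo $fmt->formatCurrency(1234.56, 'EUR');  // "1.234,56 €"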
MBSTRING
Enables zend_multibyte support
Supports transparent HTTP in/out encoding
Provides some wrappers for functionality such as strtoupper
ICONV
Primarily for charset conversion
Output buffer handler
mime encoding functionality
conversion
some string helpers (len, substr, strpos, strrpos)
Stream Filter stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP')
DATABASES
MySQL: Set the charset and collation on tables, and the charset on the connection (not the collation). Also, don't use the old mysql extension; use mysqli or PDO instead.
PostgreSQL: pg_set_client_encoding()
SQLite(3): Make sure it was compiled with Unicode and intl support
Some other gotchas
You cannot use Unicode filenames with PHP on Windows unless you use a third-party extension.
Send everything in ASCII if you are using exec, proc_open, and other command-line calls.
Plain text is not plain text; files have encodings.
You can convert files on the fly with the iconv filter, as sketched below.
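A sketch of that last point (filenames are illustrative):

$in = fopen('legacy-latin1.txt', 'r');
stream_filter_append($in, 'convert.iconv.ISO-8859-1/UTF-8');
file_put_contents('utf8-copy.txt', stream_get_contents($in));
fclose($in);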
The only thing I would add to these amazing answers is to emphasize saving your files in UTF-8 encoding. I have noticed that browsers pay more attention to the file's actual encoding than to the encoding declared in your code. Any decent text editor will show you this; for example, Notepad++ has a menu option for file encoding that shows the current encoding and lets you change it. For all my PHP files I use UTF-8 without a BOM.
Some time ago, someone asked me to add UTF-8 support to a PHP and MySQL application designed by someone else. I noticed that all files were encoded in ANSI, so I had to use iconv to convert all files, change the database tables to use the UTF-8 character set and utf8_general_ci collation, add 'SET NAMES utf8' to the database abstraction layer after the connection (if using a PHP version before 5.3.6; otherwise you can use charset=utf8 in the connection string), and change the string functions to use their PHP multibyte equivalents.
I recently discovered that using strtolower() can cause issues where the data is truncated after a special character.
The solution was to use
mb_strtolower($string, 'UTF-8');
The mb_ prefix stands for multibyte. These functions support more characters, but in general they are a little slower.
In PHP, you'll need to either use the multibyte functions, or turn on mbstring.func_overload. That way things like strlen will work if you have characters that take more than one byte.
You'll also need to identify the character set of your responses. You can either use Apache's AddDefaultCharset directive, or write PHP code that returns the header. (Or you can add a META tag to your HTML documents.)
I have just gone through the same issue and found a good solution in the PHP manual.
I changed all my files' encoding to UTF-8 and then set the default encoding on my connection. This solved all the problems.
if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
} else {
    printf("Current character set: %s\n", $mysqli->character_set_name());
}
Unicode support in PHP is still a huge mess. While it's capable of converting an ISO-8859-1 string (the single-byte encoding its string functions effectively assume) to UTF-8, it lacks the ability to work with Unicode strings natively, which means all the string processing functions will mangle and corrupt your strings.
So you have to either use a separate library for proper UTF-8 support, or rewrite all the string handling functions yourself.
The easy part is just specifying the charset in HTTP headers and in the database and such, but none of that matters if your PHP code doesn't output valid UTF-8. That's the hard part, and PHP gives you virtually no help there. (I think PHP 6 is supposed to fix the worst of this, but that's still a while away.)
If you want the MySQL server to decide the character set, and not PHP as a client (old behaviour; preferred, in my opinion), try adding skip-character-set-client-handshake to your my.cnf, under [mysqld], and restart MySQL.
This may cause trouble in case you're using anything other than UTF-8.
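That is (a minimal my.cnf excerpt):

[mysqld]
skip-character-set-client-handshake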
The top answer is excellent. Here is what I had to do on a regular Debian, PHP, and MySQL setup:
// Storage
// Debian. Apparently already UTF-8
// Retrieval
// The MySQL database was stored in UTF-8,
// but apparently PHP was requesting ISO 8859-1. This worked:
// ***notice "utf8", without dash, this is a MySQL encoding***
mysql_set_charset('utf8');
// Delivery
// File *php.ini* did not have a default charset,
// (it was commented out, shared host) and
// no HTTP encoding was specified in the Apache headers.
// This made Apache send out a UTF-8 header
// (and perhaps made PHP actually send out UTF-8)
// ***notice "utf-8", with dash, this is a php encoding***
ini_set('default_charset','utf-8');
// Submission
// This worked in all major browsers once Apache
// was sending out the UTF-8 header. I didn’t add
// the accept-charset attribute.
// Processing
// Changed a few commands in PHP, like substr(),
// to mb_substr()
That was all!
In JavaScript, the window.atob() method decodes a Base64 string and the window.btoa() method encodes a string into Base64.
Then why weren't they named like base64Decode() and base64Encode()?
atob() and btoa() don't make sense because they're not semantic at all.
I want to know the reason.
The atob() and btoa() methods allow authors to transform content to and from the base64 encoding.
In these APIs, for mnemonic purposes, the "b" can be considered to stand for "binary", and the "a" for "ASCII". In practice, though, for primarily historical reasons, both the input and output of these functions are Unicode strings.
From: http://www.w3.org/TR/html/webappapis.html#atob
I know this is old, but it recently came up on Twitter, and I thought I'd share it as it is authoritative.
Me:
#BrendanEich did you pick those names?
Him:
Old Unix names, hard to find man pages rn but see https://www.unix.com/man-page/minix/1/btoa/ …. The names carried over from Unix into the Netscape codebase. I reflected them into JS in a big hurry in 1995 (after the ten days in May but soon).
In case the Minix link breaks, here's the man page content:
BTOA(1) BTOA(1)
NAME
btoa - binary to ascii conversion
SYNOPSIS
btoa [-adhor] [infile] [outfile]
OPTIONS
-a Decode, rather than encode, the file
-d Extracts repair file from diagnosis file
-h Help menu is displayed giving the options
-o The obsolete algorithm is used for backward compatibility
-r Repair a damaged file
EXAMPLES
btoa <a.out >a.btoa # Convert a.out to ASCII
btoa -a <a.btoa >a.out    # Reverse the above
DESCRIPTION
Btoa is a filter that converts a binary file to ascii for transmission over a telephone line. If two file names are provided, the first is used for input and the second for output. If only one is provided, it is used as the input file. The program is a functionally similar alternative to uue/uud, but the encoding is completely different. Since both of these are widely used, both have been provided with MINIX. The file is expanded about 25 percent in the process.
SEE ALSO
uue(1), uud(1).
Source: Brendan Eich, the creator of JavaScript. https://twitter.com/BrendanEich/status/998618208725684224
To sum up the already given answers:
atob stands for ASCII to binary
e.g.: atob("ZXhhbXBsZSELCg==") == "example!^K"
btoa stands for binary to ASCII
e.g.: btoa("\x01\x02\xfe\xff") == "AQL+/w=="
Why ASCII and binary:
ASCII (the a) is the result of Base64 encoding: safe text composed only of a subset of ASCII characters (*) that can be correctly represented and transported (e.g. in an email's body),
binary (the b) is any stream of 0s and 1s (in JavaScript it must be represented with a string type).
(*) in base64 these are limited to: A-Z, a-z, 0-9, +, / and = (padding, only at the end) https://en.wikipedia.org/wiki/Base64
P.S. I must admit I myself was initially confused by the naming and thought the names were swapped. I thought that b stood for "base64 encoded string" and a for "any string" :D.
The names come from a Unix utility with similar functionality, but you can already read about that in the other answers here.
Here is my mnemonic to remember which one to use. This doesn't really answer the question itself, but might help people figure out which of the functions to use without keeping a tab open on this question on Stack Overflow all day long.
Beautiful to Awful btoa
Take something Beautiful (aka beautiful content that would make sense to your application: JSON, XML, text, binary data) and transform it into something Awful that cannot be understood as is (aka encoded).
Awful to Beautiful atob
The exact opposite of btoa
Note
Some may say that binary is not beautiful, but hey, this is only a trick to help you.
I can't locate a source at the moment, but it is common knowledge that in this case, the b stands for 'binary', and the a for 'ASCII'.
Therefore, the functions are actually named:
ASCII to Binary for atob(), and
Binary to ASCII for btoa().
Note that this is a browser implementation, left in place for legacy / backwards-compatibility purposes. In Node.js, for example, these don't exist.
Here is an overview of what 8-bit clean means.
In the context of web applications, why are images saved as Base64? There is a 33% overhead associated with being 8-bit clean.
If the transmission method is safe, there is no need for this.
But basically, my images are saved in Base64 on the server and transferred to the client, which as we all know can read Base64.
Here is the client-side version of Base64, in an SO post:
How can you encode a string to Base64 in JavaScript?
Is HTTP/HTTPS 8-bit clean?
Reference
http://www.princeton.edu/~achaney/tmve/wiki100k/docs/8-bit_clean.html
http://en.wikipedia.org/wiki/8-bit_clean
You are asking two different things.
Q: Is HTTP 8-bit clean?
A: Yes, HTTP is "8-bit clean".
Q: In the context of web applications, why are images saved as Base64?
A: Images are not usually saved in Base64. In fact, they almost never are. They are usually saved, transmitted, or streamed in a compressed binary format (PNG, JPG, or similar).
Base64 is used to embed images inside the HTML.
So, you've got an image, logo.png. You include it statically in your page as <img src='logo.png'>. The image is transmitted through HTTP in binary, with no encoding on either the browser or the server side. This is the most common case.
Alternatively, you might decide to embed the contents of the image inside the HTML. It has some advantages: the browser will not need to make a second trip to the server to fetch the image, because it has already received it in the same HTTP GET response as the HTML file. But there are some disadvantages: because HTML files are text, and certain character values have special meaning in HTML (not in HTTP), you cannot just embed the binary values inside the HTML text. You have to encode them to avoid such collisions. The most usual encoding method is Base64, which avoids all the collisions with only a 33% overhead.
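A sketch of the two approaches side by side (the Base64 payload is truncated here):

<!-- External image: fetched in a second HTTP request, transferred as raw binary -->
<img src="logo.png">

<!-- Embedded image: the Base64 payload travels inside the HTML itself -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">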
RFC 2616's abstract states:
A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.
HTTP always starts with a text-only header, and in this header the Content-Type is specified.
As long as sender and receiver agree on this content type, anything is possible.
HTTP relies on a reliable (recognize the wordplay) transport layer such as TCP. HTTPS only adds security to the transport layer (or between the transport layer and HTTP, not sure about this).
So yep, HTTP(S) is 8-bit clean.
In addition to PA's answer and your question "But why use an encoding method that adds 33% overhead when you don't need it?": because that's part of a different concept!
HTTP transfers data of any kind, and the HTTP content may be an HTML file with an embedded picture. But after receiving that HTML file, a browser or some other renderer has to interpret the HTML content, and that follows different standards, which require arbitrary data to be encoded. HTML is not 8-bit clean; in fact, it is not even 7-bit clean, as there are many restrictions on the characters used and their order of appearance.
In the context of web applications, why are images saved as Base64?
There is a 33% overhead associated with being 8 bit clean.
Base64 is used to allow 8-bit binary data to be presented as printable text within the ASCII definition. ASCII is only 7 bits, not 8; the upper 128 byte values depend on the configured encoding (Latin-1, UTF-8, etc.), which means the encoded data could be mangled if a different encoding were set at the client/receiver end than at the source.
As there aren't enough printable characters within ASCII to represent all 8-bit values (which are absolute values, independent of any encoding), you need to "water out the bits": Base64 repacks the data into 6-bit units, keeping the numbers low enough that every unit can be represented as a printable character.
This is the 33% overhead you see: byte values outside the printable range must be shifted to values that are printable within the ASCII table, and packing 8-bit bytes into 6-bit units means every 3 input bytes (24 bits) become 4 output characters. Base64 allows this (you could also use quoted-printable, which was common in the past, e.g. with Usenet, email, etc.).
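You can see the ratio directly in a browser console (JavaScript):

btoa("abc");       // "YWJj": 3 input bytes become 4 output characters
btoa("abcabcabc"); // 9 bytes become 12 characters, i.e. 4/3 ≈ 33% larger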
I'm thinking about writing another encoding type to remove the overhead.
Good luck :-)
Related to the query "Is HTTP 8-bit clean?":
The HTTP protocol is not in its entirety an 8-bit clean protocol.
The HTTP entity body is 8-bit clean, since there is a provision to declare the content type, allowing content negotiation between the interacting entities, as pointed out by everyone in this thread.
However, the request line, the headers, and the status line are not 8-bit clean.
In order to send any binary information as part of:
the request line (as query parameters or path segments), or
the headers,
one must use one of the binary-to-text encodings to preserve the binary values.
For instance, when sending a signature as part of query parameters or headers (as in the signed-URL technique employed by CDNs), the signature, being binary information, has to be encoded to preserve its binary value.
So as an example, when I read the π character (\u03C0) from a File using the FileReader API, I get the pi character back to me when I read it using FileReader.readAsText(blob) which is expected. But when I use FileReader.readAsBinaryString(blob), I get the result \xcf\x80 instead, which doesn't seem to have any visible correlation with the pi character. What's going on? (This probably has something to do with the way UTF-8/16 is encoded...)
FileReader.readAsText takes the encoding of the file into account. In particular, since you have the file encoded in UTF-8, there may be multiple bytes per character. Reading it as text, the UTF-8 is read as it is, and you get your string.
FileReader.readAsBinaryString, on the other hand, does exactly what it says. It reads the file byte by byte. It doesn't recognise multi-byte characters, which in particular is good news for binary files (basically anything except a text file). Since π is a two-byte character, you get the two individual bytes that make it up in your string.
This difference can be seen in many places, in particular when encoding is lost and you see characters like é displayed as Ã©.
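A small sketch of the difference (browser JavaScript; the Blob constructor encodes the string as UTF-8):

var blob = new Blob(['π']); // stored as the two bytes 0xCF 0x80

var reader1 = new FileReader();
reader1.onload = function () { console.log(reader1.result); }; // "π"
reader1.readAsText(blob); // decodes the bytes as UTF-8

var reader2 = new FileReader();
reader2.onload = function () { console.log(reader2.result.length); }; // 2: one character per byte
reader2.readAsBinaryString(blob); // raw bytes, "\xCF\x80"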
Oh well, if that's all you needed... :)
0xCF 0x80 is the UTF-8 encoding for π.
I'm trying to scrape some Japanese websites for a personal project. Sites with text in UTF-8 work perfectly fine, as you'd expect, but I can't get any text out of sites specifying other international encodings, specifically EUC-JP. Node also seems to be interpreting the text and performing modifications rather than passing it on raw - I've tried setting the response to be interpreted as both ascii and binary, and then set my terminal application to EUC-JP, but after doing a console.log(), neither result in the actual text.
I've had a scan through the Node documentation, and it seems to only support two main text encodings (apart from binary and base64.)
I'm using the inbuilt http client, and specifying the encoding through the response.setEncoding method, e.g. response.setEncoding('utf8');
How are other people working with international text in Node (especially with regard to situations where the original data is not in UTF-8?) Are binary buffers the only way?
While I've done a bit of research, I'm not hugely knowledgeable when it comes to character encoding, so simple answers would be appreciated. Thanks!
There is a module that adds iconv bindings to Node.js. If you grab the response as a binary Buffer, you can use Iconv.convert to convert it from EUC-JP to UTF-8 (take a look at the README for an example).
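A minimal sketch of that approach (the host name is a placeholder, and the Iconv API is as described in the module's README):

var http = require('http');
var Iconv = require('iconv').Iconv;

http.get({ host: 'example.jp', path: '/' }, function (res) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); }); // raw Buffers; no setEncoding
    res.on('end', function () {
        var iconv = new Iconv('EUC-JP', 'UTF-8');
        console.log(iconv.convert(Buffer.concat(chunks)).toString('utf8'));
    });
});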