Table of Contents
MySQL includes character set support that enables you to store data
using a variety of character sets and perform comparisons according
to a variety of collations. You can specify character sets at the
server, database, table, and column level. MySQL supports the use of
character sets for the MyISAM
,
MEMORY
, NDBCluster
, and
InnoDB
storage engines.
This chapter discusses the following topics:
What are character sets and collations?
The multiple-level default system for character set assignment
Syntax for specifying character sets and collations
Affected functions and operations
Unicode support
The character sets and collations that are available, with notes
Character set issues affect data storage, but also communication
between client programs and the MySQL server. If you want the client
program to communicate with the server using a character set
different from the default, you'll need to indicate which one. For
example, to use the utf8
Unicode character set,
issue this statement after connecting to the server:
SET NAMES 'utf8';
For more information about character set-related issues in client/server communication, see Section 10.4, “Connection Character Sets and Collations”.
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters:
‘A
’,
‘B
’,
‘a
’,
‘b
’. We give each letter a number:
‘A
’ = 0,
‘B
’ = 1,
‘a
’ = 2,
‘b
’ = 3. The letter
‘A
’ is a symbol, the number 0 is
the encoding for
‘A
’, and the combination of all
four letters and their encodings is a
character set.
Suppose that we want to compare two string values,
‘A
’ and
‘B
’. The simplest way to do this is
to look at the encodings: 0 for ‘A
’
and 1 for ‘B
’. Because 0 is less
than 1, we say ‘A
’ is less than
‘B
’. What we've just done is apply
a collation to our character set. The collation is a set of rules
(only one rule in this case): “compare the
encodings.” We call this simplest of all possible
collations a binary collation.
But what if we want to say that the lowercase and uppercase
letters are equivalent? Then we would have at least two rules: (1)
treat the lowercase letters ‘a
’ and
‘b
’ as equivalent to
‘A
’ and
‘B
’; (2) then compare the
encodings. We call this a case-insensitive
collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just
‘A
’ and
‘B
’ but whole alphabets, sometimes
multiple alphabets or eastern writing systems with thousands of
characters, along with many special symbols and punctuation marks.
Also in real life, most collations have many rules, not just for
whether to distinguish lettercase, but also for whether to
distinguish accents (an “accent” is a mark attached
to a character as in German ‘Ö
’),
and for multiple-character mappings (such as the rule that
‘Ö
’ =
‘OE
’ in one of the two German
collations).
MySQL can do these things for you:
Store strings using a variety of character sets
Compare strings using a variety of collations
Mix strings with different character sets or collations in the same server, the same database, or even the same table
Allow specification of character set and collation at any level
In these respects, MySQL is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
The MySQL server can support multiple character sets. To list the
available character sets, use the SHOW CHARACTER
SET
statement. A partial listing follows. For more
complete information, see Section 10.9, “Character Sets and Collations That MySQL Supports”.
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
...
Any given character set always has at least one collation. It may
have several collations. To list the collations for a character
set, use the SHOW COLLATION
statement. For
example, to see the collations for the latin1
(cp1252 West European) character set, use this statement to find
those collation names that begin with latin1
:
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+---------------------+---------+----+---------+----------+---------+
The latin1
collations have the following
meanings:
Collation | Meaning |
latin1_german1_ci | German DIN-1 |
latin1_swedish_ci | Swedish/Finnish |
latin1_danish_ci | Danish/Norwegian |
latin1_german2_ci | German DIN-2 |
latin1_bin | Binary according to latin1 encoding |
latin1_general_ci | Multilingual (Western European) |
latin1_general_cs | Multilingual (ISO Western European), case sensitive |
latin1_spanish_ci | Modern Spanish |
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the
default collation. For example, the
default collation for latin1
is
latin1_swedish_ci
. The output for
SHOW CHARACTER SET
indicates which
collation is the default for each displayed character set.
There is a convention for collation names: They start with the
name of the character set with which they are associated, they
usually include a language name, and they end with
_ci
(case insensitive),
_cs
(case sensitive), or
_bin
(binary).
There are default settings for character sets and collations at four levels: server, database, table, and column. The following description may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.
CHARACTER SET
is used in clauses that specify a
character set. CHARSET
may be used as a synonym
for CHARACTER SET
.
MySQL Server has a server character set and a server collation. These can be set at server startup and changed at runtime.
Initially, the server character set and collation depend on the
options that you use when you start mysqld.
You can use --character-set-server
for the
character set. Along with it, you can add
--collation-server
for the collation. If you
don't specify a character set, that is the same as saying
--character-set-server=latin1
. If you specify
only a character set (for example, latin1
)
but not a collation, that is the same as saying
--character-set-server=latin1
--collation-server=latin1_swedish_ci
because
latin1_swedish_ci
is the default collation
for latin1
. Therefore, the following three
commands all have the same effect:
shell>mysqld
shell>mysqld --character-set-server=latin1
shell>mysqld --character-set-server=latin1 \
--collation-server=latin1_swedish_ci
One way to change the settings is by recompiling. If you want to
change the default server character set and collation when
building from sources, use: --with-charset
and
--with-collation
as arguments for
configure. For example:
shell> ./configure --with-charset=latin1
Or:
shell>./configure --with-charset=latin1 \
--with-collation=latin1_german1_ci
Both mysqld and configure verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.
The current server character set and collation can be determined
from the values of the character_set_server
and collation_server
system variables. These
variables can be changed at runtime.
Every database has a database character set and a database
collation. The CREATE DATABASE
and
ALTER DATABASE
statements have optional
clauses for specifying the database character set and collation:
CREATE DATABASEdb_name
[[DEFAULT] CHARACTER SETcharset_name
] [[DEFAULT] COLLATEcollation_name
] ALTER DATABASEdb_name
[[DEFAULT] CHARACTER SETcharset_name
] [[DEFAULT] COLLATEcollation_name
]
The keyword SCHEMA
can be used instead of
DATABASE
.
All database options are stored in a text file named
db.opt
that can be found in the database
directory.
The CHARACTER SET
and
COLLATE
clauses make it possible to create
databases with different character sets and collations on the
same MySQL server.
Example:
CREATE DATABASE db_name
CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database collation in the following manner:
If both CHARACTER SET
and X
COLLATE
were specified, then
character set Y
X
and collation
Y
.
If CHARACTER SET
was specified without
X
COLLATE
, then character set
X
and its default collation.
If COLLATE
was specified without Y
CHARACTER SET
, then
the character set associated with
Y
and collation
Y
.
Otherwise, the server character set and server collation.
The database character set and collation are used as default
values if the table character set and collation are not
specified in CREATE TABLE
statements. They
have no other purpose.
The character set and collation for the default database can be
determined from the values of the
character_set_database
and
collation_database
system variables. The
server sets these variables whenever the default database
changes. If there is no default database, the variables have the
same value as the corresponding server-level system variables,
character_set_server
and
collation_server
.
Every table has a table character set and a table collation. The
CREATE TABLE
and ALTER
TABLE
statements have optional clauses for specifying
the table character set and collation:
CREATE TABLEtbl_name
(column_list
) [[DEFAULT] CHARACTER SETcharset_name
] [COLLATEcollation_name
]] ALTER TABLEtbl_name
[[DEFAULT] CHARACTER SETcharset_name
] [COLLATEcollation_name
]
Example:
CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation in the following manner:
If both CHARACTER SET
and X
COLLATE
were specified, then
character set Y
X
and collation
Y
.
If CHARACTER SET
was specified without
X
COLLATE
, then character set
X
and its default collation.
If COLLATE
was specified without Y
CHARACTER SET
, then
the character set associated with
Y
and collation
Y
.
Otherwise, the database character set and collation.
The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
Every “character” column (that is, a column of type
CHAR
, VARCHAR
, or
TEXT
) has a column character set and a column
collation. Column definition syntax has optional clauses for
specifying the column character set and collation:
col_name
{CHAR | VARCHAR | TEXT} (col_length
) [CHARACTER SETcharset_name
] [COLLATEcollation_name
]
Example:
CREATE TABLE Table1 ( column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci );
MySQL chooses the column character set and collation in the following manner:
If both CHARACTER SET
and X
COLLATE
were specified, then
character set Y
X
and collation
Y
are used.
If CHARACTER SET
was specified without
X
COLLATE
, then character set
X
and its default collation are
used.
If COLLATE
was specified without Y
CHARACTER SET
, then
the character set associated with
Y
and collation
Y
.
Otherwise, the table character set and collation are used.
The CHARACTER SET
and
COLLATE
clauses are standard SQL.
Every character string literal has a character set and a collation.
A character string literal may have an optional character set
introducer and COLLATE
clause:
[_charset_name
]'string
' [COLLATEcollation_name
]
Examples:
SELECT 'string
'; SELECT _latin1'string
'; SELECT _latin1'string
' COLLATE latin1_danish_ci;
For the simple statement SELECT
'
, the string has
the character set and collation defined by the
string
'character_set_connection
and
collation_connection
system variables.
The _
expression is formally called an
introducer. It tells the parser, “the
string that is about to follow uses character set
charset_name
X
.” Because this has confused
people in the past, we emphasize that an introducer does not
cause any conversion; it is strictly a signal that does not
change the string's value. An introducer is also legal before
standard hex literal and numeric hex literal notation
(x'
and
literal
'0x
)>.
nnnn
Examples:
SELECT _latin1 x'AABBCC'; SELECT _latin1 0xAABBCC;
MySQL determines a literal's character set and collation in the following manner:
If both _X
and COLLATE
were specified, then
character set Y
X
and collation
Y
are used.
If _X
is specified but
COLLATE
is not specified, then character
set X
and its default collation
are used.
Otherwise, the character set and collation given by the
character_set_connection
and
collation_connection
system variables are
used.
Examples:
A string with latin1
character set and
latin1_german1_ci
collation:
SELECT _latin1'Müller' COLLATE latin1_german1_ci;
A string with latin1
character set and
its default collation (that is,
latin1_swedish_ci
):
SELECT _latin1'Müller';
A string with the connection default character set and collation:
SELECT 'Müller';
Character set introducers and the COLLATE
clause are implemented according to standard SQL specifications.
Standard SQL defines NCHAR
or
NATIONAL CHAR
as a way to indicate that a
CHAR
column should use some predefined
character set. MySQL 5.0 uses
utf8
as this predefined character set. For
example, these data type declarations are equivalent:
CHAR(10) CHARACTER SET utf8 NATIONAL CHARACTER(10) NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8 NATIONAL VARCHAR(10) NCHAR VARCHAR(10) NATIONAL CHARACTER VARYING(10) NATIONAL CHAR VARYING(10)
You can use
N'
to
create a string in the national character set. These two
statements are equivalent:
literal
'
SELECT N'some text'; SELECT _utf8'some text';
For information on upgrading character sets to MySQL 5.0 from versions prior to 4.1, see the MySQL 3.23, 4.0, 4.1 Reference Manual.
The following examples show how MySQL determines default character set and collation values.
Example 1: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci ) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
Here we have a column with a latin1
character
set and a latin1_german1_ci
collation. The
definition is explicit, so that's straightforward. Notice that
there is no problem with storing a latin1
column in a latin2
table.
Example 2: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
This time we have a column with a latin1
character set and a default collation. Although it might seem
natural, the default collation is not taken from the table
level. Instead, because the default collation for
latin1
is always
latin1_swedish_ci
, column
c1
has a collation of
latin1_swedish_ci
(not
latin1_danish_ci
).
Example 3: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
We have a column with a default character set and a default
collation. In this circumstance, MySQL checks the table level to
determine the column character set and collation. Consequently,
the character set for column c1
is
latin1
and its collation is
latin1_danish_ci
.
Example 4: Database, Table, and Column Definition
CREATE DATABASE d1 DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci; USE d1; CREATE TABLE t1 ( c1 CHAR(10) );
We create a column without specifying its character set and
collation. We're also not specifying a character set and a
collation at the table level. In this circumstance, MySQL checks
the database level to determine the table settings, which
thereafter become the column settings.) Consequently, the
character set for column c1
is
latin2
and its collation is
latin2_czech_ci
.
Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:
The server character set and collation can be determined from
the values of the character_set_server
and
collation_server
system variables.
The character set and collation of the default database can be
determined from the values of the
character_set_database
and
collation_database
system variables.
Additional character set and collation system variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation system variables.
Consider what a “connection” is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:
What character set is the statement in when it leaves the client?
The server takes the character_set_client
system variable to be the character set in which statements
are sent by the client.
What character set should the server translate a statement to after receiving it?
For this, the server uses the
character_set_connection
and
collation_connection
system variables. It
converts statements sent by the client from
character_set_client
to
character_set_connection
(except for string
literals that have an introducer such as
_latin1
or _utf8
).
collation_connection
is important for
comparisons of literal strings. For comparisons of strings
with column values, collation_connection
does not matter because columns have their own collation,
which has a higher collation precedence.
What character set should the server translate to before shipping result sets or error messages back to the client?
The character_set_results
system variable
indicates the character set in which the server returns query
results to the client. This includes result data such as
column values, and result metadata such as column names.
You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip the rest of this section).
There are two statements that affect the connection character sets:
SET NAMES 'charset_name
' SET CHARACTER SETcharset_name
SET NAMES
indicates what character set the
client will use to send SQL statements to the server. Thus,
SET NAMES 'cp1251'
tells the server
“future incoming messages from this client are in character
set cp1251
.” It also specifies the
character set that the server should use for sending results back
to the client. (For example, it indicates what character set to
use for column values if you use a SELECT
statement.)
A SET NAMES '
statement is equivalent to these three statements:
x
'
SET character_set_client =x
; SET character_set_results =x
; SET character_set_connection =x
;
Setting character_set_connection
to
x
also sets
collation_connection
to the default collation
for x
. To specify one of the character
set's collations explicitly, use the optional
COLLATE
clause:
SET NAMES 'charset_name
' COLLATE 'collation_name
'
SET CHARACTER SET
is similar to SET
NAMES
but sets the connection character set and
collation to be those of the default database. A SET
CHARACTER SET
statement is
equivalent to these three statements:
x
SET character_set_client =x
; SET character_set_results =x
; SET collation_connection = @@collation_database;
Setting collation_connection
also sets
character_set_connection
to the character set
associated with the collation.
When a client connects, it sends to the server the name of the
character set that it wants to use. The server uses the name to
set the character_set_client
,
character_set_results
, and
character_set_connection
system variables. In
effect, the server performs a SET NAMES
operation using the character set name.
With the mysql client, it is not necessary to
execute SET NAMES
every time you start up if
you want to use a character set different from the default. You
can add the --default-character-set
option
setting to your mysql statement line, or in
your option file. For example, the following option file setting
changes the three character set variables set to
koi8r
each time you invoke
mysql:
[mysql] default-character-set=koi8r
Example: Suppose that column1
is defined as
CHAR(5) CHARACTER SET latin2
. If you do not say
SET NAMES
or SET CHARACTER
SET
, then for SELECT column1 FROM t
,
the server sends back all the values for
column1
using the character set that the client
specified when it connected. On the other hand, if you say
SET NAMES 'latin1'
or SET CHARACTER
SET latin1
before issuing the SELECT
statement, the server converts the latin2
values to latin1
just before sending results
back. Conversion may be lossy if there are characters that are not
in both character sets.
If you do not want the server to perform any conversion of result
sets, set character_set_results
to
NULL
:
SET character_set_results = NULL;
Note: Currently, UCS-2 cannot be
used as a client character set, which means that SET
NAMES 'ucs2'
does not work.
To see the values of the character set and collation system variables that apply to your connection, use these statements:
SHOW VARIABLES LIKE 'character_set%'; SHOW VARIABLES LIKE 'collation%';
The following sections various aspects of character set collations.
With the COLLATE
clause, you can override
whatever the default collation is for a comparison.
COLLATE
may be used in various parts of SQL
statements. Here are some examples:
With ORDER BY
:
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
With AS
:
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
With GROUP BY
:
SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
With DISTINCT
:
SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
With WHERE
:
SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
With HAVING
:
SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
The COLLATE
clause has high precedence
(higher than ||
), so the following two
expressions are equivalent:
x || y COLLATE z x || (y COLLATE z)
The BINARY
operator casts the string
following it to a binary string. This is an easy way to force a
comparison to be done byte by byte rather than character by
character. BINARY
also causes trailing spaces
to be significant.
mysql>SELECT 'a' = 'A';
-> 1 mysql>SELECT BINARY 'a' = 'A';
-> 0 mysql>SELECT 'a' = 'a ';
-> 1 mysql>SELECT BINARY 'a' = 'a ';
-> 0
BINARY
is
shorthand for str
CAST(
.
str
AS
BINARY)
The BINARY
attribute in character column
definitions has a different effect. A character column defined
with the BINARY
attribute is assigned the
binary collation of the column's character set. Every character
set has a binary collation. For example, the binary collation
for the latin1
character set is
latin1_bin
, so if the table default character
set is latin1
, these two column definitions
are equivalent:
CHAR(10) BINARY CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
The effect of BINARY
as a column attribute
differs from its effect prior to MySQL 4.1. Formerly,
BINARY
resulted in a column that was treated
as a binary string. A binary string is a string of bytes that
has no character set or collation, which differs from a
non-binary character string that has a binary collation. For
both types of strings, comparisons are based on the numeric
values of the string unit, but for non-binary strings the unit
is the character and some character sets allow multi-byte
characters. Section 11.4.2, “The BINARY
and VARBINARY
Types”.
The use of CHARACTER SET binary
in the
definition of a CHAR
,
VARCHAR
, or TEXT
column
causes the column to be treated as a binary data type. For
example, the following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary BINARY(10) VARCHAR(10) CHARACTER SET binary VARBINARY(10) TEXT CHARACTER SET binary BLOB
In the great majority of statements, it is obvious what
collation MySQL uses to resolve a comparison operation. For
example, in the following cases, it should be clear that the
collation is the collation of column x
:
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should this query use the collation of the column
x
, or of the string literal
'Y'
?
Standard SQL resolves such questions using what used to be
called “coercibility” rules. Basically, this means:
Both x
and 'Y'
have
collations, so which collation takes precedence? This can be
difficult to resolve, but the following rules cover most
situations:
An explicit COLLATE
clause has a
coercibility of 0. (Not coercible at all.)
The concatenation of two strings with different collations has a coercibility of 1.
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
A “system constant” (the string returned by
functions such as USER()
or
VERSION()
) has a coercibility of 3.
A literal's collation has a coercibility of 4.
NULL
or an expression that is derived
from NULL
has a coercibility of 5.
The preceding coercibility values are current as of MySQL 5.0.3.
In MySQL 5.0 prior to 5.0.3, there is no system
constant or ignorable coercibility. Functions such as
USER()
have a coercibility of 2 rather than
3, and literals have a coercibility of 3 rather than 4.
Those rules resolve ambiguities in the following manner:
Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then it is an error if the collations aren't the same.
Examples:
column1 = 'A' | Use collation of column1 |
column1 = 'A' COLLATE x | Use collation of 'A' COLLATE x |
column1 COLLATE x = 'A' COLLATE y | Error |
The COERCIBILITY()
function can be used to
determine the coercibility of a string expression:
mysql>SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0 mysql>SELECT COERCIBILITY(VERSION());
-> 3 mysql>SELECT COERCIBILITY('A');
-> 4
Each character set has one or more collations, but each
collation is associated with one and only one character set.
Therefore, the following statement causes an error message
because the latin2_bin
collation is not legal
with the latin1
character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
Suppose that column X
in table
T
has these latin1
column
values:
Muffler Müller MX Systems MySQL
Suppose also that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name
;
The following table shows the resulting order of the values if
we use ORDER BY
with different collations:
latin1_swedish_ci | latin1_german1_ci | latin1_german2_ci |
Muffler | Muffler | Müller |
MX Systems | Müller | Muffler |
Müller | MX Systems | MX Systems |
MySQL | MySQL | MySQL |
The character that causes the different sort orders in this
example is the U with two dots over it
(ü
), which the Germans call
“U-umlaut.”
The first column shows the result of the
SELECT
using the Swedish/Finnish
collating rule, which says that U-umlaut sorts with Y.
The second column shows the result of the
SELECT
using the German DIN-1 rule, which
says that U-umlaut sorts with U.
The third column shows the result of the
SELECT
using the German DIN-2 rule, which
says that U-umlaut sorts with UE.
This section describes operations that take character set information into account.
MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?
For simple functions that take string input and return a string
result as output, the output's character set and collation are
the same as those of the principal input value. For example,
UPPER(
returns a
string whose character string and collation are the same as that
of X
)X
. The same applies for
INSTR()
, LCASE()
,
LOWER()
, LTRIM()
,
MID()
, REPEAT()
,
REPLACE()
, REVERSE()
,
RIGHT()
, RPAD()
,
RTRIM()
, SOUNDEX()
,
SUBSTRING()
, TRIM()
,
UCASE()
, and UPPER()
.
Note: The REPLACE()
function, unlike all
other functions, always ignores the collation of the string
input and performs a case-sensitive comparison.
If a string input or function result is a binary string, the
string has no character set or collation. This can be check by
using the CHARSET()
and
COLLATION()
functions, both of which return
binary
to indicate that their argument is a
binary string:
mysql> SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');
+---------------------+-----------------------+
| CHARSET(BINARY 'a') | COLLATION(BINARY 'a') |
+---------------------+-----------------------+
| binary | binary |
+---------------------+-----------------------+
For operations that combine multiple string inputs and return a single string output, the “aggregation rules” of standard SQL apply for determining the collation of the result:
If an explicit COLLATE
occurs, use
X
X
.
If explicit COLLATE
and X
COLLATE
occur, raise an
error.
Y
Otherwise, if all collations are
X
, use
X
.
Otherwise, the result has no collation.
For example, with CASE ... WHEN a THEN b WHEN b THEN c
COLLATE
, the
resulting collation is X
ENDX
. The same
applies for UNION
, ||
,
CONCAT()
, ELT()
,
GREATEST()
, IF()
, and
LEAST()
.
For operations that convert to character data, the character set
and collation of the strings that result from the operations are
defined by the character_set_connection
and
collation_connection
system variables. This
applies to CAST()
, CONV()
,
FORMAT()
, HEX()
,
SPACE()
. Before MySQL 5.0.15, it also applies
to CHAR()
.
CONVERT()
provides a way to convert data
between different character sets. The syntax is:
CONVERT(expr
USINGtranscoding_name
)
In MySQL, transcoding names are the same as the corresponding character set names.
Examples:
SELECT CONVERT(_latin1'Müller' USING utf8); INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;
CONVERT(... USING ...)
is implemented
according to the standard SQL specification.
You may also use CAST()
to convert a string
to a different character set. The syntax is:
CAST(character_string
AScharacter_data_type
CHARACTER SETcharset_name
)
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
If you use CAST()
without specifying
CHARACTER SET
, the resulting character set
and collation are defined by the
character_set_connection
and
collation_connection
system variables. If you
use CAST()
with CHARACTER SET
X
, the resulting character set and collation are
X
and the default collation of
X
.
You may not use a COLLATE
clause inside a
CAST()
, but you may use it outside. That is,
CAST(... COLLATE ...)
is illegal, but
CAST(...) COLLATE ...
is legal.
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
Several SHOW
statements provide additional
character set information. These include SHOW CHARACTER
SET
, SHOW COLLATION
, SHOW
CREATE DATABASE
, SHOW CREATE TABLE
and SHOW COLUMNS
. These statements are
described here briefly. For more information, see
Section 13.5.4, “SHOW
Syntax”.
INFORMATION_SCHEMA
has several tables that
contain information similar to that displayed by the
SHOW
statements. For example, the
CHARACTER_SETS
and
COLLATIONS
tables contain the information
displayed by SHOW CHARACTER SET
and
SHOW COLLATION
.
Chapter 20, The INFORMATION_SCHEMA
Database.
The SHOW CHARACTER SET
command shows all
available character sets. It takes an optional
LIKE
clause that indicates which character
set names to match. For example:
mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
+---------+-----------------------------+-------------------+--------+
The output from SHOW COLLATION
includes all
available character sets. It takes an optional
LIKE
clause that indicates which collation
names to match. For example:
mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 0 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 0 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+-------------------+---------+----+---------+----------+---------+
SHOW CREATE DATABASE
displays the
CREATE DATABASE
statement that creates a
given database:
mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------+
| test | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
If no COLLATE
clause is shown, the default
collation for the character set applies.
SHOW CREATE TABLE
is similar, but displays
the CREATE TABLE
statement to create a given
table. The column definitions indicate any character set
specifications, and the table options include character set
information.
The SHOW COLUMNS
statement displays the
collations of a table's columns when invoked as SHOW
FULL COLUMNS
. Columns with CHAR
,
VARCHAR
, or TEXT
data
types have collations. Numeric and other non-character types
have no collation (indicated by NULL
as the
Collation
value). For example:
mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
Field: id
Type: smallint(5) unsigned
Collation: NULL
Null: NO
Key: PRI
Default: NULL
Extra: auto_increment
Privileges: select,insert,update,references
Comment:
*************************** 2. row ***************************
Field: name
Type: char(60)
Collation: latin1_swedish_ci
Null: NO
Key:
Default:
Extra:
Privileges: select,insert,update,references
Comment:
The character set is not part of the display but is implied by the collation name.
MySQL 5.0 supports two character sets for storing Unicode data:
ucs2
, the UCS-2 Unicode character set.
utf8
, the UTF-8 encoding of the Unicode
character set.
In UCS-2 (binary Unicode representation), every character is
represented by a two-byte Unicode code with the most significant
byte first. For example: LATIN CAPITAL LETTER A
has the code 0x0041
and it is stored as a
two-byte sequence: 0x00 0x41
. CYRILLIC
SMALL LETTER YERU
(Unicode 0x044B
) is
stored as a two-byte sequence: 0x04 0x4B
. For
Unicode characters and their codes, please refer to the
Unicode Home Page.
Currently, UCS-2 cannot be used as a client character set, which
means that SET NAMES 'ucs2'
does not work.
The UTF-8 character set (transform Unicode representation) is an alternative way to store Unicode data. It is implemented according to RFC 3629. The idea of the UTF-8 character set is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte sequences.
RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL support for UTF-8 does not include four-byte sequences. (An older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
Tip: To save space with UTF-8,
use VARCHAR
instead of CHAR
.
Otherwise, MySQL must reserve three bytes for each character in a
CHAR CHARACTER SET utf8
column because that is
the maximum possible length. For example, MySQL must reserve 30
bytes for a CHAR(10) CHARACTER SET utf8
column.
Metadata is “the data about the
data.” Anything that describes the
database — as opposed to being the
contents of the database — is metadata.
Thus column names, database names, usernames, version names, and
most of the string results from SHOW
are
metadata. This is also true of the contents of tables in
INFORMATION_SCHEMA
, because those tables by
definition contain information about database objects.
Representation of metadata must satisfy these requirements:
All metadata must be in the same character set. Otherwise,
neither the SHOW
commands nor
SELECT
statements for tables in
INFORMATION_SCHEMA
would work properly
because different rows in the same column of the results of
these operations would be in different character sets.
Metadata must include all characters in all languages. Otherwise, users would not be able to name columns and tables using their own languages.
To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.
The metadata requirements mean that the return values of the
USER()
, CURRENT_USER()
,
SESSION_USER()
,
SYSTEM_USER()
, DATABASE()
,
and VERSION()
functions have the UTF-8
character set by default.
The server sets the character_set_system
system
variable to the name of the metadata character set:
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| character_set_system | utf8 |
+----------------------+-------+
Storage of metadata using Unicode does not
mean that the server returns headers of columns and the results of
DESCRIBE
functions in the
character_set_system
character set by default.
When you use SELECT column1 FROM t
, the name
column1
itself is returned from the server to
the client in the character set determined by the value of the
character_set_results
system variable, which
has a default value of latin1
. If you want the
server to pass metadata results back in a different character set,
use the SET NAMES
statement to force the server
to perform character set conversion. SET NAMES
sets the character_set_results
and other
related system variables. (See
Section 10.4, “Connection Character Sets and Collations”.) Alternatively, a client
program can perform the conversion after receiving the result from
the server. It is more efficient for the client perform the
conversion, but this option is not always available for all
clients.
If character_set_results
is set to
NULL
, no conversion is performed and the server
returns metadata using its original character set (the set
indicated by character_set_system
).
Error messages returned from the server to the client are converted to the client character set automatically, as with metadata.
If you are using (for example) the USER()
function for comparison or assignment within a single statement,
don't worry. MySQL performs some automatic conversion for you.
SELECT * FROM Table1 WHERE USER() = latin1_column;
This works because the contents of
latin1_column
are automatically converted to
UTF-8 before the comparison.
INSERT INTO Table1 (latin1_column) SELECT USER();
This works because the contents of USER()
are
automatically converted to latin1
before the
assignment. Automatic conversion is not fully implemented yet, but
should work correctly in a later version.
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.
You can always list the available character sets and their default
collations with the SHOW CHARACTER SET
statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset | Description | Default collation |
+----------+-----------------------------+---------------------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci |
| dec8 | DEC West European | dec8_swedish_ci |
| cp850 | DOS West European | cp850_general_ci |
| hp8 | HP West European | hp8_english_ci |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
| latin1 | cp1252 West European | latin1_swedish_ci |
| latin2 | ISO 8859-2 Central European | latin2_general_ci |
| swe7 | 7bit Swedish | swe7_swedish_ci |
| ascii | US ASCII | ascii_general_ci |
| ujis | EUC-JP Japanese | ujis_japanese_ci |
| sjis | Shift-JIS Japanese | sjis_japanese_ci |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
| tis620 | TIS620 Thai | tis620_thai_ci |
| euckr | EUC-KR Korean | euckr_korean_ci |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
| greek | ISO 8859-7 Greek | greek_general_ci |
| cp1250 | Windows Central European | cp1250_general_ci |
| gbk | GBK Simplified Chinese | gbk_chinese_ci |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
| utf8 | UTF-8 Unicode | utf8_general_ci |
| ucs2 | UCS-2 Unicode | ucs2_general_ci |
| cp866 | DOS Russian | cp866_general_ci |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
| macce | Mac Central European | macce_general_ci |
| macroman | Mac West European | macroman_general_ci |
| cp852 | DOS Central European | cp852_general_ci |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
| cp1251 | Windows Cyrillic | cp1251_general_ci |
| cp1256 | Windows Arabic | cp1256_general_ci |
| cp1257 | Windows Baltic | cp1257_general_ci |
| binary | Binary pseudo charset | binary |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.
ucs2
(UCS-2 Unicode) collations:
ucs2_bin
ucs2_czech_ci
ucs2_danish_ci
ucs2_esperanto_ci
ucs2_estonian_ci
ucs2_general_ci
(default)
ucs2_hungarian_ci
ucs2_icelandic_ci
ucs2_latvian_ci
ucs2_lithuanian_ci
ucs2_persian_ci
ucs2_polish_ci
ucs2_roman_ci
ucs2_romanian_ci
ucs2_slovak_ci
ucs2_slovenian_ci
ucs2_spanish2_ci
ucs2_spanish_ci
ucs2_swedish_ci
ucs2_turkish_ci
ucs2_unicode_ci
utf8
(UTF-8 Unicode) collations:
utf8_bin
utf8_czech_ci
utf8_danish_ci
utf8_esperanto_ci
utf8_estonian_ci
utf8_general_ci
(default)
utf8_hungarian_ci
utf8_icelandic_ci
utf8_latvian_ci
utf8_lithuanian_ci
utf8_persian_ci
utf8_polish_ci
utf8_roman_ci
utf8_romanian_ci
utf8_slovak_ci
utf8_slovenian_ci
utf8_spanish2_ci
utf8_spanish_ci
utf8_swedish_ci
utf8_turkish_ci
utf8_unicode_ci
Note that in the ucs2_roman_ci
and
utf8_roman_ci
collations,
I
and J
compare as equals,
and U
and V
compare as
equals.
The ucs2_esperanto_ci
and
utf8_esperanto_ci
collations were added in
MySQL 5.0.13. The ucs2_hungarian_ci
and
utf8_hungarian_ci
collations were added in
MySQL 5.0.19.
MySQL implements the utf8_unicode_ci
collation according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The following discussion uses
utf8_unicode_ci
, but it is also true for
ucs2_unicode_ci
.
Currently, the utf8_unicode_ci
collation has
only partial support for the Unicode Collation Algorithm. Some
characters are not supported yet. Also, combining marks are not
fully supported. This affects primarily Vietnamese and some
minority languages in Russia such as Udmurt, Tatar, Bashkir, and
Mari.
The most significant feature in
utf8_unicode_ci
is that it supports
expansions; that is, when one character compares as equal to
combinations of other characters. For example, in German and
some other languages ‘ß
’ is
equal to ‘ss
’.
utf8_general_ci
is a legacy collation that
does not support expansions. It can make only one-to-one
comparisons between characters. This means that comparisons for
the utf8_general_ci
collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci
.
For example, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
:
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
:
ß = ss
MySQL implements language-specific collations for the
utf8
character set only if the ordering with
utf8_unicode_ci
does not work well for a
language. For example, utf8_unicode_ci
works
fine for German and French, so there is no need to create
special utf8
collations for these two
languages.
utf8_general_ci
also is satisfactory for both
German and French, except that
‘ß
’ is equal to
‘s
’, and not to
‘ss
’. If this is acceptable for
your application, then you should use
utf8_general_ci
because it is faster.
Otherwise, use utf8_unicode_ci
because it is
more accurate.
utf8_swedish_ci
, like other
utf8
language-specific collations, is derived
from utf8_unicode_ci
with additional language
rules. For example, in Swedish, the following relationship
holds, which is not something expected by a German or French
speaker:
Ü = Y < Ö
The utf8_spanish_ci
and
utf8_spanish2_ci
collations correspond to
modern Spanish and traditional Spanish, respectively. In both
collations, ‘ñ
’ (n-tilde) is a
separate letter between ‘n
’ and
‘o
’. In addition, for traditional
Spanish, ‘ch
’ is a separate
letter between ‘c
’ and
‘d
’, and
‘ll
’ is a separate letter between
‘l
’ and
‘m
’
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
ascii
(US ASCII) collations:
ascii_bin
ascii_general_ci
(default)
cp850
(DOS West European) collations:
cp850_bin
cp850_general_ci
(default)
dec8
(DEC Western European) collations:
dec8_bin
dec8_swedish_ci
(default)
hp8
(HP Western European) collations:
hp8_bin
hp8_english_ci
(default)
latin1
(cp1252 West European) collations:
latin1_bin
latin1_danish_ci
latin1_general_ci
latin1_general_cs
latin1_german1_ci
latin1_german2_ci
latin1_spanish_ci
latin1_swedish_ci
(default)
latin1
is the default character set.
MySQL's latin1
is the same as the Windows
cp1252
character set. This means it is
the same as the official ISO 8859-1
or
IANA (Internet Assigned Numbers Authority)
latin1
, but IANA
latin1
treats the code points between
0x80
and 0x9f
as
“undefined,” whereas cp1252
,
and therefore MySQL's latin1
, assign
characters for those positions. For example,
0x80
is the Euro sign. For the
“undefined” entries in
cp1252
, MySQL translates
0x81
to Unicode
0x0081
, 0x8d
to
0x008d
, 0x8f
to
0x008f
, 0x90
to
0x0090
, and 0x9d
to
0x009d
.
The latin1_swedish_ci
collation is the
default that probably is used by the majority of MySQL
customers. Although it is frequently said that it is based
on the Swedish/Finnish collation rules, there are Swedes and
Finns who disagree with this statement.
The latin1_german1_ci
and
latin1_german2_ci
collations are based on
the DIN-1 and DIN-2 standards, where DIN stands for
Deutsches Institut für
Normung (the German equivalent of ANSI).
DIN-1 is called the “dictionary collation” and
DIN-2 is called the “phone book collation.”
latin1_german1_ci
(dictionary) rules:
Ä = A Ö = O Ü = U ß = s
latin1_german2_ci
(phone-book) rules:
Ä = AE Ö = OE Ü = UE ß = ss
In the latin1_spanish_ci
collation,
‘ñ
’ (n-tilde) is a separate
letter between ‘n
’ and
‘o
’.
macroman
(Mac West European) collations:
macroman_bin
macroman_general_ci
(default)
swe7
(7bit Swedish) collations:
swe7_bin
swe7_swedish_ci
(default)
MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.
cp1250
(Windows Central European)
collations:
cp1250_bin
cp1250_croatian_ci
cp1250_czech_cs
cp1250_general_ci
(default)
cp852
(DOS Central European) collations:
cp852_bin
cp852_general_ci
(default)
keybcs2
(DOS Kamenicky Czech-Slovak)
collations:
keybcs2_bin
keybcs2_general_ci
(default)
latin2
(ISO 8859-2 Central European)
collations:
latin2_bin
latin2_croatian_ci
latin2_czech_cs
latin2_general_ci
(default)
latin2_hungarian_ci
macce
(Mac Central European) collations:
macce_bin
macce_general_ci
(default)
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.
armscii8
(ARMSCII-8 Armenian) collations:
armscii8_bin
armscii8_general_ci
(default)
cp1256
(Windows Arabic) collations:
cp1256_bin
cp1256_general_ci
(default)
geostd8
(GEOSTD8 Georgian) collations:
geostd8_bin
geostd8_general_ci
(default)
greek
(ISO 8859-7 Greek) collations:
greek_bin
greek_general_ci
(default)
hebrew
(ISO 8859-8 Hebrew) collations:
hebrew_bin
hebrew_general_ci
(default)
latin5
(ISO 8859-9 Turkish) collations:
latin5_bin
latin5_turkish_ci
(default)
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.
cp1257
(Windows Baltic) collations:
cp1257_bin
cp1257_general_ci
(default)
cp1257_lithuanian_ci
latin7
(ISO 8859-13 Baltic) collations:
latin7_bin
latin7_estonian_cs
latin7_general_ci
(default)
latin7_general_cs
The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, and Ukrainian languages.
cp1251
(Windows Cyrillic) collations:
cp1251_bin
cp1251_bulgarian_ci
cp1251_general_ci
(default)
cp1251_general_cs
cp1251_ukrainian_ci
cp866
(DOS Russian) collations:
cp866_bin
cp866_general_ci
(default)
koi8r
(KOI8-R Relcom Russian) collations:
koi8r_bin
koi8r_general_ci
(default)
koi8u
(KOI8-U Ukrainian) collations:
koi8u_bin
koi8u_general_ci
(default)
The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of different
characters. See Section 10.9.7.1, “The cp932
Character Set”, for additional
information about the cp932
and
sjis
character sets.
big5
(Big5 Traditional Chinese)
collations:
big5_bin
big5_chinese_ci
(default)
cp932
(SJIS for Windows Japanese)
collations:
cp932_bin
cp932_japanese_ci
(default)
eucjpms
(UJIS for Windows Japanese)
collations:
eucjpms_bin
eucjpms_japanese_ci
(default)
euckr
(EUC-KR Korean) collations:
euckr_bin
euckr_korean_ci
(default)
gb2312
(GB2312 Simplified Chinese)
collations:
gb2312_bin
gb2312_chinese_ci
(default)
gbk
(GBK Simplified Chinese) collations:
gbk_bin
gbk_chinese_ci
(default)
sjis
(Shift-JIS Japanese) collations:
sjis_bin
sjis_japanese_ci
(default)
tis620
(TIS620 Thai) collations:
tis620_bin
tis620_thai_ci
(default)
ujis
(EUC-JP Japanese) collations:
ujis_bin
ujis_japanese_ci
(default)
Why is cp932
needed?
In MySQL, the sjis
character set
corresponds to the Shift_JIS
character set
defined by IANA, which supports JIS X0201 and JIS X0208
characters. (See
http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to Shift_JIS
that are
defined by various vendors.
For example, “SHIFT JIS” used in Japanese Windows
environments is a Microsoft extension of
Shift_JIS
and its exact name is
Microsoft Windows Codepage : 932
or
cp932
. In addition to the characters
supported by Shift_JIS
,
cp932
supports extension characters such as
NEC special characters, NEC selected — IBM extended
characters, and IBM extended characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted via Unicode
(ucs2
).
The sjis
character set does not support
the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932
character set is designed
to solve these problems. It is available as of MySQL 5.0.3.
Because MySQL supports character set conversion, it is
important to separate IANA Shift_JIS
and
cp932
into two different character sets
because they provide different conversion rules.
How does cp932
differ
from sjis
?
The cp932
character set differs from
sjis
in the following ways:
cp932
supports NEC special characters,
NEC selected — IBM extended characters, and IBM
selected characters.
Some cp932
characters have two
different code points, both of which convert to the same
Unicode code point. When converting from Unicode back to
cp932
, one of the code points must be
selected. For this “round trip conversion,”
the rule recommended by Microsoft is used. (See
http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.
The table shown at
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
provides information about the Unicode values of
cp932
characters. For
cp932
table entries with characters
under which a four-digit number appears, the number
represents the corresponding Unicode
(ucs2
) encoding. For table entries with
an underlined two-digit value appears, there is a range of
cp932
character values that begin with
those two digits. Clicking such a table entry takes you to
a page that displays the Unicode value for each of the
cp932
characters that begin with those
digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
NEC selected — IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
Starting from version 5.0.3, cp932
supports conversion of user-defined characters in
combination with eucjpms
, and solves
the problems with
sjis
/ujis
conversion. For details, please refer to
http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from
ucs2
is different for
sjis
and cp932
. The
following tables illustrate these differences.
Conversion to ucs2
:
sjis /cp932
Value | sjis ->
ucs2 Conversion | cp932 ->
ucs2 Conversion |
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2
:
ucs2 value | ucs2 ->
sjis Conversion | ucs2 ->
cp932 Conversion |
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |
Users of any Japanese character sets should be aware that
using --character-set-client-handshake
(or
--skip-character-set-client-handshake
) has an
important effect. See Section 5.2.1, “mysqld Command Options”.
This Frequently-Asked-Questions section comes from the experiences of MySQL's Support and Development groups, after handling many enquiries about CJK (Chinese Japanese Korean) issues.
You inserted CJK characters with INSERT
, but
when you do a SELECT
, they all look like
“?”. It usually is a setting in MySQL that doesn't
match the settings for the application program or the operating
system. These are common troubleshooting steps:
Find out: what version do you have? The statement
SELECT VERSION();
will tell you. This FAQ
is for MySQL version 5, so some of the answers here will not
apply to you if you have version 4.0 or 4.1.
Find out: what character set is the database column really
in? Too frequently, people think that the character set will
be the same as the server's set (false), or the set used for
display purposes (false). Make sure, by saying SHOW
CREATE TABLE tablename
, or better yet by saying
this:
SELECT character_set_name, collation_name FROM information_schema.columns WHERE table_schema = your_database_name AND table_name = your_table_name AND column_name = your_column_name;
Find out: what is the hexadecimal value?
SELECT HEX(your_column_name) FROM your_table_name;
If you see 3F
, then that really is the
encoding for ?
, so no wonder you see
“?”. Probably this happened because of a
problem converting a particular character from your client
character set to the target character set.
Find out: is a literal round trip possible, that is, if you
select “literal” (or “_introducer
hexadecimal-value”) do you get “literal”
as a result? For example, with the Japanese Katakana Letter
Pe, which looks like ペ'
, and which
exists in all CJK character sets, and which has the code
point value (hexadecimal coding) 0x30da
,
enter:
SELECT 'ペ' AS `ペ`; /* or SELECT _ucs2 0x30da; */
If the result doesn't look like ペ
, a
round trip failed. For bug reports, we might ask people to
follow up with SELECT hex('ペ');
. Then
we can see whether the client encoding is right.
Find out: is it the browser or application? Just use mysql (the MySQL client program, which on Windows will be mysql.exe). If mysql displays correctly but your application doesn't, then your problem is probably “Settings”, but consult also the question about “Troubles with Access (or Perl) (or PHP) (etc.)” much later in this FAQ.
To find your settings, the statement you need here is
SHOW VARIABLES
. For example:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.03 sec)
The above are typical character-set settings for an
international-oriented client (notice the use of
utf8
Unicode) connected to a server in
the West (latin1
is a West Europe
character set and a default for MySQL).
Although Unicode (usually the utf8
variant on Unix, usually the ucs2
variant
on Windows) is better than “latin”, it's often
not what your operating system utilities support best. Many
Windows users find that a Microsoft character set, such as
cp932
for Japanese Windows, is what's
suitable.
If you can't control the server settings, and you have no
idea what your underlying computer is about, then try
changing to a common character set for the country that
you're in (euckr
= Korea,
gb2312
or gbk
=
People's Republic of China, big5
= other
China, sjis
or ujis
or
cp932
or eucjpms
=
Japan, ucs2
or utf8
=
anywhere). Usually it is only necessary to change the client
and connection and results settings, and there is a simple
statement which changes all three at once, namely
SET NAMES
. For example:
SET NAMES 'big5';
Once you get the correct setting, you can make it permanent
by editing my.cnf
or
my.ini
. For example you might add lines
looking like this:
[mysqld] character-set-server=big5 [client] default-character-set=big5
MySQL supports the two common variants of the GB (“Guojia
Biaozhun” or “National Standard”) character
sets which are official in the People's Republic of China:
gb2312
and gbk
. Sometimes
people try to insert gbk
characters into
gb2312
, and it works most of the time because
gbk
is a superset of gb2312
.
But eventually they try to insert a rarer Chinese character and it
doesn't work. (Example: Bug#16072 in our bugs database,
http://bugs.mysql.com/bug.php?id=16072). So we'll
try to clarify here exactly what characters are legitimate in
gb2312
or gbk
, with
reference to the official documents. Please check these references
before reporting gb2312
or
gbk
bugs. We now have a graphic listing of the
gbk
characters, currently on the site of Mr
Alexander Barkov (MySQL's principal programmer for character set
issues). The chart is in order according to the
gb2312_chinese_ci
collation:
http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html.
MySQL's gbk
is in reality “Microsoft code
page 936”. This differs from the official
gbk
for characters A1A4
(middle dot), A1AA
(em dash),
A6E0-A6F5
, and A8BB-A8C0
.
For a listing of the differences, see
http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h.
For a listing of gbk/Unicode mappings, see
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.
For MySQL's listing of gbk characters, see
http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html.
MySQL supports the Big5 character set which is common in Hong Kong
and the Republic of China (Taiwan). MySQL's
big5
is in reality “Microsoft code page
950”, which is very similar to the original
big5
character set. This is a recent change,
starting with MySQL version 4.1.16 / 5.0.16. We made the change as
a result of a bug report, Bug#12476 in our bugs database,
http://bugs.mysql.com/bug.php?id=12476 (title:
“Some big5 codes are still missing ...”). For
example, the following statements work in the current version of
MySQL, but not in old versions:
mysql>create table big5 (big5 char(1) character set big5);
Query OK, 0 rows affected (0.13 sec) mysql>insert into big5 values (0xf9dc);
Query OK, 1 row affected (0.00 sec) mysql>select * from big5;
+------+ | big5 | +------+ | 嫺 | +------+ 1 row in set (0.02 sec)
There is a feature request for adding HKSCS extensions (Bug#13577 in our bugs database, http://bugs.mysql.com/bug.php?id=13577). People who need the extension may find the suggested patch for Bug#13577 is of interest.
MySQL supports the sjis
,
ujis
, cp932
, and eucjpms
character sets, as well as Unicode. A common need is to convert
between character sets. For example, there might be a Unix server
(typically with sjis
or
ujis
) and a Windows client (typically with
cp932
). But conversions can seem to fail.
Here's why. In this conversion table, the ucs2
column is the source, and the
sjis
/cp932
/ujis
/eucjpms
columns are the destination, that is, what the hexadecimal result
would be if we used CONVERT(ucs2)
or if we
assigned a ucs2
column containing the value to
an
sjis
/cp932
/ujis
/eucjpms
column.
character name ucs2 sjis cp932 ujis eucjpms -------------- ---- ---- ---- ---- ------- BROKEN BAR 00A6 3F 3F 8FA2C3 3F FULLWIDTH BROKEN BAR FFE4 3F FA55 3F 8FA2 YEN SIGN 00A5 3F 3F 20 3F FULLWIDTH YEN SIGN FFE5 818F 818F A1EF 3F TILDE 007E 7E 7E 7E 7E OVERLINE 203E 3F 3F 20 3F HORIZONTAL BAR 2015 815C 815C A1BD A1BD EM DASH 2014 3F 3F 3F 3F REVERSE SOLIDUS 005C 815F 5C 5C 5C FULLWIDTH "" FF3C 3F 815F 3F A1C0 WAVE DASH 301C 8160 3F A1C1 3F FULLWIDTH TILDE FF5E 3F 8160 3F A1C1 DOUBLE VERTICAL LINE 2016 8161 3F A1C2 3F PARALLEL TO 2225 3F 8161 3F A1C2 MINUS SIGN 2212 817C 3F A1DD 3F FULLWIDTH HYPHEN-MINUS FF0D 3F 817C 3F A1DD CENT SIGN 00A2 8191 3F A1F1 3F FULLWIDTH CENT SIGN FFE0 3F 8191 3F A1F1 POUND SIGN 00A3 8192 3F A1F2 3F FULLWIDTH POUND SIGN FFE1 3F 8192 3F A1F2 NOT SIGN 00AC 81CA 3F A2CC 3F FULLWIDTH NOT SIGN FFE2 3F 81CA 3F A2CC
For example, consider this extract from the table:
ucs2 sjis cp932 ---- ---- ----- NOT SIGN 00AC 81CA 3F FULLWIDTH NOT SIGN FFE2 3F 81CA
It means “for NOT SIGN which is Unicode U+00AC, MySQL
converts to sjis code point 0x81CA and to cp932 code point
3F”. (3F
is question mark
(“?”) and is what we always use when we can't
convert.) Now, what should we do if we want to convert
sjis 81CA
to cp932
? Our
answer is: “?”. There are serious complaints about
this, many people would prefer a “loose” conversion,
so that 81CA (NOT SIGN)
in
sjis
becomes 81CA (FULLWIDTH NOT
SIGN)
in cp932
. We are considering
changing.
In SJIS the code for Yen Sign (¥
) is
5C
. In SJIS the code for Reverse Solidus
(\
) is 5C
. Since the above
statements are contradictory, confusion often results. Well, to
put it more seriously, some versions of Japanese character sets
(both sjis
and euc
) have
treated 5C
as a reverse solidus, also known as
a backslash, and others have treated it as a yen sign. There's
nothing we can do, except take sides: MySQL follows only one
version of the JIS (Japanese Industrial Standards) standard
description, and 5C is Reverse Solidus,
always. Should we make a separate character set where
5C
is Yen Sign, as another DBMS (Oracle) does?
We haven't decided. Certainly not in version 5.1 or 5.2. But if
people keep complaining about The Great Yen Sign Problem, that's
one possible solution.
MySQL supports the euckr
(Extended Unix Code
Korea) character set which is common in South Korea. In theory,
problems could arise because there have been several versions of
this character set. So far, only one problem has been noted, for
Korea's currency symbol. We use the “ASCII” variant
of EUC-KR, in which the code point 0x5c
is
REVERSE SOLIDUS, that is \
, instead of the
“KS-Roman” variant of EUC-KR, in which the code point
0x5c
is WON SIGN, that is “₩”.
You can't convert Unicode U+20A9
WON SIGN to
euckr
:
mysql>SELECT CONVERT('₩' USING euckr) AS euckr,
->HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+ | euckr | hexeuckr | +-------+----------+ | ? | 3F | +-------+----------+ 1 row in set (0.00 sec)
MySQL's graphic Korean chart is here: http://d.udm.net/bar/~bar/charts/euckr_korean_ci.html.
For illustration, we'll make a table with one Unicode
(ucs2
) column and one Chinese
(gb2312
) column.
mysql>CREATE TABLE ch
->(ucs2 CHAR(3) CHARACTER SET ucs2,
->gb2312 CHAR(3) CHARACTER SET gb2312);
Query OK, 0 rows affected (0.05 sec)
We'll try to place the rare character 汌
in
both columns.
mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec)
Ah, there's a warning. Let's see what it is.
mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
+---------+------+---------------------------------------------+
1 row in set (0.00 sec)
So it's a warning about the gb2312 column only.
mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch; +-------+--------------+--------+-------------+ | ucs2 | HEX(ucs2) | gb2312 | HEX(gb2312) | +-------+--------------+--------+-------------+ | A汌B | 00416C4C0042 | A?B | 413F42 | +-------+--------------+--------+-------------+ 1 row in set (0.00 sec)
There are several things that need explanation here.
The fact that it's a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.
The 汌
character isn't in the
gb2312
character set. We described that
problem earlier.
Admittedly the message is misleading. We didn't “truncate” in this case, we replaced with a question mark. We've had a complaint about this message (Bug#9337). But until we come up with something better, just accept that error/warning code 2165 can mean a variety of things.
With SQL_MODE=TRADITIONAL
, there would be
an error message, but instead of error 2165 you would see:
ERROR 1406 (22001): Data too long for column
'gb2312' at row 1
.
You can't get things to look right with your special program for a
GUI front end or browser? Get a direct connection to the server
(with mysql on Unix or with
mysql.exe on Windows) and try the same query
there. If mysql is okay, then the trouble is probably that your
application interface needs some initializing. Use
mysql to tell you what character set(s) it
uses, by saying SHOW VARIABLES LIKE 'char%';
.
If it's Access, you're probably connecting with MyODBC. So you'll
want to check out the Reference Manual page for configuring an
ODBC DSN, and pay attention particularly to the illustrations for
“SQL command on connect”. You should enter
SET NAMES 'big5'
(supposing that you use
big5
) (you don't need a ;
here). If it's ASP, you might need to add SET
NAMES
in the code. Here is an example that has worked in
the past:
<%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=yourserver;uid=yourusername;" \
& "pwd=yourpassword;database=yourdatabase;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject(“ADODB.Connection”)
Conn.Open strConnection
%>
If it's PHP, here's a slightly different user suggestion:
<?php $link = mysql_connect($host,$usr,$pwd); mysql_select_db($db); if (mysql_error()) { print "Database ERROR: " . mysql_error(); } mysql_query("SET CHARACTER SET utf8", $link); mysql_query("SET NAMES 'utf8'", $link); ?>
In this case, the tipper used SET CHARACTER SET
statement to change character_set_client
and
character_set_result
, and used SET
NAMES
to change character_set_client
and character_set_connection
and
character_set_results
. So actually the
SET CHARACTER SET
statement is redundant.
(Incidentally, MySQL people encourage the use of the
mysqli
extension, rather than the
mysql
example that this example uses.) Another
thing to check with PHP is the browser assumptions. Sometimes a
meta tag change in the heading area suffices, for example:
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
For Connector/J tips, see the manual section in the Connectors chapter titled “Using Character Sets and Unicode”.
In the old days, with MySQL Version 4.0, there was a single “global” character set for both server and client sides, and the decision was made by the server administrator. We changed that starting with MySQL Version 4.1. What happens now is a “handshake”. The MySQL Reference Manual describes it thus:
When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the
character_set_client
,character_set_results
, andcharacter_set_connection
system variables. In effect, the server performs aSET NAMES
operation using the character set name.
The effect of this is: you can't control the client character set
by saying mysqld --character-set-server=utf8
.
But some Asian customers said that they don't like that, they want
the MySQL 4.0 behaviour. So we added a mysqld
switch, --character-set-client-handshake
, which
(and this is the interesting part) can be turned off with
--skip-character-set-client-handshake
. If you
start mysqld with
--skip-character-set-client-handshake
, then the
behaviour is like this: When a client connects, it sends to the
server the name of the character set that it wants to use. The
server ignores it! Here is an illustration with the handshake
switch on or off. Pretend that your favourite server character set
is latin1
(of course that's unlikely in a CJK
area but it's MySQL's default if there's no
my.ini
or my.cnf
file).
Pretend that the client operates with utf8
because that's what the client's operating system supports. Start
the server with a default character set,
latin1
:
mysqld --character-set-server=latin1
Start the client with a default character set,
utf8
:
mysql --default-character-set=utf8
Show what the current settings are:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
Stop the client. Stop the server with mysqladmin. Start the server again but this time say “skip the handshake”:
mysqld --character-set-server=utf8 --skip-character-set-client-handshake
Start the client with a default character set,
utf8
, again. Show what the current settings
are, again:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
As you can see by comparing the SHOW VARIABLES
results, the server ignores the client's initial settings if the
--skip-character-set-client-handshake
is used.
There is a simple problem with LIKE
searches on
BINARY
and BLOB
columns: we
need to know the end of a character. With multi-byte character
sets, different characters might have different octet lengths. For
example, in utf8
, A
requires
one byte but ペ
requires three bytes.
Illustration:
+-------------------------+---------------------------+ | octet_length(_utf8 'A') | octet_length(_utf8 'ペ') | +-------------------------+---------------------------+ | 1 | 3 | +-------------------------+---------------------------+ 1 row in set (0.00 sec)
If we don't know where the first character ends, then we don't
know where the second character begins, and even simple-looking
searches like LIKE '_A%'
will fail. The
solution is to use a regular CJK character set in the first place,
or convert to a CJK character character set before comparing.
Incidentally, this is one reason why MySQL cannot allow encodings
of nonexistent characters: It must be strict about rejecting bad
input, or it won't know where characters end. There is a simple
problem with FULLTEXT
: we need to know the end
of a word. With Western writing this is rarely a problem because
there are spaces between words. With Asian writing this is not the
case. We could use half-good solutions, like saying that all Han
characters represent words, or depending on (Japanese) changes
from Katakana to Hiragana which are due to grammatical endings.
But the only good solution requires a dictionary, and we haven't
found a good open-source dictionary.
The list of CJK character sets may vary depending on version. For
example, the eucjpms
character set is a recent
addition. But the language name appears in the
DESCRIPTION
column for every entry in
information_schema.character_sets
. Therefore,
to get a current list of all the non-Unicode CJK character sets,
say:
mysql>SELECT character_set_name, description
->FROM information_schema.character_sets
->WHERE description LIKE '%Chinese%'
->OR description LIKE '%Japanese%'
->OR description LIKE '%Korean%'
->ORDER BY character_set_name;
+--------------------+---------------------------+ | character_set_name | description | +--------------------+---------------------------+ | big5 | Big5 Traditional Chinese | | cp932 | SJIS for Windows Japanese | | eucjpms | UJIS for Windows Japanese | | euckr | EUC-KR Korean | | gb2312 | GB2312 Simplified Chinese | | gbk | GBK Simplified Chinese | | sjis | Shift-JIS Japanese | | ujis | EUC-JP Japanese | +--------------------+---------------------------+ 8 rows in set (0.01 sec)
The majority of everyday-use Chinese/Japanese characters (simplified Chinese and basic non-halfwidth Kana Japanese) appear in all CJK character sets. Here is a stored procedure which accepts a UCS-2 Unicode character, converts it to all other character sets, and displays the results in hexadecimal.
DELIMITER // CREATE PROCEDURE p_convert (ucs2_char CHAR(1) CHARACTER SET ucs2) BEGIN CREATE TABLE tj (ucs2 CHAR(1) character set ucs2, utf8 CHAR(1) character set utf8, big5 CHAR(1) character set big5, cp932 CHAR(1) character set cp932, eucjpms CHAR(1) character set eucjpms, euckr CHAR(1) character set euckr, gb2312 CHAR(1) character set gb2312, gbk CHAR(1) character set gbk, sjis CHAR(1) character set sjis, ujis CHAR(1) character set ujis); INSERT INTO tj (ucs2) VALUES (ucs2_char); UPDATE tj SET utf8=ucs2, big5=ucs2, cp932=ucs2, eucjpms=ucs2, euckr=ucs2, gb2312=ucs2, gbk=ucs2, sjis=ucs2, ujis=ucs2; /* If there's a conversion problem, UPDATE will produce a warning. */ SELECT hex(ucs2) AS ucs2, hex(utf8) AS utf8, hex(big5) AS big5, hex(cp932) AS cp932, hex(eucjpms) AS eucjpms, hex(euckr) AS euckr, hex(gb2312) AS gb2312, hex(gbk) AS gbk, hex(sjis) AS sjis, hex(ujis) AS ujis FROM tj; DROP TABLE tj; END//
The input can be any single ucs2
character, or
it can be the code point value (hexadecimal representation) of
that character. Here's an example of what
P_CONVERT()
can do. An earlier answer said
that the character “Katakana Letter Pe” appears in
all CJK character sets. We know that the code point value of
Katakana Letter Pe is 0x30da
. (By the way, we
got the name from Unicode's list of ucs2 encodings and names:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.)
So we'll say:
mysql> CALL P_CONVERT(0x30da)//
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8 | big5 | cp932 | eucjpms | euckr | gb2312 | gbk | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379 | A5DA | ABDA | A5DA | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
1 row in set (0.04 sec)
Since none of the column values is 3F
, we know
that every conversion worked.
Sometimes people observe that the result of a
utf8_unicode_ci
or
ucs2_unicode_ci
search or ORDER
BY
sort is not what they think a native would expect.
Although we never rule out the chance that there is a bug, we have
found in the past that people are not correctly reading the
standard table of weights for the Unicode Collation Algorithm. So,
here's how to check whether we're using the right collation. The
correct table for MySQL is this one:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
This is different from the first table you will find by navigating
from the unicode.org
home page. MySQL
deliberately uses the older 4.0.0 “allkeys” table,
instead of the current 4.1.0 table. We are very wary about
changing ordering which affects indexes. Here is an example of a
problem that we handled recently, for a complaint in our bugs
database, http://bugs.mysql.com/bug.php?id=16526:
mysql>CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
Query OK, 0 rows affected (0.05 sec) mysql>INSERT INTO tj VALUES ('が'),('か');
Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql>SELECT * FROM tj WHERE s1 = 'か';
+------+ | s1 | +------+ | が | | か | +------+ 2 rows in set (0.00 sec)
If your eyes are sharp, you'll see that the character in the first
result row isn't the one that we searched for. Why did MySQL
retrieve it? First we look for the Unicode code point value, which
is possible by reading the hexadecimal number for the
ucs2
version of the characters:
mysql> SELECT s1,HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1 | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が | 304C |
| か | 304B |
+------+-----------------------------+
2 rows in set (0.03 sec)
Now let's search for 304B
and
304C
in the 4.0.0 allkeys table. We'll find
these lines:
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA 304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
The official Unicode names (following the “#” mark)
are informative; they tell us the Japanese syllabary (Hiragana),
the informal classification (letter instead of digit or
punctuation), and the Western identifier (KA
or
GA
, which happen to be voiced/unvoiced
components of the same letter pair). More importantly, the Primary
Weight (the first hexadecimal number inside the square brackets)
is 1E57
on both lines. For comparisons in both
searching and sorting, MySQL pays attention only to the Primary
Weight, it ignores all the other numbers. So now we know that
we're sorting が
and か
correctly according to the Unicode specification. If we wanted to
distinguish them, we'd have to use a
non-Unicode-Collation-Algorithm collation
(utf8_unicode_bin
or
utf8_general_ci
), or compare the
HEX()
values, or say ORDER BY
CONVERT(s1 USING sjis)
. Being correct “according
to Unicode” isn't enough, of course: the person who
submitted the bug was equally correct. We plan to add another
collation for Japanese according to the JIS X 4061 standard, where
voiced/unvoiced letters like KA/GA are distinguishable for
ordering purposes.
You're using Unicode (ucs2
or
utf8
), and you know what the Unicode sort order
is (see the previous question and answer), but MySQL still seems
to sort your table wrong? This might be easy.
mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Hmm, the character set looks okay. Let's look at the
information_schema
for this column.
mysql>SELECT column_name, character_set_name, collation_name
->FROM information_schema.columns
->WHERE column_name = 's1'
->AND table_name = 't';
+-------------+--------------------+-----------------+ | column_name | character_set_name | collation_name | +-------------+--------------------+-----------------+ | s1 | ucs2 | ucs2_general_ci | +-------------+--------------------+-----------------+ 1 row in set (0.01 sec)
Oops, the collation is ucs2_general_ci
instead
of ucs2_unicode_ci
! Here's why:
mysql> SHOW CHARSET LIKE 'ucs2%';
+---------+---------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)
For ucs2
and utf8
, the
“general” collation is the default. To specify that
you wanted a “unicode” collation, you should have
specified COLLATE ucs2_unicode_ci
.
Right. MySQL doesn't support supplementary characters (characters
which need more than 3 bytes with UTF-8). We support only what
Unicode calls the Basic Multilingual Plane / Plane
0. Only a few very rare Han characters are
supplementary; support for them is uncommon. This has led to Bug#12600 (http://bugs.mysql.com/bug.php?id=12600)
which we rejected as “not a bug”. With
utf8
, we must truncate an input string when we
encounter bytes that we don't understand. Otherwise, we wouldn't
know how long the bad multi-byte character is. A workaround is: if
you use ucs2
instead of
utf8
, then the bad characters will change to
question marks, but there will be no truncation. Or change the
data type to BLOB
or BINARY
,
which have no validity checking. In our bugs database, Bug#14052
(http://bugs.mysql.com/bug.php?id=14052) is a
feature request for Wikipedia, asking us to support supplementary
characters extending ucs2
as well as
utf8
.
No. The term CJKV (Chinese Japanese Korean Vietnamese) refers to character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters. Another question that has come up (once) is a request for specialized Vietnamese collation, see http://bugs.mysql.com/bug.php?id=4745. We might do something about it someday, if many more requests arise.
Yes. We're changing the names of files and directories. Here's an
example, using mysql as root
under Linux:
Create a table with a name containing a Han character:
mysql> CREATE TABLE tab_楮 (s1 INT);
Query OK, 0 rows affected (0.07 sec)
Find out where MySQL stores database files:
mysql> SHOW VARIABLES LIKE 'datadir';
+---------------+-----------------------+
| Variable_name | Value |
+---------------+-----------------------+
| datadir | /usr/local/mysql/var/ |
+---------------+-----------------------+
1 row in set (0.00 sec)
Look at the directory to see the MyISAM table files:
# cd /usr/local/mysql/var/dba # dir tab_* -rw-rw---- 1 root root 0 2006-05-16 10:22 tab_@696e.MYD -rw-rw---- 1 root root 1024 2006-05-16 10:22 tab_@696e.MYI -rw-rw---- 1 root root 8556 2006-05-16 10:22 tab_@696e.frm
Notice that MySQL has converted the Han character to
@
+ (Unicode value of Han character), that is,
to a purely ASCII representation. This solves an old problem, that
database files weren't portable, because some computers wouldn't
allow 楮
in a file name. Conversion to the new
file names will be automatic when you upgrade to version 5.1. This
should take care of Bug#6313 in our bugs database,
http://bugs.mysql.com/bug.php?id=6313.
A Beijing-based group has produced a Simplified Chinese version for us under contract. It's complete and can be found on http://dev.mysql.com/doc/#chinese-5.1. It's up to date as of version 5.1.2. The Japanese manual can be downloaded from http://dev.mysql.com/doc/#japanese-4.1. It is still for version 4.1.
Check http://dev.mysql.com/user-groups/ to see if there is a MySQL user group near you. If there isn't: why not start one yourself? To contact a sales engineer in MySQL KK's Japan office:
Tel: +81(0)3-5326-3133 Fax: +81(0)3-5326-3001 Email: dsaito@mysql.com
To see feature requests about language issues:
Go to http://bugs.mysql.com.
Click
.
In the Severity dropdown box, click
S4 (Feature Request)
.
In the list box beside Category, click
Character Sets
.
Click the
button.You can post CJK questions, or see previous answers, on MySQL's “Character Sets, Collation, Unicode” forum: http://forums.mysql.com/list.php?103. MySQL plans to add native-language forums on http://forums.mysql.com/ very soon.