Case sensitivity in MySQL and MariaDB queries

A "woman" android searching a text (probably... to be fair, the image is a bit abstract!)
Searching a text is not a simple task!

Maybe you’re wondering why in MySQL/MariaDB 'string' seems to be the same as 'STRING'. Or maybe that’s not the case for you, but you would like to make a case insensitive search. This article explains how to write a case sensitive or a case insensitive query.

The wrong way

I’ve seen terrible queries doing something like these:

SELECT * FROM book WHERE LOWER(title) = 'solaris';
SELECT * FROM book WHERE UPPER(title) = 'SOLARIS';

Why are they terrible? Because if we pass a column to a function, MySQL (or any other DBMS) cannot use an index to speed up the query. It will have to run a full table scan and apply the LOWER() function to all values it finds.

I believe that these queries are often produced by ORMs.

How it really works

In MySQL string comparison can be case sensitive or case insensitive. This normally depends on the collation used.

Collations

A collation is a set of rules associated to a character set (utf8. latin1…) to determine the order of characters, and if two characters are equal. Equal characters could be for example ‘a’ and ‘├í’, or ‘a’ and ‘A’ – in the latter care, a collation is case insensitive.

mysql> SHOW COLLATION;
+----------------------------+----------+-----+---------+----------+---------+---------------+
| Collation                  | Charset  | Id  | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+----------+-----+---------+----------+---------+---------------+
| armscii8_bin               | armscii8 |  64 |         | Yes      |       1 | PAD SPACE     |
| armscii8_general_ci        | armscii8 |  32 | Yes     | Yes      |       1 | PAD SPACE     |
| ascii_bin                  | ascii    |  65 |         | Yes      |       1 | PAD SPACE     |
| ascii_general_ci           | ascii    |  11 | Yes     | Yes      |       1 | PAD SPACE     |
| big5_bin                   | big5     |  84 |         | Yes      |       1 | PAD SPACE     |
| big5_chinese_ci            | big5     |   1 | Yes     | Yes      |       1 | PAD SPACE     |
| binary                     | binary   |  63 | Yes     | Yes      |       1 | NO PAD        |
...
+----------------------------+----------+-----+---------+----------+---------+---------------+
272 rows in set (0.01 sec)

As you can see:

  • Each collation is associated to a specific character set.
  • Case-sensitive collations have the _bin or (less frequently) _cs suffix, case-insensitive collations have the _ci suffix.
    • The special binary character set and collation simply treat a string as a sequence of bytes.
  • Every character set has a default collation. All default collations are case-insensitive.
    • For the sake of completeness, I couldn’t find this information in the documentation; so in the future they could implement character sets with binary default collations.
    • In MySQL, but not in MariaDB, the default collation for utf8mb4 can be configured by the user. I discourage you from doing so, because this practice is weird (you can do it only for one character set) and error-prone.

Changing the collation

In MySQL, a collation can be set at column level. In other words, each collation may use a different collation (or even different character sets), so this is written in the column definition.

However, most of the times we use the same collation for the whole table, if not for the whole database, or even all databases. So we have the following hierarchy:

  • A column has a collation that can optionally be assigned with CREATE TABLE or ALTER TABLE. By default, it is the table collation.
  • A table has a collation that can optionally be assigned with CREATE TABLE or ALTER TABLE. By default, it is the database collation.
  • Each database has a collation that can optionally be assigned with CREATE DATABASE or ALTER DATABASE. By default, it is the collation_server server variable.

collation_server, the database collation and the table collation have the only purpose of setting a default collation for the columns, that doesn’t apply if they are created in this way:

CREATE TABLE alphabet (
    letter CHAR(1) NOT NULL
    COLLATE utf8mb4_esperanto_ci
);

So the the default collation depends on collation_server. If it wasn’t changed, its value is utf8mb4_0900_ai_ci in MySQL and latin1_swedish_ci in MariaDB.

The strings that we write in a query between quotes (literal strings) depend on the collation_connection variable. In both MySQL and MariaDB it is latin1_swedish_ci.

Comparing strings

Now, let’s try some comparisons examples.

Default comparisons

With this collation, all comparisons are case insensitive:

mysql> SELECT 'x' LIKE 'X' AS ci_like, 'x' = 'X' AS ci_eq, 'x' < 'X' ci_lt, 'x' > 'X' AS ci_gt;
+---------+-------+-------+-------+
| ci_like | ci_eq | ci_lt | ci_gt |
+---------+-------+-------+-------+
|       1 |     1 |     0 |     0 |
+---------+-------+-------+-------+
1 row in set (0.00 sec)

As expected:

  • Letters with different cases match LIKE comparisons.
  • Letters with different cases match = comparisons.
  • A lowercase letter is not less or greater than the corresponding uppercase letter.

Forcing case sensitive comparisons

It is possible to convert any string to the binary collation using the BINARY operator:

mysql> SELECT COLLATION('x') AS simple_string, COLLATION(BINARY 'a') AS binary_operator;
+-------------------+-----------------+
| simple_string     | binary_operator |
+-------------------+-----------------+
| latin1_swedish_ci | binary          |
+-------------------+-----------------+
1 row in set (0.00 sec)

Let’s try it in practice:

mysql> CREATE TABLE alphabet (
    ->     letter CHAR(1) NOT NULL
    ->         COLLATE utf8mb4_bin,
    ->     PRIMARY KEY (letter)
    -> );
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO alphabet VALUES ('a'), ('A'), ('b'), ('B');
Query OK, 4 rows affected (0.01 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM alphabet WHERE letter = 'a';
+--------+
| letter |
+--------+
| a      |
+--------+
1 row in set (0.00 sec)

Jolly good: it does what we expected!

Forcing a case-insensitive search

The opposite can also be done, but be sure to read the section Indexes, later.

mysql> CREATE TABLE alphabet (
    ->     letter CHAR(1) NOT NULL
    ->         COLLATE utf8mb4_bin,
    ->     INDEX idx_letter (letter)
    -> );
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO alphabet VALUES ('a'), ('A'), ('b'), ('B');
Query OK, 4 rows affected (0.01 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM alphabet WHERE CONVERT(letter USING utf8mb4) = 'a';
+--------+
| letter |
+--------+
| A      |
| a      |
+--------+
2 rows in set (0.00 sec)

How did we do exactly? Well, MySQL doesn’t have a function to change the collation of a string (and if you understood what collation are, you understand that the idea itself would make no sense). But we can use CONVERT() to change a string’s character set – and as we said before, all character sets default collations are case insensitive. Therefore, the query above performs a case insensitive search.

Indexes

UNIQUE indexes

UNIQUE indexes on a column with a case sensitive collation allows to insert both ‘a’ and ‘A’. If the collation is case insensitive, this is not allowed.

Query plan

For queries that don’t force a case sensitive or case insensitive collation, indexes can be used normally (be sure to understand which LIKE queries can use indexes, anyway).

For queries that force a case sensitive or case insensitive search, the rule is that an index can only be used if we don’t apply BINARY or CONVERT() to the indexed column. Therefore:

  • WHERE band = BINARY 'Jethro Tull' can use an index on column;
  • WHERE BINARY band = 'Jethro Tull' cannot user a index, so use the first form instead;
  • queries that force a case insensitive search cannot use an index.

Reference

See also

Related courses:

Conclusions

While PostgreSQL has different operators for case sensitive and insensitive searches (LIKE, ILIKE), in MySQL the case sensitivity of a query depends on the collations involved – that can be converted with BINARY or CONVERT(). This happens because MySQL allows to specify a different and Maricharacter set or collation for all individual columns or literal strings. This also affect integrity checks like UNIQUE.

Did you notice any mistake? Do you have ideas to contribute?
Please comment!

As usual, I’ll be happy to fix errors and discuss your ideas.

Toodle pip,
Federico´╗┐

Comments (2)

  1. Very useful, thank you for writing up and sharing your knowledge!

    Presumably, the default collation in MySQL was latin1_swedish_ci at some point, like it is still in MariaDB. I don’t know when this changed, though? Also, as stated in your link to the MariaDB KB, the default collation in MariaDB on Debian is utf8mb4_general_ci. Not sure if this is also the case in downstream distros?

    1. Hi Karl!
      Well, that’s a good thing to remember: defaults change over time, and every distribution can have different defaults, so it’s better to check.
      Plus, I only install MariaDB or MySQL from official repositories or Docker images.
      Anyway, the default character sets and collations changed in MySQL 8, when they made utf8mb4 faster.

Leave a Reply

Your email address will not be published. Required fields are marked *