Last Updated on
Before proceeding, I recommend reading Primary Key in InnoDB: how they work and best practices. Here we will discuss about secondary indexes common mistakes and bad practices. While the article is focused on MySQL and the InnoDB storage engine, most of this information applies to any DBMS.
What is an index?
An index is an ordered data structure.
This is true for the most common index type, at least. We will ignore everything except for B+trees – if you use something else, some of the following information does not apply to you. But it is extremely uncommon to use other index types.This is true for the most common index type, at least. We will ignore everything except for B+trees – if you use something else, some of the following information does not apply to you. But it is extremely uncommon to use other index types.
I wrote how I use the phone book analogy to explain how indexes are used, and for what purposes. The slides from my talk MySQL Query Optimisation 101 show a small example of how I use that analogy. Actually they that is just a quick introduction, but still. The idea is that, if you think to a phone book, you will instinctively find out what a DBMS can or cannot do with an index.
Common mistakes and bad practices
With this in mind, we can discuss the most common bad practices and errors I see.
I’ll explain this with the phone book example. All the people living in your city (the rows) are listed ordered by (surname, given name) – that is an index. Now, imagine you have a second index: the same people, ordered by surname, but not first name.
What purpose would it possibly serve? Absolutely none. When I say this during a training, I often hear objections. But really, think about it and you’ll find out by yourself. You already have a list of persons ordered by surname.
But what about an index on the first names? Well, that is not a duplicate. This is only apparently counter-intuitive. The phone book is not ordered by first name.
An index duplicates another if:
- The lists of their columns are identical (obviously);
- Its leftmost columns are identical to the complete list of the other index’ columns.
Identical also means that the columns order is the same. (last_name, first_name) is not the same as (first_name, last_name).
Tools like pt-duplicate-key-checker can be used periodically to find (and remove) duplicate indexes.
Indexes and the primary key
As explain in the previous article, in InnoDB the primary key is appended to all secondary indexes. In other words, if you define an index as (last_name, first_name), probably the actual index will contain (last_name, first_name, id).
As a consequence, this index:
Duplicates the following:
This is not always true. For details, see Paginating the results of an SQL query.
Mistakes to avoid
Indexes should be there to serve certain queries, not just because they could be useful. Writing to a table has a cost, and the more indexes we have, the slower the query is. Even the SQL optimiser – the component that is responsible to find an optimal query execution strategy – will be slower if we have more indexes. Because each time we run a query, it needs to consider all existing indexes to choose the best for the current query.
This is why we want to get rid of duplicate indexes.
But the worst possible situation is when we have an index on every column, because it’s even possible that none of them properly optimises a query.
Expecting the impossible
Indexes allow to:
- Find an exact value (look for Colin Baker in the phone book);
- Find a range of values (look for the persons before Colin Baken, or between Colin Baker and Charlie Brown, or even persons whose surname starts with BA);
- Order results (just read the values in order);
- Group results (count how many persons for each surname).
And of course, combinations of these operations are allowed.
Don’t try to build indexes to solve problems that can’t possibly benefit from an ordered data structure:
- Persons who are not Peter Capaldi;
- Persons whose surname contains/doesn’t contain AK;
- Persons whose surname ends/doesn’t end with R.
A less obvious example of impossible operation for an index, is the following:
WHERE LEFT(last_name, 2) = 'BA'
But we said that we can efficiently search for persons whose name starts with BA, right? Yes… but the SQL optimiser doesn’t know that we are asking that. Because it doesn’t know what
LEFT() means! Every function is just a black box. This is true for all DBMSs. The optimiser is not expected to understand functions.
The order of columns in an index is important for several reasons. We’ll start from the most most commonly understood, but be careful: it’s not the most important.
This discussion cannot cover all cases where an index can or cannot be used. But it shows the most common mistakes that I see in index design.
Can you look for all Bakers whose name is Tom? Yes, you can, it’s fast.
Now, imagine you have the imaginary phone book we already mentioned – ordered by first name, and then last name. Can you find all Tom’s whose family name is Baker?
Again: yes you can, and the order kind of helps. But it’s slower. How many Toms live in your city? They could fill many pages. Once you find Baker, you’ll still have many pages to search, to find Tom.
In more formal terms, the first part of the condition is not as selective as we want it to be. Because there are more surnames than given names – more formally, given names have a lower cardinality.
Ideally, your indexes should contain the highest cardinality columns first.
But before doing that, check the following paragraphs.
In MySQL, the cardinality of an index can be seen with
The leftmost rule
You obviously can use the phone book to find Tom Baker or all Bakers, but not all Toms.
This is the leftmost rule: A query can use a whole index, or its leftmost part.
If our index is on a, b and c, our queries can use the whole index, a and b, or just a. But not c, and not a and c.
The equal + range rule
You can mix equality condition and range conditions in a query. You can easily find all Bakers (equal condition) whose name is greater than Tom (range condition).
But can you find all persons whose family name is smaller than Baker and whose first name is greater than Tom? Not quite. The index will help with the first condition, but not with the second. And the same applies for questions like “persons whose family name is smaller than Baker and equal to Colin”.
So, we use the index columns in order. And for each column we can do an equal search or a range search. But if we do a range search, we cannot use more columns in the index.
Or, less verbosely, we can say that the index usage stops at the first range.
There is nothing wrong about UNIQUE indexes, but there is something that we all should be aware about.
When we write into a table, the change is not applied immediately to the table file. It is buffered instead, into a memory structure called change buffer. In this way, the changes are written in background asynchronously, so they don’t affect the latency and don’t saturate IO. This is perfectly safe, because durability of changes is ensured by transaction logs, not table files.
The problem is that the changes to UNIQUE indexes cannot be buffered. InnoDB cannot assume that all entries of UNIQUE indexes are in memory, so it must immediately try to write the changes on disk and see if they fail.
I am not saying that UNIQUE indexes are wrong or you should be paranoid about them! But before creating them, check if they are useful. Are you creating a UNIQUE index because the applications could try to insert duplicates and you want them to fail – or just because you want to document that a certain column must not have duplicates? I saw the second case some times.
A common misconception is that foreign keys are indexes, and therefore they make JOINs faster. This is wrong, but it’s easy to find out where the error comes from.
First, a problem is the word key. In theory, a key is a component of relational algebra, so it should be thought as purely logical; an index is a data structure that helps performance. But in practice, a primary key is the most important index, and there’s no point in pursuing a mathematically pure language. Those who do tend to produce grothesque literature, see No Such Thing As “Primary Key Tuning” (I have nothing against the author).
Another problem is that when you create a foreign key and an underlying index is missing, it is created automagically. From here, the misunderstanding: sometimes foreign keys are thought to do some obscure magic that a normal index wouldn’t do.
Foreign keys add checks to the linked tables every time a write is performed. In many cases these checks are useless, but InnoDB cannot skip them because it cannot make assumptions on the application logic. But you can. In most cases, running a few smart realistic checks on the application side is faster than adding foreign keys.
Also, don’t assume that foreign key can add security to your application by rejecting some meaningless changes. Every user is free to violate foreign keys in this way:
SET SESSION foreign_key_checks := 0; DELETE FROM payment;
Foreign keys also:
- Are affected by interesting bugs in MySQL;
- Don’t work with some MySQL/MariaDB features;
- Prevent proper online migrations.
I’m not saying that foreign keys should never be used, but be careful about them. If you really want to use them ask your DBAs advise, as they are supposed to deal with the details of database technologies.
Consider the following
*************************** 1. row *************************** id: 1 select_type: SIMPLE table: user partitions: NULL type: range possible_keys: ggx, ggz, i1 key: ggx key_len: 806 ref: NULL rows: 10020 filtered: 42.00 Extra: Using index condition 1 row in set, 1 warning (0.01 sec)
What are indexes
i1? It’s impossible to tell, without checking. This can be annoying. There should be a naming convention for database objects, including indexes.
A good index name would be something like this:
idx indicates that it is a normal index (you may have special suffixes for UNIQUE, FULLTEXT and SPATIAL), and it’s followed by the column names. Using a separator between names (as certain ORMs do) is often a bad idea, as you may hit the name length limit. Also, if you have good column names, there will be no ambiguity.
Indexes are essential for query performance. And fast queries are essential for application performance.
Often indexes are created instinctively – for example, an index on the email column usually looks like a good idea. Other times there is more thorough reasoning behind indexes, but still it’s easy to see the mistakes I listed in this article. I truly hope that this writing will help some people to avoid them.
To check how to find and drop useless indexes, see my next article.
What do you think? Were you aware of the information you’ve just read? Do you disagree on something? Do you know of other common mistakes?
As usual, please comment. I really enjoy when I see discussions, questions, more information, corrections…
Image credit: Wikimedia Commons