Last Updated on
One of the trickiest problems in SQL is to query rows in tranches. This is typically needed in a website search, or in a dynamic catalogue where products appear as you scroll down. You want the DBMS to only return the results relative to a certain tranche.
The problem: OFFSET
From a strictly functional point of view, pagination is not really a challenge. The syntax may vary a lot depending on the DBMS, but with MySQL and PostgreSQL you can do something like this:
SELECT some_columns FROM some_table WHERE user_wants_to_see_this_product = 1 AND i_agree = 1 ORDER BY product_name LIMIT 10 OFFSET 30;
This example assumes that we show 10 results per tranche and we are visualising the third tranche.
The problem is that the OFFSET clause tells the DBMS to skip the first 30 rows. But this has two important implication:
- If there is not a suitable index to sort the rows, the DBMS has to copy all the values into a in-memory (or sometimes even on-disk) temporary table and order them. This is the first problem to solve, before talking about any other optimisation.
- Even if the DBMS is supposed not to return those rows, it has to read them.
When you first develop such a feature, chances are that you don’t have many rows in your table and the number of queries per second is low. But companies grow or fail. And if they grow, their products usage grow with them. How many rows will there be in one years, and how many queries per second? How many in two years?
My suggestion is to develop this kind of things right the first time. If your team didn’t, probably you’re reading this post because you reached the point where a refactoring is necessary.
EDIT: MySQL 5.6 and MariaDB 10.0 implemented an optimisation for
ORDER BY with a small
LIMIT. It uses a priority queue to order the rows in memory without reading them all. Note that if
OFFSET is big, the query will still be expensive.
EDIT: Another problem with
ORDER BY ... LIMIT is that the optimiser doesn’t properly keep them into account when deciding the table order for a
JOIN. This is mentioned in MDEV-8306, meaning that MariaDB at some point should fix the issue.
1,997,221 results found
Do you inform your users about how many results match their search? Is that number high? Well, I got uncomfortable news for you:
The number of results that match a search has no meaning for users.
And queries like these can be very expensive:
SELECT COUNT(*) FROM some_table; -- MySQL-specific: SELECT SQL_CALC_FOUND_ROWS COUNT(*) FROM some_table LIMIT 20;
The reason is that, to count rows, the DBMS has to read them all.
If you really want to give some information about the number of found results, you can do something like this:
SELECT COUNT(*) FROM ( SELECT id FROM some_table LIMIT 1001 );
In this way, the DBMS will not read more than 1001. If it returns 1001, you can print something like:
More than 1000 results found
Optimising pagination, sorted by a unique value
The easiest case is when you need to sort rows by a unique column. For example, when you list users by email or VAT code.
What we can do, in this case, is to remember the last value we showed in the current page. The next page will start from the next value.
SELECT some_columns FROM some_table WHERE <some_condition> AND email > 'firstname.lastname@example.org' ORDER BY email LIMIT 20;
In this way, InnoDB will quickly find the last showed value by traversing an index, and then it will read and return the next rows. There is no need to read all previous values.
Optimising pagination, sorted by a non-unique value
If the column that we use for sorting is not unique, we cannot use the query shown above. For example, if we order by
last_name, we could have several pages full of persons whose last name is Johnson.
The solution is to order rows by that column, plus the primary key. Each combination is unique, so this serves the purpose.
As explained in Primary Key in InnoDB: how they work and best practices, each secondary index contains the primary key. Incidentally, despite what I wrote in Indexes bad practices, this is a good reason to have indexes like
(last_name, first_name), even if formally the first is a duplicate of the second.
So, in this case, the query to use would be:
SELECT some_columns FROM some_table WHERE <some_condition> AND last_name >= 'Johnson' AND NOT (last_name = 'Johnson' AND id < 666) ORDER BY last_name, id LIMIT 20;
Note that the condition on
last_name uses the
>= operator, otherwise we could leave some rows out. But this includes the Johnsons we already showed (and we’d prefer to forget one of them), so we need the second condition to exclude them.
There is a less known equivalent syntax that we may prefer:
WHERE (last_name, id) > ('Johnson', 666) -- MySQL, SQLite WHERE ROW(last_name, id) > ROW('Johnson', 666) -- PostgreSQL
EDIT 25th June 2019: This originally contained a bug, thanks to Rick James for reporting it in the comment. See also his link.
What if items are inserted/deleted?
This problem is better discussed below, as it is harder to deal with if we want to cache some values. for now, let’s only consider which effects modifications have on a simple user search.
Maybe I’m visiting page 1. While I’m reading the page, an operator adds a product that should appear in this page, as first results. When I’ll visit page 2, I’ll see again one of the items in this page.
Or maybe a product I see is being deleted. Now the last result of this page should be the first of the nest one, but it’s not. When I go to page 2, I’ll skip an item.
The question is: does it matter? Maybe. But for most websites, I would say it doesn’t. There can always be inconsistencies when we use a search. It happens with storage engines, and no one complains too much. In any cases, there are inconsistencies that cannot possibly be fixed. If I click on the product that is being deleted, I will get an error like “this product does not exist”. How to solve this problem? In my opinion, the best thing to do is not solving it. Everything else could be an overengineering exercise.
Directly jumping to page X, or the last page
If a catalogue products apprear while the user scrolls down, we are giving them only one choice: seeing the next products. We already discussed how to handle this.
But in a paginated search, usually the user has the following links: first, previous, next, last, some page numbers. You can see an example at the bottom of Google result pages.
The query to show the first page is trivial:
SELECT some_columns FROM some_table WHERE <some_condition> ORDER BY last_name, id LIMIT 20;
The pages already visited (previous) can be remembered somehow, I will delegate this problem to developers, as the solution can be dependent on the used framework.
What about the last? And what about jumping to page 9? In these cases, we don’t know the last value of the previous page.
Do we really want to allow the users to skip to page 9? What is the purpose?
I can think only one possible purpose: skipping many results. This usually happens if the users knows more or less what they want, but they couldn’t find quickly enough a search functionality to find what they want – for example, persons whose last name starts with ‘C’. This problem shouldn’t be solved by adding imprecise navigation links that cause performance problems. This is a problem for UI designers.
So in my opinion, an acceptable solution is to only provide the links to the first page, the previous, the next, and the pages that the user already visited.
Solution 2: cache the values
What we need is to know the values that precede each page after the first. We can cache these values in a dedicated table or in some technology like memcached.
But keep in mind that keeping those values up to date can be a hard problem. When an item is added or deleted, the previous values should change. This operation could be slow, especially if we allow the users to order results by several column and in both directions. Even more if we allow to filter results.
Batching the changes during the night when the site traffic is low would probably solve the performance problem. But not all websites can do this. Also, sometimes an item deletion must happen immediately and cannot be postponed (maybe the product was inserted by mistake and doesn’t exist).
Discussing a good solution is beyond the purpose of this article.
As always, I’d like to hear your thoughts. Do you disagree on something? Do you have thoughts regarding the problems I mentioned without proposing a solution?
Photo credit: Steven Johnson
(not related to the above examples, as far as I know)