Returning random rows in a random order from a relational database is a very hard problem. In this article we will see an antipattern, we will discuss why this problem is so hard, and we’ll examine some imperfect solutions.
UPDATE: Some DBMSs support TABLESAMPLE. I did not mention it because until today I only knew the SQL Server implementation, which I don’t consider good for extracting random rows (I’ll explain why). But after a question on Facebook by Daniël van Eeden I did some research, and I discovered that the PostgreSQL implementation could be much more suitable. I’ll read more about it and try it, and then I will probably write a second article about it. Anyway, MySQL and MariaDB don’t support TABLESAMPLE.
The wrong way: ORDER BY RAND()
Many developers think they can do it in this way:
SELECT some_columns FROM some_table ORDER BY RAND() LIMIT 10;
This is the only query that does exactly what is expected: if the RAND() function is reliably random, it returns a random set of rows, where each row has the same chance of being returned.
The problem is how it works under the hood. To order random values, one has to create them first. To create a random value for each row in a table, the DBMS has to read each row. So the DBMS performs a two-step operation:
- Copy the rows plus a random value to a temporary table, hopefully in memory;
- Order the rows in the temporary table.
Note the detail: hopefully in memory. Depending on the DBMS, there can be reasons why the table needs to be created on disk. Typically, this happens if the table is too big. Old MySQL versions also created temporary tables on disk if they contained TEXT or BLOB columns.
Needless to say, if the table is big this is a slow operation that consumes a lot of resources.
Why don’t relational DBMSs provide a built-in way to return random rows?
Relational databases are built around a branch of mathematics called relational algebra. It is based on logic, because it deals with true/false predicates; and it is similar to set theory, even if a relation is something more complex than a set, and even seeing it as a table is not mathematically correct. This article is a nice example of what I mean, but there are more complex arguments.
Despite SQL not being a precise implementation of the relational model, its algebra is used as a reference. Like all branches of mathematics, it has a well-defined set of operations. Getting random tuples from a relation is definitely not one of them.
This answer sounds a bit too theoretical, right?
ORDER BY and tables without a primary key also violate the rules of relational algebra, but SQL supports them.
However, for databases, speed is vital. Relational DBMSs are designed to perform relational operations very quickly. Obviously, a data structure designed to optimise certain operations is not optimised for others. So, relational DBMSs are not optimised for random searches.
It should be obvious that looping over the rows to choose some of them randomly would be slow. Theoretically, reading a random row using an index would be as fast as finding a row by its primary key value. But then, different rows would have different chances of being chosen.
The above image is from Jeremy Cole’s GitHub repository, innodb_diagrams. Sooner or later I should write about his projects to analyse InnoDB data structures.
What is important here is that an index is a tree whose nodes are memory pages. The leaf nodes contain actual data; the upper levels only contain information needed to choose the shortest path to the desired leaf node. Details are not important in this context.
The DBMS could theoretically choose a random page by performing a series of random decisions: will I go left or right? But then, leaf pages contain a variable number of rows. They could even be empty. So, different rows would have different chances to be selected.
The result is clearly outside of DBMSs’ intended use cases; and it would be slow or wrong, depending on whether an index is used. DBMSs rightly chose not to implement this feature.
There are no perfect solutions. There are solutions that can be acceptable but have drawbacks. We should choose one from case to case.
Another way: WHERE
This solution is applicable to tables with an auto-increment primary key. It is the easiest to implement, but a “fragmented” primary key causes problems.
It consists in checking the minimum id, the maximum id, and generating a random number in that range:
SET @min := (SELECT MIN(id) FROM some_table);
SET @max := (SELECT MAX(id) FROM some_table);
SELECT some_columns
    FROM some_table
    WHERE id = FLOOR(@min + RAND() * (@max - @min));
Note that queries with MIN() or MAX() on an indexed column, without other clauses, are very fast.
The result is a truly random row. To choose more than one row, we can repeat this query multiple times. This leads us to two drawbacks:
- If this operation is performed often, we could run this query a lot of times.
- There is no guarantee that the same row isn’t selected multiple times. If this happens, the application may retry the query. If the table is big, this event is rare and we can ignore it – unless the RAND() function is unreliable. But for small tables, this can be a problem.
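This retry approach is easy to sketch. The following demo uses Python’s standard sqlite3 module as a stand-in for MySQL (the table name, size, and columns are invented for the example, and RAND() becomes Python’s random module on the application side):

```python
import random
import sqlite3

# A made-up table with a dense auto-increment primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO some_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)

def pick_random_rows(conn, how_many):
    """Pick distinct random rows by id, retrying on holes and duplicates."""
    (lo,) = conn.execute("SELECT MIN(id) FROM some_table").fetchone()
    (hi,) = conn.execute("SELECT MAX(id) FROM some_table").fetchone()
    chosen = {}
    while len(chosen) < how_many:
        candidate = random.randint(lo, hi)
        row = conn.execute(
            "SELECT id, payload FROM some_table WHERE id = ?", (candidate,)
        ).fetchone()
        # Retry when the id is missing (a hole) or already selected.
        if row is not None and row[0] not in chosen:
            chosen[row[0]] = row
    return list(chosen.values())

rows = pick_random_rows(conn, 10)
```

Each iteration costs one primary-key lookup, which is why this stays cheap even on big tables.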
Also, a primary key may have holes – in other words, some numbers may be missing. This is due to deletions and failed transactions. Numbers lower than the maximum are not reused. To work around this problem, we can use two techniques.
If we use the >= operator in the WHERE clause, the query will always return a row. If the generated number is not present, the next one will be used. But this means that values immediately after a hole have more chances of being selected. The bigger the hole, the higher the chances.
You may think that you shouldn’t care about this, and you’re probably right. But before your decision is final, consider the possibility of a big hole.
If the query returns no rows, we can just retry. For big tables with just a few small holes, this is perfectly fine. But for tables with many holes, or big holes, this technique could consume too many resources.
Please don’t be too paranoid about this. Just do some math. What is the ratio of missing values in your table? That tells you how many times the query will be retried, on average.
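To put numbers on the bias of the >= workaround, here is a small simulation, again using Python’s sqlite3 module in place of MySQL, with a fabricated table containing one big hole:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY, payload TEXT)")
# Leave a large hole: ids 1-100 and 901-1000 exist; 101-900 were "deleted".
conn.executemany(
    "INSERT INTO some_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in list(range(1, 101)) + list(range(901, 1001))],
)

def pick_with_gte(conn):
    """Always returns a row, but ids right after a hole win far more often."""
    (lo,) = conn.execute("SELECT MIN(id) FROM some_table").fetchone()
    (hi,) = conn.execute("SELECT MAX(id) FROM some_table").fetchone()
    target = random.randint(lo, hi)
    return conn.execute(
        "SELECT id FROM some_table WHERE id >= ? ORDER BY id LIMIT 1",
        (target,),
    ).fetchone()[0]

# id 901 sits right after the hole, so it absorbs every target in 101..901:
# roughly 80% of all picks, versus 0.1% for a fair selection.
counts = {}
for _ in range(2000):
    picked = pick_with_gte(conn)
    counts[picked] = counts.get(picked, 0) + 1
```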
Selecting multiple rows
To select multiple rows with this technique, we can:
- Repeat the query multiple times;
- Or, for better performance, use the >= operator and a LIMIT greater than one. This, however, makes the results much less random.
Moving the random choice to the application
In its simplest form, this technique consists in randomly choosing a block of contiguous id’s. The application will randomly choose which ones it will use, and then select the matching rows.
Suppose you want to choose from blocks of 1000 values:
SET @block_size := 1000;
SET @min := (SELECT MIN(id) FROM some_table);
SET @max := (SELECT MAX(id) FROM some_table) - @block_size;
SELECT id
    FROM some_table
    WHERE id >= FLOOR(@min + RAND() * (@max - @min))
    ORDER BY id
    LIMIT 1000; -- user variables are not allowed in LIMIT, so repeat the block size
The bigger the block, the more resources you consume, but the more random the result. The application is responsible for choosing random, unique, existing id’s from the block, so there should never be a need to retry a query.
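A sketch of this block technique, with Python’s stdlib sqlite3 standing in for MySQL (names and sizes are invented): the query fetches one contiguous block, and the application samples unique id’s from it:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO some_table (id) VALUES (?)", [(i,) for i in range(1, 10001)]
)

BLOCK_SIZE = 1000

def random_ids_from_block(conn, how_many):
    """Fetch one random block of contiguous ids, then sample in the app."""
    (lo,) = conn.execute("SELECT MIN(id) FROM some_table").fetchone()
    (hi,) = conn.execute("SELECT MAX(id) FROM some_table").fetchone()
    start = random.randint(lo, max(lo, hi - BLOCK_SIZE))
    block = [r[0] for r in conn.execute(
        "SELECT id FROM some_table WHERE id >= ? ORDER BY id LIMIT ?",
        (start, BLOCK_SIZE),
    )]
    # Every id in the block exists, so no retries are ever needed.
    return random.sample(block, how_many)

ids = random_ids_from_block(conn, 10)
```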
The MOD variant
A problem with this technique is that the rows are returned from the same block of contiguous id’s. This could be perfectly acceptable, especially if the block is big. But if it is not, we could add some complexity to select a block of non-contiguous rows.
The idea of this variant is to:
- Define macro-blocks of mostly non-contiguous rows;
- Randomly choose one macro-block;
- Return a block from it.
To do this, we can calculate the modulus of the id, and use it to select random results. For example, we can define an indexed virtual column with an expression like (id MOD 4). Each row will have a value between 0 and 3. Even if there are holes, rows in these macro-blocks will mostly be non-contiguous, and the macro-blocks should contain approximately the same number of rows.
The application should generate a random number between 0 and 3. Then, to select blocks from these macro-blocks:
SET @block_size := 1000;
SELECT id
    FROM some_table
    WHERE id_mod = FLOOR(RAND() * 4) -- a random value between 0 and 3
    ORDER BY id
    LIMIT 1000; -- again, the block size must be a literal here
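The same idea in a runnable sketch: since this SQLite-based demo has no indexed virtual column, id % 4 is computed directly in the WHERE clause (in MySQL, the indexed id_mod generated column would avoid the full scan):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO some_table (id) VALUES (?)", [(i,) for i in range(1, 10001)]
)

BLOCK_SIZE = 1000
macro_block = random.randint(0, 3)  # which of the 4 macro-blocks to read

# Rows in this macro-block are mostly non-contiguous: ids 4 apart.
block = [r[0] for r in conn.execute(
    "SELECT id FROM some_table WHERE id % 4 = ? ORDER BY id LIMIT ?",
    (macro_block, BLOCK_SIZE),
)]
sample = random.sample(block, 10)
```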
A similar way to implement this idea is to use partitions, where the partitioning function is the id’s modulus. I would suggest this only if the table would benefit from partitioning for other reasons.
Choosing a technique
The block techniques run more queries than the first one with the >= operator, and the result may or may not be more random. Compared to the retry variant, they could run more or fewer queries, and the result is less random.
Generally, the block techniques are better if we have holes in the primary key.
Caching random series
The previous methods are a tradeoff: performance versus better randomness, where faster solutions are also more complex to implement. As long as we don’t use the ORDER BY RAND() antipattern, selecting random rows shouldn’t cause performance problems. But what if it does?
We could cache some randomly sorted id’s obtained with any of the previous solutions, in a table like this (MySQL syntax):
CREATE TABLE random_series (
    id INTEGER UNSIGNED AUTO_INCREMENT,
    random_id BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    UNIQUE unq_random_id_id (random_id, id)
);
We could use the series in ascending or descending order, randomly. Or the application could shuffle them.
- When an item is added, we should randomly decide whether to insert it into a series, replacing a random item.
- When an item is deleted, we should delete it from all series. It should be replaced by another random item.
- We may want to periodically delete the oldest series and replace them with new ones, to improve the randomness in the user experience.
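A minimal sketch of the caching idea, once more with Python’s sqlite3 in place of MySQL (the series size and refresh policy are arbitrary choices for the demo): one shuffled series of existing id’s is cached, and requests are served from it in order:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO some_table (id) VALUES (?)", [(i,) for i in range(1, 1001)]
)
conn.execute(
    """CREATE TABLE random_series (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           random_id INTEGER NOT NULL,
           UNIQUE (random_id, id)
       )"""
)

def rebuild_series(conn, series_size=100):
    """Cache one shuffled series of existing ids (run periodically)."""
    ids = [r[0] for r in conn.execute("SELECT id FROM some_table")]
    series = random.sample(ids, series_size)
    conn.execute("DELETE FROM random_series")
    conn.executemany(
        "INSERT INTO random_series (random_id) VALUES (?)",
        [(i,) for i in series],
    )

def next_random_ids(conn, how_many):
    """Serve random ids from the cache instead of rolling the dice per request."""
    return [r[0] for r in conn.execute(
        "SELECT random_id FROM random_series ORDER BY id LIMIT ?", (how_many,)
    )]

rebuild_series(conn)
ids = next_random_ids(conn, 10)
```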
As usual, please comment and let me know if you spot a mistake in this article, if you disagree on something, tell me your personal experience, or if you have more ideas to add.
Often your comments are extremely interesting. My articles alone are just (hopefully interesting) articles. But together with your comments they form a valuable knowledge base.