TEXT and BLOB are MySQL variable-length types for long texts and long binary data. They are often misused. Sometimes they are used because they are an ORM’s default type for strings. Sometimes because the user thinks that VARCHAR has a 255-character length limit (which was true in very old MySQL versions, but not today).
In this article, I argue that sometimes these types should be avoided, and I talk a bit about good practices.
Choosing between TEXT and BLOB
TEXT and BLOB are similar datatypes. The difference between them is that TEXT has the notions of character set and collation, which allow smart string comparisons and sorts. This means that MySQL knows the alphabetical order, whether case matters or not, whether an accented A should be considered equal to a plain A, and so on. On the other hand, BLOB is treated as a mere sequence of bytes, which works well – for example – for videos or audio contents, compiled code, and other data in binary format.
BLOB may also work well for texts, as long as sorting and smart comparisons don’t matter, and the applications know how to decode the characters. For example, I would use VARBINARY (the smaller version of BLOB) to store hashes and checksums, because the only operation we need to do with them is a byte-by-byte comparison.
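As a sketch (the table and column names here are made up for illustration), a SHA-256 digest is always 32 bytes, so even a fixed-length BINARY(32) is enough:

```sql
-- Hypothetical table for file checksums; a SHA-256 digest is always 32 bytes.
CREATE TABLE file_checksum (
    file_id INT UNSIGNED PRIMARY KEY,
    sha256 BINARY(32) NOT NULL
);

-- Store the raw bytes of the digest, not its hex representation:
INSERT INTO file_checksum (file_id, sha256)
    VALUES (1, UNHEX(SHA2('file contents here', 256)));

-- Lookup is a plain byte-by-byte comparison: no character set, no collation.
SELECT file_id FROM file_checksum
    WHERE sha256 = UNHEX(SHA2('file contents here', 256));
```

Storing the raw bytes instead of the hex string also halves the space the column takes.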
Choosing a size
In this article, I generically use the terms TEXT and BLOB. But there are actually several datatypes in these families:

- TINYTEXT and TINYBLOB, up to 255 bytes;
- TEXT and BLOB, up to 64 KB;
- MEDIUMTEXT and MEDIUMBLOB, up to 16 MB;
- LONGTEXT and LONGBLOB, up to 4 GB.
Theoretically, one should always choose the smallest possible datatype. For example, notice that LONGBLOB is far bigger than necessary in the vast majority of cases, and often even TINYTEXT and TINYBLOB are big enough. But in practice, if you are not sure, you can just be pessimistic and choose the type you feel more comfortable with. These types only use the space that the actual values need, so you will not allocate more space on disk or in memory, and there will hardly be a noticeable performance loss.
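For instance (a hypothetical table), if article bodies can exceed 64 KB, plain TEXT is not enough, but MEDIUMTEXT is plenty:

```sql
CREATE TABLE article (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    body MEDIUMTEXT NOT NULL  -- up to 16 MB; TEXT would cap at 64 KB
);
```

Note that with a strict sql_mode a too-long value is rejected with an error, while in non-strict mode it is silently truncated with a warning.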
Do you need to store BLOBs in the database?
Now that we have discussed the characteristics of the TEXT and BLOB macro-types and the specific types, we know enough to ask ourselves a more important question: should we actually use BLOBs? In other words, do we really need to store images, audio, video, binary data, etc., in the database?
Nowadays, most people would say no, and in most cases they are right. Storing media content takes space on the same disk used to store the rest of the database, and this can be a problem if your data is big-ish and your disk is small-ish. Writing that data makes transactions longer, which can create performance or locking problems, and uses more CPU time. Reading it copies it into the buffer pool, using space that would perhaps be better used for other, smaller values. Both writes and reads consume network bandwidth. Replication may also suffer and lag, because the writes to replicate are bigger.
In PostgreSQL there are no TEXT and BLOB families of types, but one can use, respectively, big varchar (or text) and bytea columns instead. Some of the previously mentioned concerns are also valid in Postgres. Another one should also be considered: from a physical point of view, UPDATE statements create a copy of the affected rows. Therefore, storing big data in rows that are frequently UPDATEd will make the table bloat (regardless of which columns are modified).
For these reasons, people usually don’t store media data in relational databases. They store media file URLs instead – or not even that, if the application is able to work out the URLs by itself.
On the other hand, there is a reason to store media data in a relational database. Imagine you store an image in a file server and its metadata in a DBMS. Transactions guarantee that all the metadata will persist, or the whole transaction will fail. The file server may give similar guarantees. But even in that case, you have two different transactions for a single logical unit of data, and one could fail while the other succeeds – which largely defeats the purpose of transactions. So, depending on how much you value your data, you may still choose to store images in the DBMS.
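To make the point concrete, here is a sketch with hypothetical image and image_tag tables: both INSERTs become visible atomically, or neither does (LOAD_FILE requires the FILE privilege and a permissive secure_file_priv setting):

```sql
START TRANSACTION;

INSERT INTO image (file_name, mime_type, content)
    VALUES ('logo.png', 'image/png', LOAD_FILE('/tmp/logo.png'));

INSERT INTO image_tag (image_id, tag)
    VALUES (LAST_INSERT_ID(), 'branding');

COMMIT;  -- or ROLLBACK: either way, bytes and metadata stay consistent
```

With the image on a file server instead, a crash between the file write and the metadata COMMIT leaves an orphan file or a dangling URL.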
EDIT: I realise I didn’t elaborate enough on why it is desirable to write/delete BLOBs in a transaction. See the comment below by Bill Karwin.
Should BLOBs be in a separate table?
InnoDB internally stores TEXT, BLOB and VARBINARY types in a special way, because they can be big. If a value is shorter than 768 bytes, it is stored in the same page as the rest of the row. Otherwise, it is stored in one or more separate pages, depending on how big it is, and a 20-byte pointer to those pages is written in the row page.
This is true with the DYNAMIC and COMPRESSED row formats. DYNAMIC is the default with MySQL 8.0 and MariaDB 10.2, but COMPACT was the default in older versions. This means that, if you didn’t set the row format to DYNAMIC explicitly and you don’t use compressed tables, the first 768 bytes of long BLOB values are stored in-row.
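You can check a table’s row format and, if needed, change it (the product table name here is just an example):

```sql
-- See the current row format (look at the Row_format column):
SHOW TABLE STATUS LIKE 'product';

-- Switch to DYNAMIC, so long values are stored entirely off-page:
ALTER TABLE product ROW_FORMAT = DYNAMIC;
```

Note that the ALTER TABLE rebuilds the table, so on a big table it should be scheduled with care.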
The former method makes the rows faster to read if you don’t read the long BLOB columns, because it can save a lot of IO. The latter makes all reads on the table slower.
There is however one thing to note: most developers tend to always use SELECT *, which reads all columns. As a consequence, even if InnoDB wisely stores long BLOBs in separate pages, in practice many queries unnecessarily read those pages. Also, as mentioned above, those queries will probably move the BLOB values into the buffer pool, and returning those values to the clients will increase the network traffic.
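As a sketch, with a hypothetical user table that has an avatar BLOB column:

```sql
-- Reads the avatar's overflow pages even if the application
-- only needs the name:
SELECT * FROM user WHERE id = 42;

-- Touches neither the BLOB pages nor the network with its bytes:
SELECT id, name, email FROM user WHERE id = 42;
```

The two queries return the same row, but only the first one pays the IO, buffer pool and bandwidth costs of the BLOB.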
The wise solution is to convince developers to list, in their queries, only the columns they need. But in my experience this is a very hard task. Even harder is asking them to review old queries to fix the problem. Not only because they don’t want to do such a boring (and error-prone) task, but also because they often can’t: management decides their priorities, and those priorities are usually new features. Consolidation of existing code is usually considered an optional effort that doesn’t create much value; in some places it’s almost regarded as a hobby. See my article about the importance of setting proper SLOs: it should provide enough information to understand the economic value of code consolidation.
I’m not saying you should give up on query optimisation. Poorly optimised queries put a hard limit on the performance and scalability goals we can achieve. Developers should dedicate a part of their time to writing queries properly, and to refactoring the ones that create issues.
That said, we need to be realistic. Suppose a development team comes up with a new table like this, which is going to be read frequently:
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    description TEXT,
    photo_front BLOB,
    photo_back BLOB
);
Notice that there are three TEXT/BLOB columns. If we review new tables before they make it to production, we can reach out to the developers and ask how often those columns will be accessed. Don’t expect them to come up with precise information about the usage of a feature that is not yet released, unless they are time lords. But their educated guesses are often close enough. Suppose that the answer is:
- photo_front is accessed by almost all queries on the product table, which are expected to run several times a second;
- photo_back is only accessed by the query that returns the details of a single product, which will be called much less often.
Hopefully you can look together at some graphs of URL usage and do some basic math, which will make the guess more educated.
So, something we can do is move photo_back to a separate table, to avoid the overhead caused by SELECT * (again: no matter how much better it would be to avoid SELECT *, we know that’s not going to happen). The product details query will need a JOIN, but we decided that we don’t care, because it will run seldom.
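A sketch of the split, using the table from the example above (the new table name and the foreign key are my own choice):

```sql
CREATE TABLE product_detail (
    product_id INT UNSIGNED PRIMARY KEY,
    photo_back BLOB,
    CONSTRAINT fk_product_detail_product
        FOREIGN KEY (product_id) REFERENCES product (id)
);

-- The frequent queries keep hitting product only:
SELECT id, name, price, photo_front FROM product WHERE id = 42;

-- The rare product-details query pays for a JOIN:
SELECT p.id, p.name, p.price, p.description, p.photo_front, d.photo_back
    FROM product p
    LEFT JOIN product_detail d ON d.product_id = p.id
    WHERE p.id = 42;
```

The 1:1 relationship keeps the split invisible to the data model, while even a careless SELECT * on product no longer drags photo_back pages into the buffer pool.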
The problem with DEFAULT values
Only starting from MySQL 8.0 and MariaDB 10.2 is it possible to assign a default value to TEXT and BLOB columns. These versions significantly improved the handling of default values. Older versions leave us with two options:
- Always specifying a value in our INSERTs – possibly an empty string, if we don’t want to write a real value.
- Avoiding declaring the column as NOT NULL. In this way, the default value will be NULL.
However, NULL handling is inconsistent in SQL, so I’d prefer to avoid it when possible.
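For example, on MySQL 8.0.13+ a BLOB/TEXT default must be written as an expression, in parentheses; on older versions the empty string has to be supplied by every INSERT:

```sql
-- MySQL >= 8.0.13 (MariaDB >= 10.2 also accepts the plain literal):
CREATE TABLE note (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    body TEXT NOT NULL DEFAULT ('')
);

-- Workaround for older versions: always write a value explicitly.
INSERT INTO note (body) VALUES ('');
```

With the declared default in place, an INSERT that omits body simply gets the empty string, and the column can stay NOT NULL.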