TEXT and BLOB good practices

A poster of The Blob, starring Steve McQueen
According to Steve McQueen, it’s better to avoid BLOBs

TEXT and BLOB are MySQL variable-length types for long texts and long binary data. They are often misused. Sometimes they are used because they are an ORM’s default type for strings. Sometimes because the user thinks that VARCHAR has a 255 length limit (which was true in very old MySQL versions, but not today).

In this article, I argue that sometimes these types should be avoided, and I talk a bit about good practices.

Choosing between TEXT and BLOB

TEXT and BLOB are similar datatypes. The difference between them is that TEXT has the notions of character set and collation, which allow smart string comparisons and sorting. This means that MySQL knows the alphabetical order, whether case matters, whether an accented A should be considered equal to a plain A, and so on. BLOB, on the other hand, is treated as a mere sequence of bytes, which works well – for example – for videos, audio contents, compiled code, and other data in binary format. BLOB may also work well for texts, as long as sorting and smart comparisons don’t matter and the applications know how to decode the characters. For example, I would use VARBINARY (the smaller, VARCHAR-like version of BLOB) to store hashes and checksums, because the only operation we need to perform on them is a byte-by-byte comparison.
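As a sketch of the difference, assuming MySQL 8.0 and its utf8mb4_0900_ai_ci collation (accent- and case-insensitive):

```sql
-- With an accent- and case-insensitive collation, MySQL performs
-- a "smart" comparison:
SELECT 'Révolution' = 'revolution' COLLATE utf8mb4_0900_ai_ci;  -- 1 (equal)

-- Cast to binary, the same strings are just different byte sequences:
SELECT CAST('Révolution' AS BINARY) = CAST('revolution' AS BINARY);  -- 0
```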

Choosing a size

In this article, I use the terms TEXT and BLOB generically. But there are actually several datatypes in this family:

  • TINYTEXT / TINYBLOB (up to 255 bytes)
  • TEXT / BLOB (up to 64 KB)
  • MEDIUMTEXT / MEDIUMBLOB (up to 16 MB)
  • LONGTEXT / LONGBLOB (up to 4 GB)

Theoretically, one should always choose the smallest suitable datatype. For example, LONGBLOB is far bigger than necessary in the vast majority of cases, and often even TINYTEXT or TINYBLOB are big enough. In practice, though, if you are not sure, you can be pessimistic and pick a larger type: since these types are variable-length, you will not allocate more space on disk or in memory, and there will hardly be a noticeable performance loss.
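For instance, here is a table sketch (names are illustrative) where each column uses the smallest type that clearly fits its contents:

```sql
CREATE TABLE attachment (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    sha256 VARBINARY(32) NOT NULL,  -- a SHA-256 hash is always 32 bytes
    summary TINYTEXT,               -- up to 255 bytes: plenty for a short abstract
    body MEDIUMTEXT,                -- up to 16 MB of text
    payload LONGBLOB                -- only because contents may really exceed 16 MB
);
```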

Do you need to store BLOBs in the database?

Now that we have discussed the characteristics of the TEXT and BLOB macro-types and the specific types, we know enough to ask ourselves a more important question: should we actually use BLOB? In other words, do we really need to store images, audio, videos, and other binary data in the database?

Nowadays, most people would say no, and in most cases they are right. Storing media contents takes space on the same disk used for the rest of the database, and this could be a problem if your data is big-ish and your disk is small-ish. Writing such data makes transactions longer, which could create performance or locking problems, and uses more CPU time. Reading it copies the values into the buffer pool, using space that would perhaps be better used for other, smaller values. Both writes and reads consume network bandwidth. Replication may also lag, because the writes to replicate are bigger.

In PostgreSQL there is no BLOB type, but the TEXT type (a varchar with no declared maximum length) and bytea can be used instead. Some of the previously mentioned concerns also apply to Postgres, and another one should be considered: from a physical point of view, UPDATE statements create a copy of the affected rows. Therefore, storing big data in rows that are frequently UPDATEd will make the table bloat (regardless of which columns are modified).

For these reasons, people usually don’t store media data in relational databases. They store the media files’ URLs instead – or not even that, if the application is able to work out the URLs automatically.

On the other side, there is a reason to store media data in relational databases. Imagine you store an image on a file server and its metadata in a DBMS. A transaction guarantees that either all the metadata persists or the operation fails. The file server may give similar guarantees. But even then, you have two separate transactions for a single logical unit of data, so atomicity is lost. Depending on how much you value your data, you may therefore still choose to store images in the DBMS.

EDIT: I realise I didn’t elaborate enough on why it is desirable to write/delete BLOBs in a transaction. See the comment below by Bill Karwin.

Should BLOBs be in a separate table?

InnoDB internally stores TEXT, BLOB, VARCHAR and VARBINARY values in a special way, because they can be big. With the DYNAMIC and COMPRESSED row formats, a value that doesn’t fit in the row’s page is stored entirely in one or more separate overflow pages, and only a 20-byte pointer to them is left in the row. With the older COMPACT format, the first 768 bytes of a long value are stored in the row page, and only the rest goes to overflow pages.

DYNAMIC is the default in MySQL 8.0 and MariaDB 10.2, but COMPACT was the default in older versions. This means that, if you didn’t set the row format to DYNAMIC explicitly and you don’t use compressed tables, a 768-byte prefix of every long TEXT and BLOB value is stored in-row.

Full off-page storage makes the row faster to read if you don’t read the long TEXT and BLOB columns, because it can save a lot of IO. In-row prefixes, instead, make all reads on the table slower, because every page holds fewer rows.
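To check which row format a table uses, and to rebuild it as DYNAMIC, something like this should work on MySQL 8.0 (mydb and mytable are placeholders):

```sql
-- Check the row format of a table (MySQL 8.0; on 5.7 the view is
-- called INNODB_SYS_TABLES instead):
SELECT name, row_format
    FROM information_schema.INNODB_TABLES
    WHERE name = 'mydb/mytable';

-- Rebuild the table with the DYNAMIC row format:
ALTER TABLE mytable ROW_FORMAT = DYNAMIC;
```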

There is however one thing to note: most developers tend to always use SELECT *, which reads all columns. As a consequence, even if InnoDB wisely stores long TEXT and BLOBs in separate pages, in practice many queries unnecessarily read those pages.

Also, as we mentioned, those queries will probably move TEXT and BLOB values into the buffer pool. And returning those values to the clients will increase the network traffic.

The wise solution is to convince developers to list the columns they need in their queries. In my experience, though, this is a very hard task; even harder is asking them to review old queries to fix this problem. Not only because they don’t want to do such a boring (and error-prone) task, but also because they often cannot: management decides their priorities, and those priorities are usually new features. Consolidation of existing code is usually considered an optional effort that doesn’t create much value; in some places it’s almost regarded as a hobby. See my article about the importance of setting proper SLOs: it should provide enough information to understand the economic value of code consolidation.

I’m not saying you should give up on query optimisation. Poorly optimised queries put a hard limit on the performance and scalability goals we can achieve. Developers should dedicate part of their time to writing queries properly, and to refactoring the ones that create issues.

That said, we need to be realistic. Suppose a development team comes up with a new table like this, which is going to be read frequently:

CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    description TEXT,
    photo_front BLOB,
    photo_back BLOB
);

Notice that there are three TEXT or BLOB columns. If we review new tables before they make it to production, we can reach out to the developers and ask how often those columns will be accessed. Don’t expect them to come up with precise information about the usage of a feature that is not yet released, unless they are Time Lords; but their educated guesses are often close enough. Suppose the answer is: photo_front is accessed by almost all queries on the product table, which is expected to happen several times per second, while description and photo_back are only accessed by the query that returns the details of a single product, which will be called much less often.
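Under those assumptions, the two access patterns would look like this (the queries are sketches, not the team’s actual code):

```sql
-- Frequent query: reads only what the product page needs, including
-- photo_front, but leaves description and photo_back untouched:
SELECT id, name, price, photo_front
    FROM product
    WHERE id = 42;

-- Rare query: the only one reading the full details of a single product:
SELECT id, name, price, description, photo_front, photo_back
    FROM product
    WHERE id = 42;
```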

Hopefully you can look together at some graphs of URL usage and do some basic maths, which will make the guess more educated.

So, one thing we can do is move description and photo_back to a separate table, to avoid the overhead caused by SELECT * (again: no matter how much better it would be to avoid SELECT *, we know it’s going to happen). The product details query will become a JOIN, but we decided we don’t care, because it will run seldom.
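A possible split, assuming a 1:1 child table (the name product_details is mine):

```sql
CREATE TABLE product_details (
    product_id INT UNSIGNED PRIMARY KEY,
    description TEXT,
    photo_back BLOB,
    FOREIGN KEY (product_id) REFERENCES product (id)
);

-- The rare product-details query becomes a JOIN:
SELECT p.id, p.name, p.price, p.photo_front,
       d.description, d.photo_back
    FROM product p
    LEFT JOIN product_details d ON d.product_id = p.id
    WHERE p.id = 42;
```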

The problem with DEFAULT values

Only starting from MySQL 8.0.13 (where the default must be written as an expression, between parentheses) and MariaDB 10.2 is it possible to assign a default value to TEXT or BLOB columns. These versions significantly improved default value handling. Older versions leave us with two options:

  • Always specifying an empty string in our INSERTs, if we don’t want to write any content;
  • Not declaring the column as NOT NULL, so that its default value will be NULL.

But NULL behaves inconsistently in SQL, so I prefer to avoid it when possible.
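On a recent version, the default can be declared directly. A minimal sketch (note the parentheses: MySQL 8.0.13+ only accepts expression defaults for these types, while MariaDB 10.2+ also accepts a plain literal):

```sql
CREATE TABLE note (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    -- MySQL 8.0.13+: the default must be an expression, hence ('')
    body TEXT NOT NULL DEFAULT ('')
);
```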

Toodle pip,
Federico

Comments (4)

  1. “people usually don’t store media data in relational databases”

    That’s what programmers always repeat and what my professor told me. But it is not true any more. Come on! We have 10G network ports, blazing fast SSDs and more memory than we usually need. Generations have wasted so much time slowly rsyncing data across datacenters or using badly behaving distributed file systems.

    Of course I store all my files in my database! That makes data handling so much easier. There is only ONE API of truth that you can build your architecture around 🙂 Keep it simple. A car owner wouldn’t put diesel in a gasoline car, right? So you have to know the database mechanics, because inline blobs are tricky, as you pointed out.

    After all, one downside remains: Big backups. But you know what? This is a task I proudly take care of!

    1. Hi Uli.

      I agree that, if you don’t have reasons for moving your files out of the database, you should keep them there. I also mentioned a possible reason to keep files in the database.

      But every case is different. Consider this: if you have an EC2 *.large instance, you may have 4G or 8G of RAM. In the latter case, you probably have a 6G buffer pool – not enough to fill it with big files.
      And the guaranteed bandwidth is 3500 Mbps. This means that, to transfer a 1G content (8 Gb, and 8 / 3.5 ≈ 2.3), the instance will need more than 2 seconds. Other queries are served in the meanwhile, so in practice queries will be served concurrently, but the network will be the bottleneck.

  2. Other reasons to put blobs in the database:

    – Updating or deleting content can be done with ACID qualities. You can roll back, updates aren’t visible until you commit, etc.
    – If you delete a record, you don’t have to remember to delete the content from a fileserver as a separate step. What if you forget? How do you know which files on your fileserver are “orphans?”
    – Replicating a database to a failover instance ensures all your content is replicated too. How can you be sure of that if you don’t store content in the database? There are some solutions like automatic sync of S3 buckets to multiple regions, but it’s an extra thing to set up and monitor.

    I agree that putting big blobs in the database makes the database big, makes the backup big, and makes result sets from SELECT * big too.

    Also it makes binary logs big when you update a row that includes a large blob. We had a problem recently with row-based binlog including the whole blob, even when the blob wasn’t updated. We had to change to binlog_row_image=noblob or minimal. Another strategy would be to split the table, so blobs are in a separate table with a 1:1 relationship.

    There are pros and cons to every choice.

    1. Hi Bill.
      I agree on everything – see the EDIT above about transactions.
      Very good point about the binlog. Those who use Debezium or similar software must use binlog_row_image=FULL, so all BLOBs are logged…
