The perils of ALTER TABLE in MySQL/MariaDB

Modifying a table
An ALTER TABLE operation

ALTER TABLE is often necessary, but it can be risky in a production environment for many reasons. It is a destructive command, meaning that it can delete some data and cannot be rolled back. It acquires locks that can cause problems, especially if you have some form of high availability. It can obviously be a quite heavy operation. Also, some MySQL and MariaDB features are not designed to allow safely change table structures.

There are solutions for these problems, and good practices to follow when altering tables. For example, it is often a good idea to use a tool like pt-online-schema-change or gh-ost to modify tables (and sometimes it is a good idea not to use them). Specifying options like ALGORITHM and LOCK also helps us to gain more control. But these topics will probably be discussed in a future article.

For now, let’s focus on the risks instead. It’s important to understand them, before deciding what to do or how.

This is just a quick reference. It could be incomplete and all details are missing. If you would like to know more about some of these points, or if you have some more risks to add to this list, please write a comment below.

EDIT: In MariaDB the use of external tools for ALTER TABLEs may be obsolete after MDEV-16329 Cross-engine ALTER ONLINE TABLE is completed and released. Thanks to Marko Makela for reporting this in the comments.

Disk space

Some ALTER TABLE operations require a table copy. Before starting, we need to be sure that we have enough space on disk. If we have slaves, we should be sure that our slaves also have enough disk space.

When you add a column or index, a table will permanently grow. When you add a column, an index or a table, new rows will require space permanently. These are good reasons to use the correct data types. While for each single column using the correct type and size could be irrelevant, it will probably be relevant to choose the correct type and size every time we make a change, as an organisation rule.

At the same time, don’t be too paranoid about space. If you create columns that are too small, a migration will be necessary in the future to fix the problem. And if your application uses a non-restrictive SQL_MODE, it is very possible that your data will be truncated and no one will note that.

Locks

Some ALTER TABLE types are online. The exact set of online operations depends on your MySQL version, and they are not the same between MySQL and MariaDB. Some operations are even instantaneous.

But others will lock DML statements. Imagine you write to a table several times a second, and you need to alter the table… but the operation will last for hours. This is a potential disaster, so the migration needs to be planned.

Galera

Galera has two methods to run ALTER TABLE: RSU and TOI. RSU can potentially avoid a cluster downtime, but it is risky. For example, it allows to write on node1 while you add a new column without a default value on node2. But the in this case, the INSERTs will fail on node2.

Replication

Even a simple SELECT acquires metadata locks. This is expected, because we don’t want a column to disappear in the middle of a query. But imagine we have a master and a slave. We run long-running queries on the slave. At some point, we run an ALTER TABLE on the master and, after it completes, it is propagated to the slave. But long-running queries will block the ALTER TABLE. And replication will lag for… how much? Hours? Maybe it’s more than we can tolerate.

A possible solution is to run the ALTER TABLE on the master and the slave separately, setting sql_bin_log=0 at session level so that the operation will not be replicated. Notice that this is risky if you don’t know what you are doing: adding or dropping a column on the slave can make some queries fail even if they succeeded on the master.

EDIT: MDEV-11675 will allow true lock-free ALTER TABLE in MariaDB slaves. Thanks to Marko Makela for reporting this in the comments.

Replicate to different tables

A slave can have more columns than the master for a certain table, as long as those columns are added at the end of the table. The master can have more column than the slaves, if they are added at the end of the table. Replication will not take these columns into account.

Probably the astute reader already guessed what the problem is. If we use this feature, we must be sure that non-common columns remain at the end of the table. By default ALTER TABLE adds new columns at the end of the table, so we need to be very careful to use the AFTER clause when needed. And we must be sure that extra columns are in the master or in a slave, not both.

Of course the simplest solution is not to use this feature.

Adding columns

A source of many problems for IT people is that there is no way to deploy code in the exact moment an ALTER TABLE finishes. If you add a column, INSERTs should be rewritten accordingly. But old INSERTs will fail on the new table, new INSERTs will fail on the old table, and the code change cannot be coordinated with the DBMS.

A clean solution is to add columns with DEFAULT value. In this way old INSERTs will not fail. INSERTs can be adjusted later.

Dropping columns

For the same reason, dropping columns can also be problematic. If we drop the column too early, all statements that mention the column will fail. We can drop it later, but only if the column has a DEFAULT value, otherwise INSERTs will fail. If the column doesn’t have a DEFAULT value, we should add it, then deploy the new code, then drop the column.

Dropping indexes

MySQL supports the USE/FORCE/IGNORE INDEX syntax. We basically list some indexes that should be used or ignored, no matter what MySQL thinks about it. But if we drop an index that is named by queries in this way, the queries will fail. This is true even in the case of IGNORE INDEX.

Adding indexes

I think that MySQL is quite good at choosing the right index to use. But sometimes it can do horrible mistakes, and a query can be unnecessarily slow. For that reason, adding an index sometimes implies some risks. Unless we need to quickly add an index to fix a performance problem, it is a good idea to add indexes in staging first.

Changing the storage engine

ALTER TABLE allows to change a table’s storage engine, for example from InnoDB to BLACKHOLE. But we should keep in mind that some features, like foreign keys and virtual columns, are only supported by some storage engines. When we try to change the storage engine and those features are present, the ALTER TABLE may fail; and that’s the best case. But it could also succeed. If we have foreign keys, for example, they will silently disappear.

No CTRL-Z

Unfortunate code deploys can be reverted. Schema changes cannot. If a migration contains a mistake, the problem can only be fixed with another migration. Developers should be careful with schema changes, and DBAs should always review these changes.

This aspect is particularly scary when it comes to run DROP TABLE or ALTER TABLE ... DROP COLUMN statements, because they destroy data. This is the reason why I wrote the article Dropping database objects in a safe way.

Some software handle database structure versioning and migrations. They generally have the concept of a down-migration – a way to revert the last migration. Some ORMs have this feature. As long as the migration is written in pure SQL, this is good. It makes sure you run exactly the same ALTER TABLE in all environments (production, staging, developer machines…) without taking a new backup from production. Keep in mind that it’s not magic: data changed or destroyed by a down-migration cannot be restored with an ALTER TABLE.

Temporal tables

MariaDB temporal tables (or system-versioned tables) keep the history of each row. But they don’t keep the history of table structures. If it is modified, history is somehow changed accordingly. If a new column is added, it will be added to all old rows. If a column is dropped, its value will disappear from the history. By default, ALTER TABLE is not allowed on temporal tables.

See also

Related articles:

Related courses:

Conclusions

If you encountered other types of problems with migrations, please let me know with a comment, I am curious.

In later articles we will see what can be done to mitigate the risks.

A personal note

Do you have big tables? Write-intensive tables? You don’t feel confident enough to run heavy ALTER TABLEs on them?

Don’t get stuck with the current version of your schema, putting a hard limit to innovation and development. I can help you with migrations, and other delicate tasks. Please check my services.

Photo credit

Comments (10)

  1. I do a lot of work on the “schema change tool” at Etsy. It tries to execute ALTER without taking a lock by adding LOCK=NONE to the ALTER. If this fails for some reason (table operation requires a lock, DML log for table too small, duplicate key in unique index, etc) the tool falls back to pt-osc. It works well for us, and schema changes are generally non-disruptive.

      1. There is a bug in my design and implementation of online ALTER TABLE, which originally appeared in MySQL 5.6. As noted in one of my comments in https://jira.mariadb.org/browse/MDEV-16329 (for a cross-engine ALTER ONLINE TABLE that would hopefully make external schema change tools obsolete), MDEV-18127 reported a case where a table-rebuilding ALTER TABLE can fail if a concurrent DML transaction was rolled back due to unique index violation. It would be better to defer the logging of concurrent DML operations to their commit time.

        I can imagine that another important reason for schema change tools to exist is that on replication slaves, ALTER TABLE will not be initiated before the master has committed it. That is, replication will lag by the duration of the ALTER TABLE operation. This should be eventually fixed in https://jira.mariadb.org/browse/MDEV-11675 by introducing ‘start’ and ‘rollback’ events.

        In MariaDB 10.3 (and MySQL 8.0), ADD COLUMN can be executed instantaneously. MariaDB 10.4 also implements instant DROP COLUMN and changing the order of columns.

        1. Thank you Marko. I will update the article with the references to the MDEVs later.

          Yes, I confirm that the reasons to use tools are locks and replication (though Justin mentioned another in his comment). I really hope that when those MDEVs are completed, we will slowly forget about those tools.

          MySQL also has instant ADD COLUMN now, but there are a lot of old versions around.

  2. pt-osc will remove duplicates, which is fine as long as you understand that it is going to do that. Given that our data is sharded, UNIQUE indexes are very rare, and we’ll check for duplicates before adding one.

    1. Thanks for pointing that. I don’t think that many companies run the risk of executing incompatible DDL at the same time, but the rest is useful for a lot of people. I think I’ll add a couple of those points here.

Leave a Reply

Your email address will not be published. Required fields are marked *