Last Updated on
Backups fail, just like everything else. Sometimes it’s a one-time failure, sometimes our backup automation stops producing usable backups and we have a permanent failure. And if… pardon, when it happens, there are two options. First: you test backups, you notice the problem immediately, you solve it. No service outage or data loss. Second: you don’t test your backups because, come one, it’s very unlikely that they fail and you have better things to do.
Yes, sure. Ask GitLab. Everyone remembers their incident in 2017: while trying to fix other problems they deleted a database, and weren’t able to restore the last six hours of data: Out of five backup/replication techniques deployed none are working reliably or set up in the first place. The most recent backup was a snapshot that someone has taken manually six hours ago for reasons unrelated to the incident – yes, they were lucky he did.
Quite impressive, right? But this is not science fiction, it happened. It happened to an organisation with a great reputation. Well, this is exactly what I try to avoid when I say that backups need to be tested.
A backup strategy should only be considered in place if it recently proved that it works.
What to monitor
In case of a disaster, backups need to be restored in a certain amount of time (a Recovery Time Objective based on your Service Level Objectives). If we break it down, it means that we need to be sure that:
- The script that creates the backup reports success (by convention, a 0 exit code);
- The backup exists, in the place where the restore procedure expects to find it;
- The backup is not empty;
- The restore procedure is able to restore the backup;
- The procedure completes in a time that is no higher than the RTO.
The important points are 4 and 5. The points 1, 2 and 3 are redundant. However, they are convenient to monitor.
In the next paragraph we’ll see how to test an existing backup. But suppose that generating the backup takes three hours. Before three hours, the tests cannot even start. But if the backup is not created at all, it is empty, or the backup script fails with an exit code other than 0, this is something we can detect early.
How to test backups
If a backup file exists, it is in the right place and it has a reasonable size, we need to check if it can be restored. There can be many possible reasons why it can’t – the most obvious being corrupted backups, corrupted original databases, not enough disk space and version mismatch. Give that, in order to test if a backup is usable we need to actually restore it.
An obvious way to do this would be to restore the backup on a test server and then run basic tests to check if the queries are readable. This means that, for each backup, we would need a test server that is not used for anything else. This is is not great.
Many organisations, even small ones, need to periodically feed some servers from production: for example staging, test, development databases, maybe even some databases used by analysts and data scientists. Feeding them every night from backups is a good idea. If we have multiple backup strategies (and we should!), each strategy should be responsible for feeding different sets of services. If restoring some backups takes too long and cannot happen during the night, it could happen during the weekend instead. Basic automated tests should be run anyway – to check that all tables and other database objects are usable. In the unlikely case that those tests are not enough, hopefully some developers will report an anomaly (a corrupted trigger, etc).
The above idea is easily usable for databases that don’t contain any PII. However, regulations like the GDPR require that test databases are anonymised and backups are encrypted. It is perfectly reasonable to only encrypt a backup after it’s tested, but we need to be sure that no unexpected event can leave our backups unencrypted on some server. Anonymisation is a more complex problem, but not too complex as far as backups are concerned. Complete, non-anonymised backups can still be restored to intermediate servers that the final users (developers, analysts, data scientists…) don’t have access to. Then they will be anonymised and, if the whole process succeeds, their contents can be transferred to the test servers and analytics servers.
The forget procedure
Let’s not forget, however, the RTBF (Right To Be Forgotten). The GDPR requires that a user can required to be completely removed from our databases. The problem is that we need to apply the forget procedure every time we restore a backup, to be sure that the forgotten users are not restored. We need to automatically test this procedure after every restore. I don’t believe it is necessary to invalidate a backup if the forget procedure doesn’t work; but a non-critical alert should be triggered and the problem should be fixed as soon as possible.
Retrieve old backups
The latest backup (and optionally multiple recent backups) should be stored in a place which is close to production. This will allow to quickly copy them to the right place and restore them, when needed.
But older backups are typically stored in some remote cloud service, like Amazon S3 or in-premise NAS. Those old backups are rarely needed and they are probably retrieved manually. Nevertheless, if your company spends a considerable amount of money to keep them safe they must be considered important. It is a good thing to regularly – but not frequently – test the retrieval procedure, if nothing else.
The same considerations apply if the backups are written to a tape.
A last test should not be forgotten, even if it’s not strictly related to backups. The monitoring and alerting systems need to be tested. We said that a backup should only considered in place if it was recently tested; the same should be said about alerts. Test them regularly, during working hours. Be sure that everyone know that the test alert is actually a test. Send automatic emails and chat messages as a reminder. People should know that they don’t need to worry if they receive a critical alert on Monday at 14. But someone should definitely worry if that alert is not triggered!
- Database Health Checks – After a review of your database status, I will check if your backup strategies are appropriated and well tested, and will recommend how to fix problems you may have
- Monthly or Weekly Time – We can surely use this time to improve/maintain your backup procedures and other part of your infrastructure
- Contact me for a consulting aimed to setup proper backup strategies and the necessary tests.
A MySQL Health Check will check if you are taking automatic backups properly, and will recommend how to test your backups properly.
Do you have experiences or thoughts to share? Do you have alternative solutions you want to share? Did you find anything wrong in the article? Please comment!
Comments to my articles make this website not just one person’s texts, but also a more collective knowledge base.