Deploying DB migrations with confidence

What role does your database play in your CI/CD?
How long does it take for your devs to get a running database?
How long does it take to recover a dev database after an accidental destruction?
How current are the database snapshots your devs use?
How confident are you with schema updates going to production?

These days DevOps conferences and talks are filled with containerisation, Docker, k8s, auto-scaling, auto-healing, CI/CD, agile, etc. Disappointingly, however, most of them only touch on stateless environments, and far too seldom do engineers share their knowledge on running a database in a CI/CD environment and workflow.

In this blog article I will give some info on how we solved it at my current place – which mainly consists of running a web-application based on Laravel on a traditional “LAMP”-stack.

I’ll be honest, we had a rough start. We used to have a shared AWS RDS instance for our QA & Staging environments, which developers also used: they connected their local workspace to that remote MySQL instance so they could view the webapp locally with proper, non-seeded data, which is sometimes essential to debug and fix certain types of reported bugs.

So, our setup at the time kinda worked, but was super unreliable. It was enough for a dev to accidentally drop the database, or for a botched staging deploy, to suddenly kill the workflow of the whole team. Restoring took over an hour (a mix of larger-than-your-usual-wordpress-database and budget-restrictions-on-dev-instances).

Obviously this was super annoying and had to change. So I went over to my friends at the ZATech Slack channel, but I quickly hit a wall, and on the contrary it seems I stepped on some people’s toes: I learned my lesson, never mention “on-premise” near a DevOps engineer (causes hefty allergic reactions)…

Basically the following two statements were made:

  • everything is in the cloud, no on-premise or no local databases
  • DB should be part of the CI/CD

It was difficult for me to agree with the first point: being based in South Africa, there are absolutely no proper cloud providers nearby, the next hop is AWS London. And anyone who has ever connected their local webapp to a remote MySQL knows how quickly higher latency (>10ms) can make working locally a pita.

While I do agree that the DB should be part of the CI/CD, there is still a huge benefit (especially in efficiency and speed) when developing locally – and also not having to rely on seeded data.

Disappointed by the lack of solutions, I decided to go my own way against all odds, with the support of our CTOs and a wonderful person in our finance team who allocated some budget for on-premise hardware (specs for the geeks like me: i7-6700 / 64GB RAM / 4x 256GB SSD @RAID10 / UPS).

Step 1: create a database service

We will use the database service to actually host the databases. I use Jenkins to run a simple nightly downstream job that mysqldumps the production database (ignoring some larger tables that are not needed), anonymises the data (emails + mobile numbers), and pushes the dump to a predictable location (which is accessible internally by devs).
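The anonymisation step can be sketched as follows. This is only an illustration: the post does not say how the data is anonymised, so the function names, the hashing approach and the placeholder domain are all assumptions. The sketch is deterministic on purpose, so values that were duplicates in production remain duplicates after anonymisation.

```python
import hashlib
import itertools


def anonymise_email(email: str) -> str:
    """Replace an email with a stable, non-reversible placeholder (hypothetical scheme)."""
    digest = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()[:12]
    # example.invalid is a reserved name, so no real mailbox can be hit by accident
    return f"user_{digest}@example.invalid"


def anonymise_mobile(number: str) -> str:
    """Swap every digit for a hash-derived digit, keeping '+', spaces and length."""
    digest = hashlib.sha256(number.encode("utf-8")).hexdigest()
    digits = itertools.cycle(str(int(digest, 16)))
    return "".join(next(digits) if ch.isdigit() else ch for ch in number)


print(anonymise_email("jane@example.com"))
print(anonymise_mobile("+27 82 123 4567"))
```

Keeping the shape of the values (valid-looking email, same number length) means application-level validation still passes when devs work with the dump.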

From there, the database service launches three VMs that have MySQL running on them (one shared amongst devs, one for experimental test cases / usage, and one for our automated builds, see step 2), imports the above dump, then creates a snapshot of the storage drive. I use VirtualBox as I had extensive experience using it programmatically, but if I were to redo the architecture I would most probably do it with libvirt/qemu.
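Restoring such a VM to its nightly snapshot could look roughly like the sketch below. It only builds the VBoxManage command lines (power off, restore snapshot, start headless); the VM name, the snapshot name and the exact sequence are assumptions, as the post does not show how the service drives VirtualBox.

```python
import subprocess
from typing import List


def restore_commands(vm_name: str, snapshot_name: str) -> List[List[str]]:
    """Return the VBoxManage calls to roll a VM back to a named snapshot."""
    return [
        ["VBoxManage", "controlvm", vm_name, "poweroff"],
        ["VBoxManage", "snapshot", vm_name, "restore", snapshot_name],
        ["VBoxManage", "startvm", vm_name, "--type", "headless"],
    ]


def restore(vm_name: str, snapshot_name: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for command in restore_commands(vm_name, snapshot_name):
        subprocess.run(command, check=True)


# Hypothetical names, printed rather than executed:
for command in restore_commands("dbs-builds", "nightly"):
    print(" ".join(command))
```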

I created a small web interface as well:

[Screenshot: database services (dbs)]

With database services (dbs) the following goals have been achieved:

  • a developer has access to an anonymised production database that is never older than 24hrs
  • the dev can either download the dump and run it on their local machine, or directly connect to “dbs” (database services), which will be especially fast from within the office
  • due to the usage of snapshots, should anything happen to the database it is possible to restore the state of last night in less than a minute (!!), which is much faster than any AWS RDS snapshot restore and does not involve any config changes (i.e. it is an in-place restore)
  • Staging & QA still use a shared DB in the cloud, however due to the separation, issues on either side do not interfere with the whole team

Dbs has been running for quite a while and it solved a good amount of issues. However we were still getting the occasional botched staging deploy or failed master-build due to us only running a very optimistic/superficial check on database migrations.

This is because we only ran artisan migrate against an empty database in our CI builds (for predictability reasons). Meaning, builds would only fail if there was a PHP or SQL syntax error, not if the migration itself was faulty on production data. The easiest way to demonstrate a failure is adding a unique index on a column: perfectly fine on an empty database, not so much on production with duplicate values potentially already existing.
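The unique-index case is easy to reproduce in isolation. The sketch below uses an in-memory SQLite database instead of MySQL purely so that it is self-contained; the table and index names are made up.

```python
import sqlite3
from typing import Iterable


def try_add_unique_index(emails: Iterable[str]) -> bool:
    """Create a table holding the given rows, then try to add a unique index."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
        conn.executemany(
            "INSERT INTO users (email) VALUES (?)", [(email,) for email in emails]
        )
        try:
            conn.execute("CREATE UNIQUE INDEX users_email_unique ON users (email)")
        except sqlite3.DatabaseError:
            return False  # existing duplicates make the "migration" fail
        return True
    finally:
        conn.close()


print(try_add_unique_index([]))                                  # True
print(try_add_unique_index(["a@example.com", "a@example.com"]))  # False
```

The same statement passes or fails depending only on the data already in the table, which is exactly what a build against an empty database cannot detect.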

Step 2: run builds against prod data

The safest way to make sure that your database migrations are sound is to actually run them against production data, as that is ultimately what will happen on a production deploy anyway.

Fortunately we do not need to run every build against a prod snapshot, as we are only interested in changes within the /database/migrations/ folder.

I created an additional Jenkins job that runs on every PR and with the help of a little bash + the Github API, I can check if a migration was actually part of the code changes or not, and only then will the build further proceed.
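The decision itself is small once the list of changed files is known. The sketch below assumes the file paths have already been fetched (for example from GitHub's "list pull request files" endpoint, which returns a filename per changed file); the post uses bash for this and does not say which endpoint it calls, so treat the function and its input as hypothetical.

```python
from typing import Iterable

MIGRATIONS_DIR = "database/migrations/"


def touches_migrations(changed_files: Iterable[str]) -> bool:
    """Return True if any changed path lives in the migrations folder."""
    return any(
        path.lstrip("/").startswith(MIGRATIONS_DIR) for path in changed_files
    )


changed = [
    "app/Http/Controllers/UserController.php",
    "database/migrations/2017_01_01_000000_add_unique_email_index.php",
]
print(touches_migrations(changed))
```

A PR that returns False here can skip the expensive restore-and-migrate build entirely.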

I am taking advantage of dbs from step 1: thanks to its fast restore capability I can run artisan migrate nearly every minute without the DB losing its original state, which of course is important for repeatable builds.

Once done, it reports back the time it took, which is a nifty indicator of whether the db migration is something heavy where an elevated error rate might be expected:

[Screenshot: GitHub build statuses]
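Measuring the migration and turning the result into a short status text might be sketched like this. The command passed in would be the real migrate call in a build; the wording of the status message and the helper names are assumptions, and posting the status back to GitHub is left out.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def timed_run(command: List[str]) -> Tuple[bool, float]:
    """Run a command and return (succeeded, duration in seconds)."""
    start = time.monotonic()
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, time.monotonic() - start


def status_description(ok: bool, seconds: float) -> str:
    """Build the one-line text shown next to the build status."""
    verdict = "succeeded" if ok else "failed"
    return f"DB migration {verdict} in {seconds:.0f}s"


# Stand-in command so the sketch runs anywhere; a real job would run the migration here.
ok, seconds = timed_run([sys.executable, "-c", "pass"])
print(status_description(ok, seconds))
```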

The console output of the job gives a little more indication of what is happening and why the build got triggered:


Setting up a proper db build pipeline and fully integrating it into our CI achieved the following goals:

  • full confidence in any database migrations being introduced
  • full visibility on the duration of database migrations as a “pre-warning” of potential problems later during the production deploy
  • due to the usage of “dbs” (i.e. real restorable snapshots) this can be done cheaply and fast (3 min builds) even for larger databases (>10GB)


So, I’m curious: what problems did you have to solve for your database workflow / environment, and what solutions did you come up with? 🙂

Killing MySQL Slow Queries with Xcache

I currently manage a high-traffic image hoster with 10 million page impressions per day, which has been causing high load on the web frontend server and the DB backend server for some months now. My budget did not allow me to scale horizontally, so I had to optimize the web application by killing a slow MySQL query with the help of Xcache. Due to the website’s structure I was not able to use the Smarty caching function, as this would easily generate 2 million files and cause high disk I/O.

A picture says more than words, so have a look at the screenshot of the MySQL server’s load before and after the optimizations (which went online on 6th October). It’s like day and night 😉

Our web server uses Lighttpd 1.5 / SVN + PHP-FPM 5.3.3 (guys.. spawn-fcgi is deprecated 😉 ) + Xcache (PHP accelerator and varcache) to deliver static files and dynamic pages which connect to a separate MySQL 5.0 server.

Unfortunately with a load of 5-10 (8 CPU cores) and 60-100% CPU usage on each core (!!) our MySQL Server was pretty much overloaded 😉 .

A big downside of having bottlenecks in your PHP script, usually caused by relying on external resources (like file_get_contents, cURL, massive non-asynchronous DNS lookups, MySQL queries, etc.), is obviously the much higher execution time. This results in a lot of PHP (or even worse, Apache) processes being spawned or in use. You will easily get an overfilled backlog, or in the worst case your web server will start swapping. Either way your website will slow down dramatically and you will lose a lot of visitors.

At first I checked the php-fpm.log.slow for scripts with too long execution times, just to make sure that this was not a PHP problem. There were a lot of scripts hanging during mysql_query() – so it was pretty clear where to look next.

Next I took a look at the MySQL slow query log and summarized the queries which appeared most often. I was able to filter out the following query (simplified):

SELECT DISTINCT col1, col2 FROM table WHERE col3 = col4 AND id IN (SELECT id FROM table2 WHERE x = $variable) AND (SELECT id FROM table3 WHERE a = $variable) OR col5 = 1

A query with two sub-SELECTs and DISTINCT did not sound fast to me, especially not at the frequency it was requested, which was the key factor: running it on an otherwise idle MySQL server did not cause any problems.

Before putting the query into Xcache I checked all conditions and figured out that “… OR col5 = 1” was never true, as currently no data had that value. I decided that if a feature / function based on that condition had not been used for years, it would not be needed in future either, so I removed it.

Now I was finally ready for Xcache. This is only a very simple example of how to cache individual SQL queries like

SELECT * FROM table WHERE name = '$variable'

in your PHP-Script:

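if (xcache_isset("prefix_" . md5($variable))) {
    $result = xcache_get("prefix_" . md5($variable)); // cache hit: reuse the stored rows
} else {
    $result = array(); // cache miss: query MySQL and keep the rows, as a query resource cannot be cached
    $res = mysql_query("SELECT * FROM table WHERE name = '$variable'");
    while ($row = mysql_fetch_assoc($res)) $result[] = $row;
    xcache_set("prefix_" . md5($variable), $result, (60 * 60 * 6)); // keep for 6 hours
}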

So from now on, this SQL query will only be run once every 6 hours for each distinct value of $variable. Remember: we don’t want to fill up our Xcache for no reason, so try to SELECT only the columns you really need. The biggest advantage of Xcache compared to other caching systems (e.g. Smarty Cache): it has a garbage collector! So you don’t need to worry about zombie cache entries. Just try not to run out of memory, i.e. assign enough memory for your needs in the php.ini under the xcache section.

And set a reasonable time-to-live: not too short, so that enough data gets cached and the load goes down, but not too long, which could cause excessive memory usage.
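The same check-cache-first pattern with a time-to-live, stripped of Xcache and MySQL, can be sketched in a few lines. This is a language-neutral illustration in Python, not Xcache's API: the class, its method name and the injectable clock are all made up for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after a time-to-live."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(
        self, key: str, ttl_seconds: float, compute: Callable[[], Any]
    ) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the entry has not expired yet
        value = compute()  # cache miss: do the expensive work once
        self._store[key] = (now + ttl_seconds, value)
        return value


cache = TTLCache()
rows = cache.get_or_compute("prefix_abc", 6 * 60 * 60, lambda: ["row1", "row2"])
print(rows)
```

Unlike Xcache, this sketch never evicts expired entries on its own, which is the part the garbage collector takes care of.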

That’s all 🙂

I was able to lower the load and CPU usage of our MySQL server by approx. 850%! How about you, were you able to optimize your website? Show off your awesomeness! I did, by committing to svn with the following comment 🙂