What role does your database play in your CI/CD?
How long does it take for your devs to get a running database?
How long does it take to recover a dev database after an accidental destruction?
How current are the database snapshots your devs use?
How confident are you with schema updates going to production?
These days DevOps conferences and talks are filled with containerisation, Docker, k8s, auto-scaling, auto-healing, CI/CD, agile, etc. Disappointingly, however, most of them only touch on stateless environments, and far too seldom do engineers share their knowledge on running a database in a CI/CD environment & workflow.
In this blog article I will give some info on how we solved it at my current workplace, where we mainly run a Laravel-based web application on a traditional “LAMP” stack.
I’ll be honest, we had a rough start. We used to have a shared AWS RDS instance for our QA & Staging environments. Developers also connected their local workspace to that remote MySQL instance, so they could view the webapp locally with proper, non-seeded data – which is sometimes simply essential to debug and fix certain types of reported bugs.
So, our setup at the time kinda worked, but was super unreliable. All it took was a dev accidentally dropping the database, or a botched staging deploy, to suddenly kill the workflow of the whole team. Restoring took over an hour (a mix of larger-than-your-usual-wordpress-database and budget-restrictions-on-dev-instances).
Obviously this was super annoying and had to change. So I went over to my friends at the ZATech Slack channel, but I quickly hit a wall. Quite the contrary, it seems I stepped on some people’s toes: I learned my lesson, never mention “on-premise” near a DevOps engineer (causes hefty allergic reactions)…
Basically the following two statements were made:
- everything is in the cloud, no on-premise and no local databases
- DB should be part of the CI/CD
It was difficult for me to agree with the first point: being based in South Africa, there are absolutely no proper cloud providers here – the next hop is AWS London. And anyone who has ever connected their local webapp to a remote MySQL knows how quickly a higher latency (>10ms) can make working locally a pita.
While I do agree that the DB should be part of the CI/CD, there is still a huge benefit (especially in efficiency and speed) in developing locally – and in not having to rely on seeded data.
Disappointed by the lack of solutions, I decided to go my own way against all odds, with the support of our CTOs plus a wonderful person in our finance team who allocated some budget for on-premise hardware (specs for geeks like me: i7-6700 / 64GB RAM / 4x 256GB SSD @RAID10 / UPS).
Step 1: create a database service
We will use the database service to actually host the databases. I use Jenkins to run a simple nightly downstream job that runs mysqldump on the production database (ignoring some larger tables that are not needed), anonymises the data (emails + mobile numbers), and pushes the dump to a predictable location (which is accessible internally by devs).
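A minimal sketch of how such a nightly job could assemble its commands, written in Python purely for illustration. The host `prod-db.internal`, the `users` table and its `email` / `mobile` columns are hypothetical names, and running the anonymisation against a scratch copy rather than production is an assumption; only `--ignore-table` and `--single-transaction` are real mysqldump options.

```python
def build_dump_command(database, ignored_tables, host="prod-db.internal"):
    """Return the mysqldump argv for a dump that skips the given tables.

    The host name is a placeholder, not the real production host.
    """
    cmd = ["mysqldump", f"--host={host}", "--single-transaction"]
    # --ignore-table takes the fully qualified db.table form, once per table.
    cmd += [f"--ignore-table={database}.{table}" for table in ignored_tables]
    cmd.append(database)
    return cmd


def build_anonymise_sql(table="users"):
    """Return SQL that overwrites emails and mobile numbers with fake values.

    Table and column names are hypothetical; the statement is assumed to be
    run against a scratch copy of the data, never against production.
    """
    return (
        f"UPDATE {table} SET "
        "email = CONCAT('user', id, '@example.com'), "
        "mobile = CONCAT('000', LPAD(id, 7, '0'));"
    )


if __name__ == "__main__":
    print(" ".join(build_dump_command("app", ["audit_log", "sessions"])))
    print(build_anonymise_sql())
```

Deriving the fake values from the primary key keeps them unique, which matters if the real columns carry unique indexes.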
From there, the database service launches three VMs (one shared amongst devs, one for experimental test cases / usage, one for our automated builds – see step 2) that have MySQL running on them, imports the above dump, then creates a snapshot of the storage drive. I use VirtualBox as I had extensive experience using it programmatically, but if I were to redo the architecture I would most probably do it with libvirt/qemu.
I created a small web interface as well.
With database services (dbs) the following goals have been achieved:
- a developer has access to an anonymised production database that is never older than 24hrs
- the dev can either download the dump and run it on their local machine, or directly connect to “dbs” (database services) – which will be especially fast from within the office
- due to the usage of snapshots, should anything happen to the database it is possible to restore the state of last night in less than a minute (!!) – which is much faster than any AWS RDS snapshot restore, and as it is an in-place restore it does not involve any config changes
- Staging & QA still use a shared DB in the cloud, however due to the separation, issues on either side do not interfere with the whole team
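The in-place restore from the list above could look roughly like the sketch below. It assumes the nightly snapshot is a regular VirtualBox VM snapshot driven through `VBoxManage`; the VM name `dbs-dev` and snapshot name `nightly` are made up for the example, and the real dbs tooling may well work differently.

```python
import subprocess


def restore_commands(vm_name, snapshot_name):
    """Return the VBoxManage calls that roll a VM back to a named snapshot."""
    return [
        # A running VM has to be stopped before its snapshot can be restored.
        ["VBoxManage", "controlvm", vm_name, "poweroff"],
        ["VBoxManage", "snapshot", vm_name, "restore", snapshot_name],
        ["VBoxManage", "startvm", vm_name, "--type", "headless"],
    ]


def restore_snapshot(vm_name, snapshot_name, run=subprocess.run):
    """Power off, restore and restart the VM; `run` is injectable for testing."""
    poweroff, restore, start = restore_commands(vm_name, snapshot_name)
    # poweroff fails when the VM is already stopped, so its result is ignored.
    run(poweroff, check=False)
    run(restore, check=True)
    run(start, check=True)


if __name__ == "__main__":
    for cmd in restore_commands("dbs-dev", "nightly"):
        print(" ".join(cmd))
```

Because the VM keeps its name and address, clients reconnect to the same host afterwards, which is why no config changes are needed.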
Dbs has been running for quite a while and has solved a good number of issues. However, we were still getting the occasional botched staging deploy or failed master build, because we were only running a very optimistic/superficial check on database migrations.
This is because our CI builds only ran artisan migrate (laravel.com/docs/5.4/migrations) against an empty database (for predictability reasons). This meant builds would only fail on a PHP or SQL syntax error, not if the migration itself was faulty on production data. The easiest way to demonstrate a failure is to add a unique index on a column: perfectly fine on an empty database, not so much on production, where duplicate values may already exist.
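The unique-index failure is easy to reproduce. The sketch below uses Python's built-in SQLite instead of MySQL, purely so that it runs anywhere; the `users` table and `email` column are illustrative names. MySQL rejects the same operation with a duplicate-entry error.

```python
import sqlite3


def can_add_unique_index(emails):
    """Try to add a unique index on users.email for the given existing rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
        conn.executemany(
            "INSERT INTO users (email) VALUES (?)", [(e,) for e in emails]
        )
        try:
            # This is the "migration": harmless on an empty table, but it
            # fails as soon as the existing data contains duplicates.
            conn.execute("CREATE UNIQUE INDEX users_email_unique ON users (email)")
            return True
        except sqlite3.DatabaseError:
            return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("empty database:", can_add_unique_index([]))
    print("with duplicates:", can_add_unique_index(["a@example.com", "a@example.com"]))
```

The same migration passes or fails depending only on the data it meets, which is exactly what an empty CI database can never show.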
Step 2: run builds against prod data
The safest way to make sure that your database migrations are sound is to actually run them against production data, as that is ultimately what will happen on a production deploy anyway.
Fortunately we do not need to run every build against a prod snapshot, as we are only interested in whether anything within the /database/migrations/ folder changes.
I created an additional Jenkins job that runs on every PR. With the help of a little bash + the GitHub API, I can check whether a migration was actually part of the code changes, and only then will the build proceed further.
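The check itself is small. Here is a rough Python equivalent (the actual job uses bash), assuming the PR's changed files come from GitHub's "list pull request files" endpoint. The function names are made up, and the fetch only reads the first page of up to 100 files, so very large PRs would need pagination.

```python
import json
import urllib.request

MIGRATIONS_PREFIX = "database/migrations/"


def touches_migrations(changed_files, prefix=MIGRATIONS_PREFIX):
    """Return True if any changed path lives inside the migrations folder."""
    return any(path.startswith(prefix) for path in changed_files)


def fetch_changed_files(owner, repo, pr_number, token):
    """Fetch the file paths changed by a PR (first page only, max 100 files)."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/pulls/{pr_number}/files?per_page=100"
    )
    request = urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return [entry["filename"] for entry in json.load(response)]


if __name__ == "__main__":
    example = ["app/User.php", "database/migrations/2017_01_01_000000_add_index.php"]
    print("run db build:", touches_migrations(example))
```

Keeping the decision in a pure function separate from the HTTP call makes it trivial to test without hitting the API.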
I am taking advantage of dbs from step 1: thanks to its fast restore capability I can run artisan migrate nearly every minute without the DB losing its original state, which is of course important for repeatable builds.
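A sketch of the migrate step with timing, assuming the job simply wraps the artisan call and measures wall-clock time. The helper name `timed_run` is made up; `--force` is the artisan option that skips the confirmation prompt in production-like environments.

```python
import subprocess
import time

# Assumed invocation; the real job may pass different options.
MIGRATE_CMD = ["php", "artisan", "migrate", "--force"]


def timed_run(cmd, run=subprocess.run):
    """Run a command and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    result = run(cmd)
    elapsed = time.monotonic() - start
    return result.returncode, elapsed


def report(cmd=MIGRATE_CMD):
    """Run the migration and print a one-line summary for the build log."""
    code, seconds = timed_run(cmd)
    status = "ok" if code == 0 else f"failed (exit {code})"
    print(f"migration {status} after {seconds:.1f}s")
    return code
```

A monotonic clock is used so the measurement is not thrown off if the system clock gets adjusted during a long migration.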
Once done, it will report back the time it took, which is a nifty indicator of whether the db migration is something heavy where an elevated error rate might be expected.
The console output of the job gives a little more indication of what is happening and why the build got triggered.
Setting up a proper db build pipeline and fully integrating it into our CI achieved the following goals:
- full confidence in any database migrations being introduced
- full visibility on the duration of database migrations as an early warning of potential problems during the later production deploy
- due to the usage of “dbs” (i.e. real restorable snapshots) this can be done cheaply and fast (3 min builds) even for larger databases (>10GB)
So, I’m curious: what problems did you have to solve for your database workflow / environment, and what solutions did you come up with? 🙂