This is an experiment in cross-posting something from my internal blog with my employer Atlassian. I thought it may be be of general interest so I’ve edited it a bit and shared it here.
We recently did a regular zero downtime upgrade of our order processing system. This is a fairly routine event, we use Bamboo deployments to push from git master to our dev, staging and then production environments every day. The vast majority of these upgrades go smoothly without serious regressions or problems. On this particular day the application did not come back properly, we had an outage! We treat these very seriously because the ordering system is a big monolithic do-all-the-things type of application. An outage means customers cannot make new purchases, view their license keys, sign up for OnDemand, view prices on Marketplace and many other things.