Yesterday we suffered a one-hour outage across our hosted network, and I'd like to take a minute to dive into the root cause.
Over the last week or so we have been working on aggregating all of Vanilla's access, error, and security logs and shipping them to a central, searchable storage location. After testing the finalized approach on our staging systems without incident, we pushed the functionality out to our shared clusters to start gathering real production data.
Vanilla uses a managed hosting provider - that is, they maintain the hardware and some of the software used to allocate resources across those physical servers. Unbeknownst to us, the logging activity was generating a large number of DNS lookup queries, because each log message is transmitted separately to the central repository. During testing on staging clusters, the volume was still acceptably low, but once the system entered production the number of queries per second passed a threshold in our datacenter and our hosting provider took notice.
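To make the failure mode concrete, here is a hypothetical sketch (not our actual shipping code; the hostname and resolver are stand-ins for illustration): when each message is transmitted separately and each transmission resolves the destination hostname, DNS traffic grows linearly with log volume.

```python
# Hypothetical sketch of the failure mode: one DNS query per log line.

LOG_HOST = "logs.example.com"  # illustrative central repository, not a real endpoint

dns_queries = 0

def resolve(host):
    """Stand-in for a real DNS lookup; we count calls instead of querying."""
    global dns_queries
    dns_queries += 1
    return "192.0.2.10"  # RFC 5737 documentation address

def ship(message):
    addr = resolve(LOG_HOST)  # a fresh lookup for every single message
    # ... open a connection to addr and transmit the message ...

for line in range(10_000):
    ship(line)

print(dns_queries)  # 10000 - one query per log line
```

At production log volumes, that multiplier is what pushed our query rate past the datacenter's threshold.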
Our hosting provider, seeing an increase in DNS queries, immediately and punitively rate-limited our environment without any warning to us. In our opinion, this was a gross overreaction. Since many parts of the Vanilla service use DNS as part of their normal operations, this rate limiting caused the outage. We reacted within a minute and started troubleshooting, but it took quite a while to get to the bottom of what was happening.
Ultimately, we are responsible for ensuring that our service functions without interruption. In response to this, we have implemented some technical and policy changes.
1. On the technical side, we've installed local DNS caching agents throughout our environment. These agents prevent common DNS queries from being re-executed over and over, and also speed up DNS resolution in general, which should improve overall performance.
2. Policy-wise, we've started a discussion with our hosting provider to ensure that our environment is never punitively restricted again without due notice to Vanilla Operations. This is non-negotiable going forward, and we will make whatever changes are needed to obtain this assurance.
3. We have started working on gaining more direct control over the physical parts of our network, and better oversight where direct control is not possible, so that we can see the impact of changes like this before they get out of hand.
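The caching agents in point 1 are system-level daemons, but the core idea can be sketched in a few lines: memoize name-to-address resolutions so repeated lookups of the same name are answered locally instead of generating fresh queries. This is a minimal in-process illustration, not our actual deployment:

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=4096)
def resolve_cached(hostname):
    """Resolve a hostname, answering repeat lookups from the local cache."""
    return socket.gethostbyname(hostname)

# The first call performs a real lookup; subsequent calls for the same
# name are cache hits and generate no new DNS traffic.
addr = resolve_cached("localhost")
addr_again = resolve_cached("localhost")
```

A real caching resolver also honors record TTLs and expires entries, which this sketch ignores; the point is simply that the query rate leaving the host collapses to roughly one per distinct name.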
It is a frustrating irony that in our quest to deploy tools that would allow us to gain deeper insight into our environment, we suffered an outage due to a lack of insight.
You can always stay up to date on the status of the Vanilla network by visiting http://status.vanillaforums.com