New Dynamic Hosting Infrastructure, Outages and Transparency

October 21, 2011

The result of what we have learned is a new hosting platform that we have lovingly named “Infrastructure”, and we have begun gradually migrating all of our hosted customers over to it. Our goal is to make this transition as seamless as possible, with zero downtime.


Despite our best efforts, a handful of our customers have experienced two periods of downtime over the last seven days:

Oct. 17, 5:30PM – 6:45PM
After moving a group of our customers to Infrastructure, we needed to make some DNS changes using our Rackspace Cloud DNS control panel. These changes should not have caused any downtime. Right after our lead engineer made the change, the Rackspace Cloud control panel became unresponsive and then went down completely. As a result, both his DNS changes and our earlier ones were wiped out. He contacted Rackspace Cloud support to get everything back online, but the damage was done. Although we handled the problem within minutes of it occurring, new DNS records take time to propagate, and that delay is what caused the downtime. It is incredibly unfortunate that our underlying hosting provider’s system failed us at such an inopportune time.

Oct. 18, 5:40PM – 5:45PM
On Infrastructure we were using a PHP module called APC to improve our performance. When the servers became unresponsive, we immediately investigated the issue and discovered that APC was causing our PHP processes to reload and then refuse to take on new requests. Normally, when a web-head experiences problems, our system detects it and our load balancer stops sending clients to that machine. In this case the error did not cause PHP processes to fail; instead it created PHP processes that simply refused to do anything, so clients were still sent to the affected machine and received an error. We immediately removed the APC module from all of our web servers. Further investigation revealed that APC is not as reliable as our initial research and environment testing had led us to believe. It seems that APC relies on shared memory, and when we steadily added more requests than the system was used to handling, it began to fail. Thanks to the reporting built into Infrastructure, this outage was reported and handled within minutes of occurring.

We have removed the module from our servers and have increased the number of web servers that we use (almost doubled) in order to maintain availability for our customers.
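The Oct. 18 failure, where processes stayed alive but refused to do any work, is the kind of problem that a check for "is the process running?" or "does the port accept connections?" can miss. As a rough illustration only (this is not the check used on Infrastructure, and every name, port and timeout below is a hypothetical chosen for the example), here is a minimal Python sketch of a health probe that counts a web-head as healthy only if it actually answers an HTTP request with a success or redirect status within a deadline:

```python
import socket


def is_ok_status(status_line: bytes) -> bool:
    """Return True if an HTTP status line carries a 2xx or 3xx code."""
    parts = status_line.split()
    return len(parts) >= 2 and parts[1][:1] in (b"2", b"3")


def probe_http(host: str, port: int, path: str = "/", timeout: float = 2.0) -> bool:
    """Hypothetical health probe: send a real request and require a real answer.

    A backend that accepts the connection but never responds (a process that
    is alive but stuck) times out here and is reported as unhealthy.
    """
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(request)
            # Only the first line of the response is needed for the status code.
            status_line = conn.recv(1024).split(b"\r\n", 1)[0]
    except OSError:
        # Covers refused connections, resets and timeouts alike.
        return False
    return is_ok_status(status_line)


def healthy_backends(backends, probe=probe_http):
    """Keep only the (host, port) pairs whose probe succeeds."""
    return [backend for backend in backends if probe(*backend)]
```

A load balancer loop could call `healthy_backends` on a schedule and route traffic only to the machines it returns. The design choice worth noting is that the probe exercises the same path a client would, so a process that is running but idle fails the check just as a crashed one does.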


Despite these unfortunate problems, we remain incredibly excited about our new hosting infrastructure. Over the next six weeks we will continue to migrate our hosting customers at every level onto the new system. Once the migration is complete, a page that we have built will be available to all of our customers, reporting the status of our network in real time and including a history of any outages that have occurred.

We need our customers to know that availability and speed are core features of our service, so we are opening up the doors and windows and letting people inside the house to see what makes this company great.


Written by Mark O'Sullivan