Our New Dynamic Hosting Infrastructure
Over the last 24 months we have learned a lot about scaling high-availability, highly uncacheable dynamic content. As our customer base has grown, and we've taken on communities that get as much as ten million page views per month, we've made great strides both in architecting our platform to handle that scale and in building a dynamic cloud-based solution to handle that load.
The result of what we've learned is a new hosting platform that we have lovingly called "Infrastructure", and we have begun gradually migrating all of our hosted customers over to it. Our goal is to make this transition as seamless as possible, with zero downtime.
Despite our best efforts, a handful of our customers have experienced two periods of downtime over the last seven days:
Oct. 17, 5:30PM - 6:45PM
After moving a group of our customers to Infrastructure, we needed to make some DNS changes using our Rackspace Cloud DNS control panel. These changes should not have caused any downtime. Right after our lead engineer made the change, the Rackspace Cloud control panel became unresponsive and then went down completely. As a result, both his new DNS changes and our existing DNS records were wiped out. He contacted Rackspace Cloud support to get everything back online, but the damage was done. Although we handled the problem within minutes of its occurrence, new DNS records take time to propagate, and that delay resulted in the downtime. It is incredibly unfortunate that our underlying hosting provider's system failed us at such an inopportune time.
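To illustrate why a DNS fix does not take effect everywhere at once: a resolver that cached an answer keeps serving it until that record's time-to-live (TTL) runs out. The sketch below, in Python, works through the arithmetic. The one-hour TTL and the function name are assumptions chosen for illustration, not the values from our DNS zone.

```python
def seconds_until_refresh(cached_seconds_before_fix, ttl_seconds):
    """How long after a DNS fix a resolver keeps serving its cached answer.

    cached_seconds_before_fix: how long before the fix the resolver cached
    the old answer. ttl_seconds: the TTL of that cached record.
    """
    # The cache entry expires ttl_seconds after it was stored; anything
    # already expired at fix time is re-queried immediately.
    return max(0, ttl_seconds - cached_seconds_before_fix)


HYPOTHETICAL_TTL = 3600  # assumed one-hour TTL, for illustration only

print(seconds_until_refresh(10, HYPOTHETICAL_TTL))    # 3590
print(seconds_until_refresh(3000, HYPOTHETICAL_TTL))  # 600
print(seconds_until_refresh(4000, HYPOTHETICAL_TTL))  # 0
```

In the worst case a resolver cached the old answer just before the fix, so the wait approaches the full TTL, regardless of how quickly the records themselves were corrected.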
Oct. 18, 5:40PM - 5:45PM
On Infrastructure we were using a PHP module called APC to improve performance. When the servers became unresponsive, we investigated immediately and discovered that APC was causing our PHP processes to reload and refuse new requests, so we removed the APC module from all of our web servers. Normally, when a web-head experiences problems, our system detects it and our load balancer stops sending clients to that machine. In this case the error didn't cause PHP processes to fail; instead it created PHP processes that refused to do anything, so clients were still sent to the affected machine and received an error. Further investigation revealed that APC is not as reliable as our initial research and environment testing had led us to believe. It seems that APC relies on shared memory, and when we steadily added more requests than the system was used to handling, it began to fail. Thanks to the reporting built into Infrastructure, this outage was detected and handled within minutes of its occurrence.
We have removed the module from our servers and have increased the number of web servers that we use (almost doubled) in order to maintain availability for our customers.
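The Oct. 18 failure mode, in which processes stay alive but stop doing useful work, is the kind of problem that a check for "is the process running?" cannot catch. As an illustration only (this is not the check Infrastructure runs), here is a minimal Python sketch of an application-level health check. The function names, the expected `ok` body, and the injectable `opener` parameter are all assumptions made for the example.

```python
import http.client
import urllib.request


def is_backend_healthy(url, expected_body=b"ok", timeout=2.0,
                       opener=urllib.request.urlopen):
    """Return True only if `url` answers HTTP 200 with the expected body
    within `timeout` seconds. A live process that never answers counts
    as unhealthy."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().strip()
            return response.status == 200 and body == expected_body
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


def healthy_backends(urls, **kwargs):
    """Keep only the health-check URLs whose backends passed the check."""
    return [url for url in urls if is_backend_healthy(url, **kwargs)]
```

Because the check requires a complete, correct response within a deadline, a web-head whose processes accept connections but never answer would be dropped from rotation just like one that has crashed.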
Despite these unfortunate problems, we remain incredibly excited about our new hosting infrastructure. Over the next six weeks we will continue to migrate our hosting customers at every level onto the new system. We have also built a page that reports the status of our network in real time, including a history of any outages that have occurred. Once the migration is complete, it will be available to all of our customers.
We need our customers to know that availability and speed are core features of our service at VanillaForums.com, so we are opening up the doors and windows and letting people inside the house to see what makes this company great.