In the aftermath of Intermedia’s extended outage, an important lesson to be learned for SaaS providers

As a current (and reasonably long-time) customer of Intermedia.com, a SaaS Exchange hosting provider, we at OleOle were naturally affected to some extent by Intermedia’s extended system outage on March 5th, 2010. For pretty much the entire morning on that day, we, along with thousands of their other customers, had zero email capability: no sending, no receiving, zilch.

To make matters worse, during a large part of this outage, Intermedia’s own website was unavailable, so affected customers could not even go onto the Intermedia website to check for status updates or open support tickets. Needless to say, their PBX was being bombarded by thousands of irate customers as well, so getting someone on the line for an update wasn’t that easy either. Twitter ended up being the best source of updates, first from other customers who tweeted what info they could glean, and then later from Intermedia’s own Twitter account when they managed to get more caught up and started giving out some official updates.

Today I received their formal RFO (Reasons for Outage) letter via email, which goes into great detail describing why this outage occurred and what steps they are taking to prevent a recurrence for the same reasons in future. In a nutshell, there was a hardware failure in one of their EMC SAN devices, and the failure occurred in a way that prevented the device’s own built-in fault tolerance mechanisms from keeping the SAN effectively “up”. In other words, they are saying this is one of those failures that should not have happened. These devices are designed precisely NOT to fail under such circumstances, but nonetheless this one did.

Intermedia’s letter goes on to describe the actions they are taking along with the hardware vendor to guard against this in future. All well and good. Now on to the little gem in the letter that I found the most surprising, and from which all technologists with “uptime” responsibility for Software as a Service (SaaS) systems would do well to learn.

Here’s the bit that really caught my attention:

“During the event, our ability to communicate status effectively was hindered by an outage of our corporate communication tools until 9:50 a.m. PST. The databases for www.Intermedia.net, Intermedia’s client control panel and Intermedia’s trouble ticket system were located on the affected SAN and therefore were not available during the SAN event. These systems were restored as soon as the SAN performance issue was resolved. All available personnel were directed to answer incoming customer calls. Intermedia logged over 2,000 incoming calls to our PBX and effectively answered more than 1,000 of those calls.”

In hindsight it seems pretty obvious, doesn’t it? Why locate your “corporate communication tools” and “trouble ticket system” on the same infrastructure as the core service that you provide? In this case, the thinking might have been that the EMC SAN just couldn’t possibly fail, as it was inherently designed to be fault tolerant, and indeed, EMC SANs are extremely heavy duty devices with very good track records for what they do. Yet fail it did, and down with it came key parts of the foundation, all in one go. Or maybe it was to save on costs. Or maybe it was just a careless oversight. We don’t really know why, but we do know it was implemented that way and that it was clearly a flawed design decision.

Naturally, Intermedia themselves now intend to fix this:

“As a high priority for completion, no later than Q2, Intermedia will also be isolating corporate communication infrastructure from the same infrastructure that provides our Exchange services, guaranteeing that we will be able to communicate effectively with clients at all times during a service interruption.”

Again, this revision might seem like the obvious system and network design that should have been implemented from the get-go, especially for a SaaS provider as long in the business and as large as Intermedia. Yet it was not done that way, and it took an outage on this scale to force a change that now seems so obvious in hindsight. Certainly a lesson we can all learn from.
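
For anyone wanting to check their own setup for the same trap, here is a minimal Python sketch of the idea. It is not Intermedia’s tooling, and every system and component name in it (san-01, external-host and so on) is a made-up assumption for illustration. It takes a simple inventory of what each system depends on and flags any customer communication system that shares infrastructure with the core service.

```python
from typing import Dict, Iterable, Set

# Hypothetical inventory: each system mapped to the infrastructure it relies on.
# All names are invented for illustration; substitute your own.
DEPENDENCIES: Dict[str, Set[str]] = {
    "exchange_hosting": {"san-01"},   # the core service
    "corporate_website": {"san-01"},
    "control_panel": {"san-01"},
    "ticket_system": {"san-01"},
    "status_page": {"external-host"},  # hosted away from the core service
}


def shared_with_core(
    deps: Dict[str, Set[str]], core: str, support_systems: Iterable[str]
) -> Dict[str, Set[str]]:
    """Return each support system that shares components with the core
    service, mapped to the set of components they have in common."""
    core_deps = deps[core]
    overlaps: Dict[str, Set[str]] = {}
    for name in support_systems:
        common = deps[name] & core_deps  # set intersection
        if common:
            overlaps[name] = common
    return overlaps


if __name__ == "__main__":
    support = ["corporate_website", "control_panel", "ticket_system", "status_page"]
    found = shared_with_core(DEPENDENCIES, "exchange_hosting", support)
    for system, common in found.items():
        print(f"{system} shares {sorted(common)} with the core service")
```

Anything the check reports would go dark together with the core service. In this made-up inventory only the externally hosted status page would survive a failure of the shared SAN.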

Discussion

12 comments for “In the aftermath of Intermedia’s extended outage, an important lesson to be learned for SaaS providers”

  1. Looks like they are down again for past 15 hours. No service, all email down.

    Posted by badservice | March 13, 2010, 5:56 am
  2. Another great article David. A valuable lesson indeed.

    Posted by Todd Z | March 27, 2010, 1:55 am
  3. And yet down again today. Intermedia has lost sight of its core mission, and treats customers like irate “in-house” users. Problem is, irate “in-house” users don’t have a choice. We do. We’re moving some of our clients away from Intermedia.

    Posted by Rafi Kronzon | April 15, 2010, 2:52 pm
  4. We have completely been without email since today’s Intermedia fiasco began around 9 a.m. Eastern this morning. It is 8 p.m. and still no mail is coming or going.

    Posted by Eric Helmuth | April 15, 2010, 4:47 pm
  5. Still MAJOR PROBLEM.
    24 hours and email is not going in or out properly.
    As mentioned above most frustrating part is can’t reach anyone and if you don’t pay extra for 24/7 they won’t even answer the phone until 9 AM NY time.
    As IT manager my last 24 hours have been difficult to say the least.

    Posted by Lissa Halen | April 16, 2010, 5:24 am
  6. wow! my sympathies with those affected by this current outage – I know how frustrating this can be. It does appear though that it was not system wide as thankfully (for us!) Exchange services at my company have been fine during this time.

    Posted by David | April 16, 2010, 8:09 am
  7. They are down hard – still, over 2 days. Effectively taking us out of business.

    Posted by bitfarmer | April 16, 2010, 8:21 am
  8. yikes! 2 days running is totally unacceptable.

    Posted by David | April 16, 2010, 8:31 am
  9. Yes, at present it’s only one server cluster that’s down- EXCH020 – but it is down HARD and of course that’s the one we’re on. They had to stop accepting inbound mail connections. 2nd day without business email. Charming.

    Posted by Eric | April 16, 2010, 9:49 am
  10. *Disclosure I Work for an EMC Competitor*

    To be fair to EMC, the system in the initial failure did continue to operate but at a degraded service level. This is a characteristic of all modular storage arrays that work in high availability pairs. The level of degradation depends on the make of the array and the settings used, and it appears that Intermedia took the option of choosing a configuration that put data preservation ahead of performance. Unfortunately this choice, which is understandable, had unpleasant and unforeseen consequences.

    What surprises me is that the impact of failover either wasn’t tested (which is really hard to do in a multi-tenant environment), or mitigated by only using half the possible performance of the array.

    Having said that, even when only using half the possible performance on a mid-range array, maintaining that performance when half the array fails requires that write caching remains on, which is a dangerous practice because if the second controller fails, cached data can be lost.

    If I were Intermedia, I’d seriously consider looking for a SAN that is either easier to test under failure conditions, or doesn’t lose the majority of its performance when one of the components fails.

    Given the additional outages that have happened since, I’d expect that they’d be doing that as a matter of urgency.

    Posted by John Martin | April 27, 2010, 3:58 am
  11. Hi David,

    How’s life?

    Heard you left OleOle? Is everything going well with you.

    Bo

    Posted by Bo | May 6, 2010, 8:56 pm
  12. maybe off topic here, but does anyone have a good solution to move from intermedia? I am tired of answering the question, what is wrong with our email now. I have installed microsoft exchange 2010 and have actually started migrating people, but the task of downloading the pst files from intermedia is no fun.

    Posted by Chris | May 24, 2010, 6:28 pm
