And how it could have been avoided:

HealthCare.gov. Argh. The events related to this website were painful to watch unfold. They filled our newsfeed with the colossal fail that is “online enrollment.”

“Fail” because the website has not only been critiqued for being difficult to use, but it also crashed.

Crashed – as in seriously crashed. How in this day and age does a high-profile site like this crash?!

Not crash like a little blip for a minute or two, but a complete crash for hours on end. And it could have easily been avoided. We’ll tell you how.

What we’ve been told per several online reports is that the data center hosting HealthCare.gov, Verizon Terremark, experienced an “outage.” We can only assume this “outage” is a glaring screw-up, and the likely product of naïve IT operators believing that because they’re in a data center with redundancy, they’re protected from a crash. How wrong they are!

As a website operator, it’s elementary DR planning to assess the cost of downtime vs the cost of a failover solution, and to put into place automatic measures designed to mitigate any problems. Putting all your eggs in one data center basket has just been demonstrated to the whole country as a very bad idea.
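To make the downtime-versus-failover comparison concrete, here is a minimal sketch of the arithmetic. The function names and every input are our own illustrative assumptions, not figures from HealthCare.gov or any real deployment; plug in your own estimates.

```python
def expected_downtime_cost(outage_hours_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly loss from outages: hours down times cost of each hour down."""
    return outage_hours_per_year * cost_per_hour


def failover_pays_off(
    outage_hours_per_year: float,
    cost_per_hour: float,
    failover_cost_per_year: float,
) -> bool:
    """True when the expected downtime loss exceeds the yearly cost of failover."""
    loss = expected_downtime_cost(outage_hours_per_year, cost_per_hour)
    return loss > failover_cost_per_year


# Hypothetical inputs, for illustration only.
if __name__ == "__main__":
    print(failover_pays_off(
        outage_hours_per_year=8,
        cost_per_hour=10_000,
        failover_cost_per_year=50_000,
    ))
```

The comparison is deliberately crude: it ignores reputational damage, which for a high-profile site can dwarf the direct cost of the lost hours.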

If it sounds like we’re being harsh in our critique, it’s because this is our area of expertise: We help companies avoid painfully stupid events like this. How often do you see high-profile websites like American Airlines or Walgreens go dark?

In this day and age, Disaster Recovery 101 always includes a failover system that triggers into effect the moment there’s an issue, so that the user’s experience is uninterrupted and business continuity is preserved. (And the media doesn’t have a field day.)

The thing that really gets us about this hot mess is that it’s not even that expensive to accommodate failover capabilities for the HealthCare.gov infrastructure – the web server is simply not that big.

While we don’t anticipate winning the business of HealthCare.gov for a DR solution, we could certainly handle it. Here’s what we’d do to make sure millions of Americans actually connected to the site when they discovered that their healthcare had been cancelled:

1)   Set up an active-active failover scenario, which is exactly what mega sites like airlines or valuable ecommerce sites do.

2)   Set up a load balancing scenario between multiple data centers so if one of your data centers has a problem, your site can be served from your redundant data center.
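The two steps above can be sketched in a few lines. This is a toy model under stated assumptions, not a production load balancer: `DataCenter`, `ActiveActiveRouter`, and the `is_healthy` probe are hypothetical names we made up for illustration, and a real setup would typically do this at the DNS or global load balancer layer. Both data centers take traffic in turn (active-active), and any center that fails its health check is skipped.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataCenter:
    name: str
    # Hypothetical health probe; in practice this would be an HTTP or TCP check.
    is_healthy: Callable[[], bool]


class ActiveActiveRouter:
    """Round-robin across data centers, skipping any that fail their health check."""

    def __init__(self, centers: List[DataCenter]) -> None:
        if not centers:
            raise ValueError("at least one data center is required")
        self._centers = list(centers)
        self._next = 0  # index of the center to try first on the next request

    def route(self) -> str:
        """Return the name of the data center that should serve the next request."""
        n = len(self._centers)
        for offset in range(n):
            idx = (self._next + offset) % n
            center = self._centers[idx]
            if center.is_healthy():
                self._next = (idx + 1) % n
                return center.name
        raise RuntimeError("no healthy data center available")


if __name__ == "__main__":
    status = {"east": True, "west": True}  # simulated health state
    router = ActiveActiveRouter([
        DataCenter("east", lambda: status["east"]),
        DataCenter("west", lambda: status["west"]),
    ])
    print([router.route() for _ in range(4)])  # traffic alternates between centers
    status["east"] = False                     # simulate an outage in one center
    print([router.route() for _ in range(3)])  # all traffic moves to the survivor
```

The point of the sketch is that the failover decision is made per request and automatically, so an outage in one data center never requires a human to notice and react.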

Bottom line: With appropriate disaster recovery pre-planning that included a failover solution, HealthCare.gov would never have been affected by the failure of a single data center, and we could all get on with the news about what should be real-world issues.