And how it could have been avoided:
HealthCare.gov. Argh. The events related to this website were painful to watch unfold. They filled our newsfeeds with the colossal fail that is “online enrollment.”
“Fail” because the website has not only been criticized for being difficult to use, but it also crashed.
Crashed – as in seriously crashed. How in this day and age does a high-profile site like this crash?!
Not crash like a little blip for a minute or two, but a complete crash for hours on end. And it could have easily been avoided. We’ll tell you how.
What we’ve been told per several online reports is that the data center hosting HealthCare.gov, Verizon Terremark, experienced an “outage.” We can only assume this “outage” is a glaring screw-up, and the likely product of naïve IT operators believing that because they’re in a data center with redundancy, they’re protected from a crash. How wrong they are!
As a website operator, it’s elementary DR planning to assess the cost of downtime vs the cost of a failover solution, and to put into place automatic measures designed to mitigate any problems. Putting all your eggs in one data center basket has just been demonstrated to the whole country as a very bad idea.
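To make the downtime-versus-failover comparison concrete, here is a minimal sketch in Python. Every number in it is hypothetical, and the function names are ours for illustration; plug in your own outage history and revenue figures.

```python
def annual_downtime_cost(outage_hours_per_year: float, loss_per_hour: float) -> float:
    """Expected yearly loss from outages, in the same currency as loss_per_hour."""
    return outage_hours_per_year * loss_per_hour


def failover_pays_off(outage_hours_per_year: float,
                      loss_per_hour: float,
                      failover_cost_per_year: float) -> bool:
    """True when the expected downtime loss exceeds the yearly failover bill."""
    expected_loss = annual_downtime_cost(outage_hours_per_year, loss_per_hour)
    return expected_loss > failover_cost_per_year


if __name__ == "__main__":
    # Hypothetical inputs: 8 hours of outage a year, $50,000 lost per hour,
    # and a failover solution costing $120,000 a year.
    loss = annual_downtime_cost(8.0, 50_000.0)
    print(f"Expected downtime loss: ${loss:,.0f}")
    print("Failover worth it:", failover_pays_off(8.0, 50_000.0, 120_000.0))
```

This ignores softer costs such as reputation damage, which in a high-profile case can outweigh the direct loss.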
If it sounds like we’re being harsh in our critique, it’s because this is our area of expertise: We help companies avoid painfully stupid events like this. How often do you see high-profile websites like American Airlines or Walgreens go dark?
In this day and age, Disaster Recovery 101 always includes a failover system that kicks in the moment there’s an issue, so that the user’s experience is uninterrupted and business continuity is preserved. (And the media doesn’t have a field day.)
The thing that really gets us about this hot mess is that it’s not even that expensive to accommodate failover capabilities for the HealthCare.gov infrastructure – the web server is simply not that big.
While we don’t anticipate winning the business of HealthCare.gov for a DR solution, we could certainly handle it. Here’s what we’d do to make sure millions of Americans actually connected to the site when they discovered that their healthcare had been cancelled:
1) Set up an active-active failover scenario, which is exactly what mega sites like airlines or valuable ecommerce sites do.
2) Set up a load balancing scenario between multiple data centers so if one of your data centers has a problem, your site can be served from your redundant data center.
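The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular load balancer works: the `DataCenter` class, the site names and the `route` function are all hypothetical, and the health flag would in practice come from real monitoring.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str             # hypothetical site label
    healthy: bool = True  # in real life, set by a health check


def route(request_id: int, sites: list[DataCenter]) -> str:
    """Spread requests across every healthy site (active-active).

    If a site is marked unhealthy, its share of traffic falls to the
    remaining sites automatically.
    """
    live = [site for site in sites if site.healthy]
    if not live:
        raise RuntimeError("no healthy data center available")
    return live[request_id % len(live)].name


if __name__ == "__main__":
    sites = [DataCenter("dc-east"), DataCenter("dc-west")]
    print([route(i, sites) for i in range(4)])  # traffic alternates between sites
    sites[0].healthy = False                    # simulate an outage at one site
    print([route(i, sites) for i in range(4)])  # everything goes to the survivor
```

The point of the design is that nobody has to flip a switch: the moment one site is flagged unhealthy, the other one is already serving.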
Bottom line: With appropriate disaster recovery pre-planning that included a failover solution, HealthCare.gov would never have been affected by the failure of a single data center, and we could all get on with the news about what should be real-world issues.
DALLAS, TX – Global Data Vault today announced the expansion of its data center locations to include ViaWest’s state-of-the-art Lone Mountain facility in Las Vegas, Nevada.
In the world of cloud storage and disaster recovery, protecting data in the most technologically secure of environments is mission-critical. Global Data Vault sought a partner who could provide a high level of redundancy, a direct fiber link to Dallas, and a location with extremely low risk of natural disasters.
They found that in ViaWest, with their highly secure, fault-tolerant Lone Mountain data center. This North Las Vegas data center met the demands of Global Data Vault with the industry’s first ever Tier IV design certification from the Uptime Institute, the leading education and standards authority. That prestigious certification recognizes the Lone Mountain data center as the premier colocation facility in North America.
The biggest challenge facing data centers today is the growing amount of data that large enterprises create, maintain and store in the cloud. A 2010 Gartner study showed that 47% of respondents ranked data growth in their top three challenges, 62% said they were expanding hardware capacity at existing data centers and 30% planned to build entirely new data centers. And that was almost 3 years ago! Imagine the exponential growth of data today as we continue to generate even more massive storage requirements.
As your data expands, moving it from your location to the cloud becomes a parallel problem. Having a faster connection is of paramount concern as your data can literally choke on an inadequate connection. Even having a 100 megabit fiber connection can improve your data protection capabilities significantly, not to mention speed up other business processes.
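For a rough feel of why the connection matters, this small Python sketch estimates how long a backup takes to move over a given link. It assumes the link runs at its full rated speed with no protocol overhead, which real connections never quite manage, and the data sizes are hypothetical.

```python
def transfer_hours(data_gigabytes: float, link_megabits_per_second: float) -> float:
    """Best-case hours needed to move the data over the link."""
    if link_megabits_per_second <= 0:
        raise ValueError("link speed must be positive")
    megabits = data_gigabytes * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / link_megabits_per_second
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 1,000 GB backup set over a 10 megabit and a 100 megabit link.
    for speed in (10, 100):
        print(f"{speed} Mbps: {transfer_hours(1000, speed):.1f} hours")
```

Under these assumptions the 100 megabit link moves the same data in a tenth of the time, which is the difference between a backup that finishes overnight and one that does not.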
A fact: More and more companies are putting more and more data into the cloud. Our appetite for data is growing exponentially and that leads to more dependency on cloud service providers. However safe they may seem, it’s short-sighted to presume that your cloud provider is disaster-proof as well.
The Cloud has become a nearly mainstream way for companies to manage their data and utilize effective data backup and recovery systems. But the cloud is only as good as the data center that it’s housed in, and as we’ve discussed previously, there’s plenty of variation among data center designs that can impact the security and efficiency of your cloud backup or hosting continuity.
A chassis is a pretty big deal around here but not the kind that keeps the body, suspension and wheels of your car all attached together. In our world, a chassis is the enclosure that handles all the non-computing tasks required to support multiple servers, including providing power, cooling, connectivity and manageability to each blade server that it’s holding.
“Blades” are redundant self-contained servers that fit into a chassis with other blades. Each chassis holds 8 to 16 blades – so that’s 16 to 32 processors and up to 96 cores per blade. Each blade supports up to 48GB of RAM or up to 768GB per chassis.
Our latest data center upgrade involves multiple redundant chassis. Here is a typical HP Blade Chassis – this one with 16 blades:
Blades are pretty cool servers. Their major selling point is that they afford nearly 100% uptime. They can tackle any task you’d like them to:
- Database and application hosting
- Virtual server hosting platforms
- File sharing
- Remote desktops and workstations
- Web page serving and caching
- Streaming audio and video content and more.
If your system needs more power, you just add another blade server to your chassis. They provide reliability through resilience and quality.
Servers can fail; we all know that. But when using VMware, you can cluster 8 to 16 servers together in a single chassis. When a server fails, it has almost negligible impact because the workload moves from, for example, all 8 servers to the remaining 7. None of the servers are running at full capacity, so if one fails, it’s no problem.
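Here is a quick Python sketch of that arithmetic. It assumes the workload is spread evenly and simply redistributes across the surviving servers; the 60% starting load is a hypothetical figure, and the function name is ours.

```python
def utilization_after_failures(servers: int, utilization: float, failed: int = 1) -> float:
    """Per-server load once the work of the failed servers moves to the survivors.

    utilization is the average load per server before the failure (0.60 = 60%).
    """
    survivors = servers - failed
    if survivors <= 0:
        raise ValueError("at least one server must survive")
    total_work = servers * utilization
    return total_work / survivors


if __name__ == "__main__":
    # Hypothetical cluster: 8 blades, each running at 60% before one fails.
    after = utilization_after_failures(8, 0.60)
    print(f"Load per surviving blade: {after:.1%}")
    print("Cluster still has headroom:", after <= 1.0)
```

As long as the result stays under 100%, the cluster absorbs the failure. The fuller you run each blade, the fewer failures it can take.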
The chassis itself has a high degree of resiliency to it. It has 4 power supplies (rather than the usual two of a stand-alone server). These power supplies are basically mini-transformers with their own fans. If one fails, as they sometimes do, it can be replaced within 24 hours without impact. And that’s a good thing.
All disk arrays (storage systems linking multiple hard drives into one large drive) have at least 2 power supplies but could run off of one. Having two power supplies provides protection from 2 kinds of problems:
1) If one power supply stops working, the other takes over and handles the power for both.
2) By having separate power paths, one side of the device is plugged into one path and the other side into a separate path. If your data center is designed correctly, you are plugging into distinctly different power feeds. So if your electric company has a power failure, or a transformer blows up, or a major wire gets cut, the data center will stay lit. Even if one side of the power in the data center goes down, the power to your servers stays on and you don’t even go to diesel. However, some data centers (not ours) do not have 2 power station feeds. With two feeds, you have multiple concurrent power paths into your space.
There’s a saying, “Murphy’s Law: What can go wrong, will. Bell’s Law: Murphy was an optimist.”
Yeah, we all feel that way on certain days when your planets are aligned for great misfortune, but it’s also an apropos saying for all things regarding your computer systems and even your data center maintainability. Things are going to go wrong. Sooner or later, they will.
Change happens. New equipment, increased power requirements, cooling demands, changes in safety and security regulations, consolidations, expansions. All of these change events can trigger a failure. They demand that you have flexible maintainability, because with each change event, there’s a potential for misfortune.
The good news is that you can mitigate your risk for a Bell’s Law type of SNAFU by taking precautions. All experts recommend a few steps you can take to avoid downtime, as well as choosing a data center partner that fits your budget and needs.
Number one on your to-do list is to avoid densely packing racks with energy hogs. Next, you should be trading space for density, as energy costs are 4, sometimes 5, times the cost of space. (Aim for 4 kilowatts per rack.) But after you’ve done your part to ensure the best continuity for your own servers, what do you know about the data center that you choose to put your servers in? How do you reduce your chances of performance interruptions? Choosing a data center with a level of uptime consistent with your needs is a start.
The aptly named “Uptime Institute” (a non-profit organization) is an unbiased, third-party data center research, education, and consulting organization focused on improving data center performance and efficiency through collaboration and innovation. Members of the Uptime Institute are corporations with heavy data center utilizations. These corporate members share best practices in relation to high performance data centers, and through their discussions, they’ve identified four “tiers” of fault tolerance (the Bell’s Law thing again), where 1 is the lowest and 4 is the highest or best in regards to data center uptime. Below is their definition of Tiers 1 – 4 and their typical outage time on an annual basis, as well as their basic design criteria:
- Tier 1: One path to power and coolant. Does not have redundant components (such as spare air conditioning units).
- Tier 2: One path to power and coolant, but has redundant components for both.
- Tier 3: Multiple power and coolant distribution paths, but only one active path. If the active path fails, the data center can switch over to the redundant path.
- Tier 4: Typical annual outage of 24 minutes (this equates to essentially 99.995% availability). The data center has redundant paths, but also adds “fault tolerance,” which means that if one path fails, the other automatically takes over, including everything from the electrical power distribution system to the uninterruptible power supply (UPS), back-up diesel generation, etc.
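If you want to translate an availability percentage into minutes of outage per year yourself, the arithmetic is simple. This Python sketch assumes a 365-day year; the function name is ours. Note that 99.995% works out to a little over 26 minutes, in the same ballpark as the figure quoted for Tier 4.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Minutes of outage per year implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    print(f"{annual_downtime_minutes(99.995):.1f} minutes per year")
```

The same function works for any uptime figure a data center quotes you, which makes it easy to compare offers on equal terms.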
Bell’s Law happens. Do your part to ensure you aren’t set up for colossal failure, and choose a data center partner that fits your needs and budget. Knowing your parameters makes it much easier to plan for disasters and to build the contingency plans for recovery.