Below is a lightly edited transcription of our recent webinar:
Disaster recovery really isn’t fun, right? It’s not something that anybody ever wants to have to go through, but unfortunately, a lot of businesses do. We’ve had discussions before about what to do to prepare for it, and we’re going to touch on those again.
It’s very important to do a business impact analysis. Basically, it allows you to go through and look at your systems, your workflows, your applications, and whatever drives your business and allows it to operate, and then rate each aspect by the impact that it would have if there was an outage. If your services are unavailable to your customers or your employees, how does that look financially? Does it drive your customers and employees to competitors? Does it impact the reputation or brand of your business? Does it hurt your brand and your image for your employees? Do they feel uncertain about the ability of their employer to maintain a reliable workplace? There’s a lot to consider when you talk about this. The business impact analysis is one of those time-consuming processes that will be well worth it in the end. It should be part of every business continuity and disaster recovery plan.
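To make the rating step a bit more concrete, here is a minimal sketch of how a business impact analysis could be scored. Everything in it is an assumption for illustration: the system names, the dollar figures, and the formula itself (hourly loss times expected outage hours, scaled by a reputation weight) are hypothetical and were not part of the webinar.

```python
from dataclasses import dataclass


@dataclass
class SystemImpact:
    name: str
    hourly_loss_usd: float        # estimated revenue lost per hour of outage
    expected_outage_hours: float  # how long an outage would plausibly last
    reputation_weight: float      # 1.0 = no brand impact, higher = worse


def impact_score(s: SystemImpact) -> float:
    """Hypothetical score: direct financial loss scaled by a reputation weight."""
    return s.hourly_loss_usd * s.expected_outage_hours * s.reputation_weight


def rank_by_impact(systems):
    """Return systems ordered from highest to lowest impact score."""
    return sorted(systems, key=impact_score, reverse=True)


# Made-up inventory, for illustration only.
inventory = [
    SystemImpact("payroll", 2_000, 8, 1.5),
    SystemImpact("email", 500, 8, 1.2),
    SystemImpact("customer-portal", 5_000, 4, 2.0),
]

for s in rank_by_impact(inventory):
    print(f"{s.name}: {impact_score(s):,.0f}")
```

The point of a ranking like this is simply to decide which systems get recovered and tested first; a real analysis would use your own figures and criteria.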
The one thing that you do need to have is a BCDR, or “business continuity and disaster recovery,” plan. In addition to that, you need to have a runbook or a playbook on what to do during a BCDR event. You’ll also need to know who your cyber insurance provider is and how to contact them, because a cyber insurer is like any other insurance company: they want to come out and assess the damage. For example, if you have a car wreck, the first thing your insurance company is going to want you to do is to call them. If you get infected with malware, the first thing your insurance company is going to want you to do is to call them. They are normally going to put a freeze on your internal environment so that they can inspect the damage and find out how extensive it is and what has been affected.
While they’re doing that, you need to contact law enforcement to let them know that you had a ransomware event, malware event, or something malicious in your environment. You can report that to your local FBI field office, or you can go to CISA.gov to submit an online report. They also have a phone number that you can call to report any ransomware or malware event immediately.
Another thing that you need to do is maintain hardware and software best practices. What that means is that you need to make sure your software is patched and on a current version, and that you reboot your servers. If any hardware upgrades are due, make sure that you have capacity, which is very important for your servers.
One common thing that we experience when we bring customers up is that they need a number of Windows updates. We have to make sure that the servers are patched and rebooted. Just like when you reboot your computer locally, we have to wait for the Windows Update process to finish before the server comes online. That’s something we can’t interrupt, and it takes time. Making sure your software’s up to date, you’re on the correct version, you patch your servers, you reboot them… those housekeeping tasks should be in order.
Another thing you’re going to need to do regularly is test your backups. If you don’t test, you may come into a DR event and find out that a server does not work, that we can’t get the application working on another server, or that it needs a public IP address. You don’t want to run into these issues during a DR event.
That’s why it is important to test the server: spin it up just as if it were a real DR event, and make sure that you also test all of your applications from the server level and from a workstation. When we perform a DR test for our customers, I always advise them to test it exactly as if they were connecting to their own VPN. The only difference is that you’re going to be connecting to our VPN or through your firewall that you put in one of our data centers. It’s important to connect that way and test everything just as if you were running it in production.
Yeah, that’s a really good point. And you know, that slide also talked about the availability gap, which is about expectations versus reality. It’s basically what you think the recoverability of your systems or applications is versus what it really is. How quickly can you get things up? How quickly can your services be delivered and returned to operations versus the reality of that? We hope to get that as close to zero downtime as possible, but we find that many times customers are at risk because they have an availability gap that they didn’t realize existed. And like Steven mentioned, testing all of that can help you understand that gap a bit better, and you have an opportunity to tweak anything that you need before an actual disaster occurs.
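As a rough illustration of the availability gap, the sketch below subtracts the recovery time you expect from the recovery time a DR test actually measured. The function name and the sample numbers are assumptions made up for this example; they are not figures from the webinar.

```python
from datetime import timedelta


def availability_gap(expected: timedelta, measured: timedelta) -> timedelta:
    """Gap between expected recovery time and what a test actually showed.

    A positive result means recovery took longer than expected.
    """
    return measured - expected


# Hypothetical numbers: the plan assumes 1 hour, the DR test took 5.5 hours.
expected_recovery = timedelta(hours=1)
measured_in_test = timedelta(hours=5, minutes=30)

gap = availability_gap(expected_recovery, measured_in_test)
print(f"Availability gap: {gap}")
```

You only get the measured number by running a test, which is why testing is what exposes the gap.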
Now we’re going to get into some of the stories here, and unfortunately, all of these are true. There’s nothing made up here, nothing is embellished, but we have left the company names off for confidentiality.
These are actual calls that we’ve taken for customers that have been hit by some type of disaster, whether it’s malware, a natural disaster, or another non-cyber type of disaster.
Conti Ransomware Recovery
Example 1: This particular company is more than a hundred years old, they have 250 employees, and they were infected with Conti ransomware. Our on-call service was alerted at five o’clock in the morning as the IT manager was en route to the office. The customer had some weird things going on. They said that people weren’t able to get on the internet. They started seeing some funny things on their systems. Somebody tried to log in remotely and wasn’t able to, and that’s when they knew something was really wrong. Steven, can you walk us through what happened from the time that we got the call until we were able to get them up and running?
They weren’t sure exactly what the threat was yet. So as soon as they called us, we went ahead and engaged our spin up process, and immediately started using their “EDP” or enhanced data protection to fast clone the data back into their repository.
The first thing that we do whenever you have a DR event is disable your tenant. The reason is that we want to stop any activity that is happening on the cloud side. Disabling the tenant doesn’t undo deletions that have already happened, but it stops any additional deletions. Keep in mind that this process happens within seconds.
By the time the customer called us for help, and we checked and disabled their tenant and also checked their repository, the data had already been deleted by the ransomware. So at that point, we went ahead and initiated our DR spin-up process. We fast-cloned their servers back into the repository. And then once we did that, we did another fast clone to go ahead and begin the process of getting them up. So normally, from the time that you call in until the time that we start spinning up, it is all automated. It’s all done in the backend by proprietary technology to get this done as quickly as possible. They were really distressed because they couldn’t issue payroll and the employees were to be paid the next day.
We spun them up and got them connected via the VPN. Then it was determined that they needed external IP addresses to connect to an internal server to issue the payroll. Once we knew that, we set up all the NAT and firewall rules, because we don’t open everything. We only open what you tell us. Whenever you need an application that requires an external IP address, we will ask you what ports are required. This scenario only required HTTP; HTTPS wasn’t an issue. We opened those ports. We verified that they could access the server, but we had a very tight window during this recovery. We had to have payroll done by noon. We had a three-hour window to recover this site, get it up, get it functioning, and get the domain controllers communicating with everything. We did make the window and the employees got paid! The takeaway is that it’s extremely important that you communicate any needs, such as an external IP, prior to the DR event, as those can add steps that delay the recovery process.
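One way to capture those needs ahead of time is a simple documented list of which applications need which inbound ports. The sketch below is only an illustration of the “we only open what you tell us” idea; the application names, the port numbers, and both helper functions are hypothetical and do not describe the provider’s actual tooling.

```python
# Hypothetical pre-DR checklist: which inbound ports each application needs.
# Application names and port numbers are illustrative assumptions.
documented_ports = {
    "payroll-web": {80},
}


def ports_to_open(app: str, documented: dict) -> set:
    """Return only the ports the customer documented; nothing else is opened."""
    return set(documented.get(app, set()))


def undocumented_apps(apps_needed: list, documented: dict) -> list:
    """Apps needed during recovery that have no documented port requirements."""
    return [app for app in apps_needed if app not in documented]


print(ports_to_open("payroll-web", documented_ports))
print(undocumented_apps(["payroll-web", "time-clock"], documented_ports))
```

Running the second check during a DR test, rather than during a real event, is what keeps a missing requirement from eating into the recovery window.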
There were some interesting challenges with this one, too, right? Steven and I were on the calls. We set up a bridge line and we were taking shifts, but at one point we were on the call together. We were talking to the point of contact for the company and he was in a conference room. You could hear other people talking in the background and he kept saying, “hang on a second. The cybersecurity people, the lawyers or the investigators are here and they need to talk to me.” So he’d have to go to talk with various people around the office, either as employees or the cybersecurity folks or whoever they were. The lesson here is that it’s important to make sure that you’ve got the appropriate people — not just one person — lined up to have those conversations. If you are the director, the VP or the CIO or whoever it is, make sure that you have somebody designated to talk to us and somebody to handle your business. When you start confusing those lines — and this really goes for anything with business continuity and cyber response — make sure that you have your activities delegated and lined out appropriately so that everybody knows who’s doing what and everything can be handled efficiently.
Another curve ball with this recovery was that this customer also had some rather large virtual hard drives that were outside of normal best practice. We were able to recover them, but it gave us a challenge, as it was time-consuming to get that data back down to the customer. In a typical situation, if you fail over to GDV and then spin up, everything happens as it should, but we still have to get all the data that changes back down to your systems after you get them rebuilt or get new systems in place. In this scenario, it took a really long time for us to get that data back to them because they had quite large hard drives that were likely outside of best practice. The lesson here is that as you’re designing your environment, make sure that you’re taking that into consideration. How are those things set up and shared? Ultimately, you may have to recover them.
Imagine you are in charge of IT, and a phone call wakes you up in the middle of the night. It’s not ransomware. It’s not a hurricane, it’s that your building is literally burning down. That’s a pretty harsh call. That would be a tough call to take. I mean, just imagine if you tried to go to the place where you worked, if you’re one of the employees who wasn’t on the call list, or maybe you hadn’t heard the news and you go to your place of work and there are just fire trucks everywhere. And the building is in ashes. That would be pretty crazy to deal with, especially knowing that 40% of businesses do not reopen after a fire disaster. That’s true even more so for smaller businesses that may only have one location, fewer employees, and businesses that don’t have the sustainability of some larger organizations.
In this particular case study, it is the story of a Midwestern manufacturing company which was well-established. Their manufacturing was in this building that caught fire, their data center was in this building where they had the fire and it all was destroyed. We don’t know the exact details of how it started, but they called us with basically nowhere to go, the servers were inaccessible. They may have been melted.
This customer’s worst nightmare came true that night. They got that phone call at three o’clock in the morning. One of their machines had an electrical issue, it started a fire that burned an entire building down, including all server computers. Nothing was left. The customer called us at about 6:00 AM, letting us know that it destroyed their entire building, including their internal infrastructure. Our support team went into action for what they are trained to do every day. They spun the customer up again with our automated scripts that we run in the background. We had this customer up within SLA and they never closed their doors. I mean, how impressive is that?
Whenever you have a disaster of this nature, you have to understand that your equipment is gone: your switches, your SANs, your servers. None of it exists anymore. So what we do on the BCDR team is spin it up in our infrastructure and provide you with VPN access. Your customers then connect via the VPN into our infrastructure, and it’s basically business as usual.
One thing that you have to remember is that after the fire, after you locate a new building, and after you get new equipment, we work with you, or in this case the manufacturing business, to get the data failed back into the new infrastructure. Those hundred employees still have a job today because we were able to get them up and running. We were there to help this customer so that there was no impact to their business, and they continue to operate today!
This is something that businesses struggle with sometimes: “How do I guarantee that I can keep my services running without having to maintain a second location?” When you think of old-school DR, you’d have to have a second location with servers or very similar equipment. You’d have to take backup tapes to that location and be prepared to do restores that way (which could take forever, assuming that the tapes actually worked), and you’d have to maintain the cost of all that, plus keep it updated and patched and everything else. Fast forward to the present day, and the disaster recovery as a service and cloud solution that we offer is a great alternative, because it just folds into the OPEX costs of your business. You don’t have to worry about anything from a hardware and software perspective at that point.
Something Lurking in the Shadows
The last example we have is of unmitigated issues, something hiding in the shadows. This is where patching comes into play: having some type of patch status, patch scanning, and patch management. From both an operating system and a cybersecurity level, something should be monitoring and checking for vulnerabilities.
This case study features a customer that was a residential home builder. The company was infected through a zero-day Exchange exploit, which was a Microsoft issue. A flaw in Exchange allowed bad actors to compromise their systems. I think it came out in March and the customer remediated it, or so they thought; they really hadn’t. This hole existed for six months before it came to light in the form of ransomware.
They began work in the morning and stuff was going wrong on the systems. We see this a lot with ransomware. The attackers go in and try to disable the system and keep you from doing anything, right? They’ll encrypt your backups if they can. If they can’t, they’ll delete them. They will change passwords wherever they can. They’ll start locking down your DNS or DHCP. They’ll block access to firewalls. If you have the right credentials, you can do anything in an environment with PowerShell. This customer, again, wasn’t able to get on the internet. They couldn’t log into their Hyper-V servers. That’s when they started making phone calls. And so again, it was early in the morning. We got a call, “Hey, we think something’s going on. We’ve been hit.” And Steven and the team disabled the tenant.
They called us between five and seven in the morning, letting us know they had been attacked. They didn’t know what they were hit by, so we went ahead and started the same process. We kicked off our proprietary script in the background, and we disabled the tenant. We run that process to automatically clone your data. And then once that’s done, it kicks off the automated build process, which brings your servers up into an instant recovery state, or what you might hear us call “IR.” Then we go ahead and put them on your own VPN that’s assigned to you. You’re the only customer on that. And then you access the servers.
When we bring these servers up, it’s important to scan them, as these attacks can lie dormant and there may be more. I don’t want to go into the weeds on this too much, but we had one customer who did scan, and correct me if I’m wrong, Kelly, but they did find additional threats on their system by running a virus scan to detect additional malware. They were cleaning stuff off in our environment.
You have to try to remediate the virus after we bring the servers online, and then make sure that there’s nothing dirty on the systems before we provide internet access, because as soon as we flip on the internet, that’s a trigger. These things normally go out and ping servers to verify you have internet. As soon as you have internet, as soon as it pings a server and gets a reply, your DR environment just got encrypted. The good thing is that it’s fairly easy for the business continuity and disaster recovery team to just kick off that script again, reclone your data, and not turn the internet on.
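The “scan first, internet second” rule can be expressed as a simple gate. This is a minimal sketch under stated assumptions: the function name, the server names, and the idea of recording scan results as True, False, or None are all hypothetical and are not the provider’s actual process.

```python
from typing import Dict, Optional


def may_enable_internet(scan_results: Dict[str, Optional[bool]]) -> bool:
    """Return True only if every recovered server was scanned and found clean.

    scan_results maps a server name to True (scanned, clean),
    False (scanned, threats found), or None (not scanned yet).
    """
    if not scan_results:
        return False  # nothing scanned: keep the internet off
    return all(result is True for result in scan_results.values())


# Hypothetical scan results for three recovered servers.
results = {"dc01": True, "file01": True, "app01": None}
print(may_enable_internet(results))  # app01 has not been scanned yet
```

The design choice is to fail closed: an unscanned server counts the same as an infected one, so internet access stays off until every system has a clean result.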
This customer didn’t actually remove everything from their system, so the malware detected the internet access and triggered again. Sometimes this happens on certain dates, or certain events can trigger it. They were losing millions of dollars every day, so we had to get this customer restored as quickly as possible to stop that number from climbing. Our goal is always to get you up as quickly as we possibly can to stop that ticker from counting up.
Unfortunately, this customer faced a few more challenges than we expected. We had them up and running per SLA, but there were issues with the versioning, and they had three antiviruses running on their systems, which is generally not good. More is not better. This goes back to how to have a better DR experience. You need an antivirus, so use the best antivirus you can, but don’t use multiple ones. Use something with active script blocking and other things like that. Use a SIEM to help with that. Every time we’d spin the servers up, all three antiviruses would sit there and fight with each other, trying to scan because they knew something was wrong.
It took a while for us to be able to get into the systems and make them stop. Once we were able to get past that hurdle, we had internet bandwidth issues, so the customer had some people moving servers around trying to improve that. Meanwhile, they were trying to rebuild their systems on a completely new operating system, and kind of do an upgrade at the same time. It wasn’t the best scenario, but we still delivered the product.
I mentioned patch management previously. Make sure you’ve got your patching processes in place, make sure that you are keeping track of your zero-day vulnerabilities, your critical vulnerabilities, and patching and rebooting your servers as quickly as you can. Hackers know they’re there, and they’re going to use them. As Steven said, they can drop time bombs in your system to where they get in the first week or two, but if it takes you a month to patch those, they may have injected a time bomb, a malware package sitting on your server for 3, 4, 6 months until one day it goes off. And by that point, everything has quietly made its way around your network. We see that all the time.
You may have heard of AI-powered detection, where an antivirus or other prevention tool goes out and says, “Okay, this is Steven’s activity, and I have learned what Steven’s typical activity looks like. His behavior on a day-to-day basis is known to me, and I’m not going to stop it.” So this malware will sit there, and by design it’ll start looking like a person, and it’ll just start creeping out to where everything that’s happening becomes normal. So when it does its damage and deploys itself, nothing has stopped it because it’s seen as normal activity. Again, I’m not making this up. This didn’t necessarily happen to this customer, but it is a real-world scenario, and you need good cyber protection and something like enhanced data protection to protect you. Hopefully, we haven’t scared you too much. These are, again, real-world scenarios that we deal with on a weekly basis to help customers out.
After the DR event, what do we do? You have to take stock. And when we say after the DR event, there are two parts to that. There’s the failover immediately after you discover that a disaster has happened, and then there’s after the entire event, when everything is back to running the way it was. What we typically see is that customers have their backup server joined to a domain. These malware and ransomware scripts and agents will grab domain admin credentials out of somebody’s browser. As a domain administrator or a systems administrator, you may log into a system on your network with the domain admin credentials. You may then use that browser to go out onto the web and go to some site that has malicious code on it. And suddenly, your domain admin credentials get scraped. That script now knows where you came from, can reverse and follow you back through your IP address, go into your systems, and start doing whatever it wants to with those credentials. Knowing that it will try to encrypt your backups, Veeam already encrypts your backups by default. Since the malware can’t encrypt the backups, it just deletes them. Your backup server was joined to a Windows domain and the malware has domain admin credentials, so it can do that. It can access your cloud backups and delete those as well. So you have to start assessing the damage: How do I recover? What exactly has been affected? Is it all my systems? Is it servers or workstations? Is it multiple locations? Like Steven said a little bit ago, this happens very, very quickly. When you hear about people ripping plugs out of the wall, that’s almost what you have to do to stop this from spreading. And by that point, it’s probably too late anyway. So Steven, once we’ve failed over, can you talk us through the fail back and what that looks like?
First of all, you’re going to need to make sure that you have a clean environment to fail back into, because the one thing that you do not want to happen is that you schedule the downtime, you work with our team to get you failed back, and then you find out that a workstation was left up and still infected, and it just reinfected your entire environment.
Whenever you spin your servers up here, we back them up on one of our DR backup servers, and then we make you the cloud tenant and send the backups back down to you. You’ll need to schedule time, a fail-back window, to fail back into your production environment. Normally that’s 5:00 PM Friday to 8:00 AM Monday. We’ll do a final backup here, we’ll sync those changes back down to you, and then you will restore your VMs on Monday morning. Once we schedule this fail-back window, we normally don’t flip the switch back on to turn your site back on, because then we would have to reschedule it.
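For planning purposes, the Friday 5:00 PM to Monday 8:00 AM window mentioned above can be computed from any starting date. This is a minimal sketch; the function name is hypothetical, and it assumes local time with no time zone or holiday handling.

```python
from datetime import datetime, timedelta


def next_failback_window(now: datetime):
    """Return (start, end) for the next Friday 5:00 PM to Monday 8:00 AM window."""
    days_until_friday = (4 - now.weekday()) % 7  # Monday=0 ... Friday=4
    start = (now + timedelta(days=days_until_friday)).replace(
        hour=17, minute=0, second=0, microsecond=0
    )
    if start <= now:
        # It is already past 5:00 PM on Friday, so use next week's window.
        start += timedelta(days=7)
    end = (start + timedelta(days=3)).replace(hour=8)  # the following Monday
    return start, end


# Example: a Wednesday morning (January 3, 2024).
start, end = next_failback_window(datetime(2024, 1, 3, 10, 0))
print(start, "->", end, f"({(end - start).total_seconds() / 3600:.0f} hours)")
```

Knowing the window length up front helps you judge whether the final sync and the VM restores can realistically finish before Monday morning.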
To plan for this, you’ll need to make sure you have a clean environment, and nothing will reinfect the servers we have on our side. Then we’re going to need to plan the fail back into your infrastructure. And one of the most important things that you need to do is to make sure that you have licensing for your new equipment or licensing for your existing equipment so that we have the ability to fail you back. Make sure everything comes up like normal, and everything’s working. After we fail you back, we’re going to make sure that your servers are functioning as they’re supposed to. And then, at that point, we’ll go ahead and start the backups again.
It’s important to think outside the box and on a broad scale, because if you think of all your employees and all your devices, you could rebuild everything that you have in your building, right? If you have one building, everything’s there, and you rebuild your servers, your desktops, and the other things that are there. Then here comes Trista with her laptop. She hasn’t been to the office in a couple of weeks. She works remotely, or it’s the pandemic or whatever, and she brings this laptop that’s still infected because it hasn’t been updated. It doesn’t have new antivirus on it. It hasn’t been scanned or cleaned, and it hasn’t been rebuilt. She drops that thing back on your network. And guess what happens? Steven gets a call at five o’clock in the morning, again, saying we need to spin everything back up in DR because everything got reinfected. Make sure that you’re thinking about everything and anything that can potentially touch your environment.
The last thing is the after-action report. This is mainly for customers, end users, and the folks they work with. The Dataprise cybersecurity services that we offer do include incident response, which helps with assessments after the fact. Basically, you go through and identify where you think the situation started. What was the ingress point that was affected first? How was it infected? Was it a zero-day vulnerability? Was it some other missing patch? You may have an idea about some of those things by this point, and then you create the assessment. A lot of times, law enforcement or your cybersecurity insurance people will want that. You may need to file it for insurance reasons. You may need it for audit, et cetera. So if you need help with those types of things, we can certainly help you with that. But it’s important to make sure you do an assessment, a kind of formal lessons learned, to make sure that those things aren’t allowed to happen again and you have better control over your environment.
Why BaaS when you can DRaaS?
They’re very similar technologies. You’re already sending data to the cloud for backup as a service. Why would you stop at that if you don’t have the ability to spin those systems up? The spin-up part is key, as we’ve discussed here today, and it shouldn’t be overlooked. When you talk about backup as a service only, you’re limited to sending your backups. If you have an outage and have to restore, you must download all of that data first. You’ll have to pull it back down to some system, and then you have to restore it. Your outage window goes from an hour or a couple of hours into days or weeks, potentially. It depends on the internet. It depends on the systems that you have available, etc. As much as I love it, I’m going to let Steven talk about enhanced data protection. We have a little graphic over here that outlines what a perimeter from the backup server in our cloud area looks like.
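To put rough numbers on the download-first restore described above, the sketch below estimates how long it takes just to pull backup data down over an internet link. The data size and bandwidth are made-up examples, the function name is hypothetical, and the estimate ignores protocol overhead and the restore step itself, so real times would be longer.

```python
def download_hours(data_tb: float, bandwidth_mbps: float) -> float:
    """Hours to download data_tb terabytes (10**12 bytes each) at bandwidth_mbps megabits/s."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (bandwidth_mbps * 1e6)
    return seconds / 3600


# Hypothetical case: 10 TB of backups over a fully available 100 Mbps link.
hours = download_hours(10, 100)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")
```

With these example figures, the download alone takes about nine days, which is how an outage measured in hours turns into one measured in days or weeks.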
EDP has been a lifesaver for our customers and their businesses. EDP, or “enhanced data protection” as we call it, is a copy of your data that lives outside of your repository and that you do not have access to. The reason why you do not have access to it is that if you can get to it, malware, viruses, or ransomware can get to it. We keep it in a near-gap repository. Whenever you do your test, we always spin you up from your EDP data to verify functionality.
If you are a customer of GDV and aren’t sure if you have EDP, you can log into your quality portal or refer to your status emails to verify. If you do not have EDP, please reach out to your sales rep to get that added today.
The other thing to consider is what if nobody’s watching the monitors and the screens and the control gauges and all that? A lot of times this nefarious activity can be detected if you’re using the appropriate systems, if you’re using intelligent log gathering and log parsing, and you have the correct antivirus in place. You have the procedures and the team in place. And when we talk about, “as a service,” not everybody has the resources to staff cybersecurity experts. We have some very good experts and we do work 24-7. If you’ve got cybersecurity incidents happening in the middle of the night, we can be there to alert you for that. And that’s very important in the whole scheme of things, right?
We talked about how patching and regular checkups, both on the cybersecurity and the DRaaS level, are important and go hand in hand with testing. Steven talked about testing your backups and testing your DR spin-ups. You just want to give your environment a thorough checkup. Look through everything, look through your settings, and make sure that your current systems and technologies are up to scale with what’s going on in the world today. What got you here won’t get you there, right? What we design for and protect against today may be obsolete a year from now. Technology and the bad actors get more and more creative. We all constantly have to improve not only our cybersecurity practices but also our disaster recovery practices: make sure your business impact analysis is up to date, your call trees, et cetera. You just don’t ever want to get behind the eight ball on things like this and find out in the middle of a disaster, while your building’s burning down, that you don’t know who’s supposed to have called the disaster recovery company. Make sure you’re keeping your DRaaS provider updated on changes as well. Anything else you want to highlight here, Steven?
The only thing I can really highlight is the importance of security as a service and a 24-hour SOC. If you get something malicious on your servers, you want somebody there to monitor your systems via the SIEM programs to catch these bad actors, to catch these bad login attempts, and to call you immediately on your cell phone if they do detect something. Not only is business continuity and disaster recovery a good investment, but it also pairs well with security as a service. You want to stop them from getting into your systems in the first place, before they can do something malicious. I would reach out to your sales rep to see if you can get the Dataprise security team involved and see if we can’t get you a good solution if you don’t already have one in place.