Recent stories about a ransomware attack on UnitedHealth, and its extended impact on UnitedHealth's daily operations for nearly a month, generated conversation online and (likely) in many boardrooms about the real possibilities of harm from ransomware and the readiness of organizations to recover from such attacks. Such concerns are very appropriate. I suspect the directions leaders in those organizations are giving to prepare for such attacks are NOT wholly appropriate, because many IT and security experts within those organizations, and their leaders, likely lack a clear understanding of the design characteristics of their current systems and the operational impacts of those designs, especially in a scenario where all data has been purposely destroyed or made unavailable.
A Key Distinction
The key distinction to make in the realm of harmful software is the intent of the originator. Software termed malware is designed to harm its target in some way and generally includes no capability to "un-do" its damage. The creators' motivation might be spite, like a bunch of teenagers sitting in an online chatroom, or it could be an economic or political entity attempting to seriously harm an opponent. That harm could stem from possibly irreversibly scrambling data or using existing software for unintended purposes (like spinning centrifuges at twice their rated speed to destroy them and prevent refinement of uranium). If data is the only object damaged in a malware attack, the ability of the target to recover the data depends upon the competency of the malware creators. If they knew what they were doing and truly randomly scrambled or simply overwrote the data, there's no practical way to recover using the data in place. Backups are the only avenue of recovery.
Software termed ransomware isn't designed to permanently harm a target (though permanent harm CAN result). It is instead a tool used as part of a business model for extorting cash from victims. Creators of ransomware want to make money. It's impossible for them to make money if their ransomware mechanism proves to be significantly less than 100% reversible. If a target is attacked with a ransomware tool that has scrambled 50 other large companies and only five were able to recover using the creator's recovery process, few other firms will pay the ransom and the creator's business model collapses. If they continue anyway, their tool ceases being ransomware and has the same effect as any other malware.
Attack Architecture
Malware and ransomware ("hackware") are both very similar in one key area. Both adopt similar "architecture" in the layers of software used because both require the same appearance of an INSTANT attack to achieve their goal and avoid being disabled. Of course, if attacking a large corporation with 50,000 employees with 50,000 laptops and two data centers with 10,000 servers running internal and external e-commerce systems, it is IMPOSSIBLE to literally attack all 60,000 machines simultaneously. Most hackware is designed in layers that, for purposes of explanation, will be termed shim, full client and command & control.
The shim layer is the piece of software that exploits software the target is already running to make that software do something it wasn't INTENDED to do but is PERMITTED to do. Ideally, this additional action LOOKS like regular activity that "machine zero" might perform, so it avoids triggering alerts about an unexpected process running or an unexpected attempt to reach some other remote resource. Note that the software targeted by the shim is NOT necessarily the ultimate target of the hackware. That initial point of infection is only the weak link being exploited because the hackware creators learned how to corrupt it to do something else useful in their attack and the target company happens to run that software. In the SolarWinds attack of late 2020, data managed by SolarWinds within a target company was NOT the actual target of the attack. It was just a widely used piece of enterprise software with a vulnerability the hackers learned to exploit.
The exploit leveraged by the "shim" layer may not allow a large enough change in the software being corrupted to perform the real action to be invoked by the hackware. The shim may instead target OTHER installed software or install NEW software to actually implement the real bad action to be performed at the time of the eventual attack. That software is the real "client" of the attack. Since most PCs and servers run anti-virus software looking for unexpected binaries or new processes, the client layer of most hackware relies upon being able to masquerade as something already allowed or upon being able to interfere with those scanning processes and discard their alerts. The key concept to understand at this point in the narrative is that the time of initial infection (by the shim) or "full infection" (by the client) is NOT the time of the attack. The process of infecting dozens / hundreds / thousands of machines while evading security monitoring tools takes time. Not just hours or days. Weeks. Months. (This has huge cost impacts on mitigation strategies to be explained later.)
Since full infection can take an extended period yet the goal of the hackware is to appear to attack simultaneously, most large scale hackware attacks leverage an external "command and control" layer which performs multiple tasks. It tracks "pings" from each infected machine to trace the progress of the original "shim" infection or the "full client" infection. In many cases, the hackware creators aren't targeting a particular organization in advance; they learn whom they have infected via this telemetry and decide whether they want to ALLOW the attack. Since this telemetry can disclose public IP addresses of the infected machines, those addresses can help the hackware creators confirm the size of the target and decide how long to wait for additional infections before triggering the actual attack onset. For example, if a PING comes from IP 201.14.92.52 and that is part of a block operated by Joe's Bait & Tackle, the originators may just skip him. If the block is operated by Gitwell Regional Hospital, a 90-bed facility in Podunk, AR, they might wait for another 40 or 50 machines to PING before triggering the attack. If the block belongs to Ford Motor Company and only 4000 machines have PINGed in, they may wait until they see 50,000 to 60,000 before pulling the trigger.
The process of "pulling the trigger" is also designed in a way to avoid detection. Obviously, a firm whose security software sees 60,000 laptops all continuously polling some IP address in Russia is likely to detect that and get a heads up that trouble is looming. Instead, the "full client" running on each infected machine may be written to "poll" for instructions on a random interval over DAYS to get the final green light date and time of attack. Since most laptops and servers in Corporate America use NTP (Network Time Protocol) to sync onboard clocks down to the millisecond, once thousands of infected systems learn the attack date and time, they all just wait for that time to arrive and do not have to sync with the mother ship or each other to yield a simultaneous onset of the attack. Included with the green light and attack date/time will be a cryptographic key each client should use to generate an internal key to encrypt the data. If the attacker actually does honor any ransom payment, the command and control system will also signal the "paid" status the clients can use to reverse the encryption.
Ransomware Recovery
As mentioned before, options for recovery from a malware attack are slim. If the infection actually reached the onset phase, there will usually be no available method of recovering data. The creator had no such intent for recovery to be possible and the victim will likely lack the time and expertise required to reverse-engineer the malware, determine HOW it worked and whether any recovery is possible and code a fix. The only path forward is to identify all infected machines, quarantine them, then wipe and re-load each infected machine with clean software. If any infected machine remains on the network with clean machines, the infected machine can re-infect newly cleaned machines, getting you nowhere.
For ransomware, victims have to approach recovery with a split brain. On one hand, because it is ransomware, a short-term restoration may be POSSIBLE but only if the victim's leadership and legal counsel can agree upon a ransom amount and only if the attacker's recovery process actually works. If the victim is among the first victims of a new ransomware variant and the recovery process cannot be verified before paying the ransom, the victim may be taking a huge risk. Even if the recovery appears to work, the victim will STILL need to literally wipe and reload EVERY machine touched by the ransomware, whether it triggered on that machine or not. Once a machine has been compromised, it will require a complete reload. This process can still require the victim to incur multiple outages, extended maintenance windows, etc. as the production applications are migrated to new, wiped machines while other infected machines are systematically taken offline, wiped, reloaded and brought back online. And the victim will need to audit every BYTE of affected data to ensure no data was altered intentionally or inadvertently by the ransomware process.
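For readers wondering what "auditing every byte" can look like in practice, below is a minimal sketch, assuming the restored tables and a trusted pre-attack copy can both be exported as rows; the row layout and key position are hypothetical, and a real audit would also have to reconcile legitimate transactions that occurred after the trusted copy was taken.

```python
# Minimal sketch: compare row digests between a restored table and a trusted
# pre-attack copy to spot rows the ransomware (or the recovery) altered.
# Exporting both sides as iterables of tuples is an assumption for
# illustration, not a specific vendor workflow.
import hashlib

def row_digest(row):
    """Stable digest of one row; rows are tuples of primitive values."""
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def diff_tables(trusted_rows, restored_rows, key_index=0):
    """Return keys whose contents differ, plus keys missing on either side."""
    trusted = {r[key_index]: row_digest(r) for r in trusted_rows}
    restored = {r[key_index]: row_digest(r) for r in restored_rows}
    altered = [k for k in trusted.keys() & restored.keys() if trusted[k] != restored[k]]
    missing = sorted(trusted.keys() - restored.keys())
    unexpected = sorted(restored.keys() - trusted.keys())
    return altered, missing, unexpected

# Example with tiny in-memory "tables" keyed by order number:
trusted = [(1111124, "alice", "shipped"), (1111125, "bob", "billed")]
restored = [(1111124, "alice", "shipped"), (1111125, "bob", "BILLED???")]
print(diff_tables(trusted, restored))   # ([1111125], [], [])
```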
For victims, the other half of their split brain process requires proceeding as though the ransom WON'T be paid or WON'T WORK and they have to begin recovering from backups. At this point, the complexity factor and costs grow exponentially for the victim. No large corporation operates a single monolith application with a single database with contents reflecting the entire state of all customer / employee / vendor relationships at a single point in time. Functions are spread across DOZENS of systems with specific data elements acting as "keys" linking a record in SystemA to a record in SystemB which maps to a different table in SystemB with a key mapping SystemB to SystemC, etc. Each of these individual systems may house records for millions of customers over years of activity and may be terabytes in size.
For large systems, the databases housing their records support multiple approaches for "backing up" data in the event of hardware failures, mass deletion (by accident or malice) or corruption. A "full backup" is exactly what it sounds like. Making a copy of every database table, database index, database sequence, etc. involved with an application and moving that copy to some other storage. If the database is one terabyte in production, that full backup will also take up one terabyte. In most companies, a full backup is created monthly or weekly. An "incremental" backup uses the database's ability to identify all records / changes made AFTER an explicit point in time (a "checkpoint") and copy just those records to a separate set of files tagged with that checkpoint. Incremental backups are typically taken every week or every X days.
By performing FULL and INCREMENTAL backups, if data is completely lost in production, the newest FULL BACKUP can be restored first, then all INCREMENTAL backups performed AFTER that full backup can be restored atop the full backup to restore the system to a state as close to the present as possible. As an example, a firm making MONTHLY full backups and WEEKLY incremental backups should never lose more than one week of data if they have to restore a corrupted system. Narrowing that potential data loss window involves reducing the intervals between the full and incremental backups but doing that is not pain free or cost free. More frequent backups require more disk storage and more network capacity between the database servers and the SANs housing the storage. If backups are to be copied offsite for additional protection against corruption or acts of nature, the storage and network costs can easily double.
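As a concrete illustration of that restore ordering, here is a minimal sketch that picks the newest full backup from a catalog and then applies every incremental taken after it, oldest first. The catalog format is invented for illustration; real backup tools maintain their own catalogs and restore commands.

```python
# Minimal sketch: choose which backups to apply and in what order.
# The catalog format (list of dicts with 'kind' and 'taken_at') is an
# assumption for illustration.
from datetime import datetime

def restore_plan(catalog):
    """Return [newest full] + all incrementals taken after it, oldest first."""
    fulls = [b for b in catalog if b["kind"] == "full"]
    if not fulls:
        raise ValueError("no full backup available -- cannot restore")
    base = max(fulls, key=lambda b: b["taken_at"])
    increments = sorted((b for b in catalog
                         if b["kind"] == "incremental" and b["taken_at"] > base["taken_at"]),
                        key=lambda b: b["taken_at"])
    return [base] + increments

catalog = [
    {"kind": "full",        "taken_at": datetime(2024, 2, 1),  "label": "feb-full"},
    {"kind": "incremental", "taken_at": datetime(2024, 2, 8),  "label": "feb-wk1"},
    {"kind": "full",        "taken_at": datetime(2024, 3, 1),  "label": "mar-full"},
    {"kind": "incremental", "taken_at": datetime(2024, 3, 8),  "label": "mar-wk1"},
    {"kind": "incremental", "taken_at": datetime(2024, 3, 15), "label": "mar-wk2"},
]
print([b["label"] for b in restore_plan(catalog)])   # ['mar-full', 'mar-wk1', 'mar-wk2']
```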
The REAL complexity with recovery of lost databases lies in the synchronization of data across systems and across those backup images. To illustrate the problem, imagine a company with ten million customers that gains 80,000 customers per month and divides its online commerce, billing, inventory, shipping, agent support and customer support functions across five systems. For each customer, there will be
* customer account numbers
* order numbers
* serial numbers in inventory
* order numbers, shipping numbers and serial numbers in shipping
* trouble ticket numbers in the agent support function
* customer account / login / serial number information in the customer support function
With 80,000 new customers per month, one could imagine daily order activity reaching 2,667 (80,000 ÷ 30 days). If most of that activity is concentrated over 16 hours, that's roughly 167 orders per hour. (Multiply these numbers by 1000 if they aren't sufficiently scary.)
Even if the firm is creating backups for these systems religiously on a fixed schedule, there is no way to synchronize the backups to start and complete at EXACTLY the same time. One backup might finish in 20 minutes, another might take 2 hours. When each of those backups completes, they might all reflect data as of 3/14/2024 but their actual content might vary by 100 minutes worth of online orders, etc. If the company becomes a ransomware victim and these systems are restored from full / incremental backups, it is possible for a system that ASSIGNS a key such as "ordernumber" to be BEHIND other systems which reflect ordernumber values used prior to the ransomware corruption. For example,
BILLING: newest values prior to ransomware attack:
* 1111122
* 1111123
* 1111124
* 1111125 <--- the absolute most recent assigned ORDERNUMBER
SHIPPING: newest values seen prior to ransomware attack:
* 1111122
* 1111123
* 1111124
* 1111125 <--- in sync with billing (good)
If these systems are restored from backup and the newest combination of (full + incremental) backup for BILLING winds up two hours BEHIND the newest combination of (full + incremental) backup for SHIPPING, the victim could wind up in this state:
BILLING: newest values after restoration from backup:
* 1110788
* 1110789
* 1110790
* 1110791 <--- missing 334 values (1110792 thru 1111125) assigned before the attack -- danger
SHIPPING: newest values after restoration from backup:
* 1111122
* 1111123
* 1111124
* 1111125 <--- correct but "ahead" of BILLING by 334 values (not good)
With this content, if the BILLING system isn't "advanced" to use 1111126 as the next assigned ORDERNUMBER before being restored to live use, it will generate NEW orders using 334 ordernumbers that have already been assigned to other customers. This scenario can create two possible problems. At best, it creates an obvious relational data integrity conflict that might cause the application to experience previously unseen errors that block an order from being processed, etc. A far worse scenario is that the applications cannot detect the duplication of ORDERNUMBER values and use them for a second customer, allowing customers to see the order contents and customer information of a different customer's order.
This is one scenario where all of the system backups were nominally targeted to run on the same full and incremental frequencies at approximately the same dates and times. What if March 14, 2024 at 10:00pm was chosen as everyone's restoration target but the incremental backup of System C's database for that date/time is corrupt for unrelated hardware reasons (hey, stuff happens...) and the next most recent backup is from two days prior? Now that data gap could be two days worth of transactions, posing a much wider opportunity for duplicate keys that will either breach confidentiality or trigger low-level database faults in the application, causing more actions to fail for employees and customers.
It is possible to design new systems from scratch to synthesize random values for use in joining these types of records together to avoid this synchronization problem. However, most existing systems built over the last twenty years were not designed with these types of recovery scenarios in mind. They were designed for relational integrity in a perfect world where no one could corrupt the assigned keys and a new true state never had to be re-assembled from out-of-sync backups. Since the systems weren't DESIGNED with these potential problems in mind, few administrators or developers have contemplated the work required to analyze the data in the recovered systems, identify the newest keys present in each, then alter all other databases to skip to the newest values while the gaps in the older records are filled back in by hand.
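A sketch of what that reconciliation work can look like follows, using SQLite purely so the example runs; the table names, queries and the shared ORDERNUMBER across billing and shipping mirror the scenario above and are otherwise hypothetical. The script finds the highest key visible anywhere and reports how far each restored system lags behind it; actually advancing a sequence or identity column is engine-specific and is left as a comment.

```python
# Minimal sketch of post-restore key reconciliation: find the highest
# ORDERNUMBER visible anywhere, then advance any database whose own
# sequence is behind it before the application is put back in service.
import sqlite3

def high_water_mark(conn, query):
    value = conn.execute(query).fetchone()[0]
    return value if value is not None else 0

def reconcile_order_numbers(systems):
    """systems: {name: (connection, 'SELECT MAX(ordernumber) FROM ...')}"""
    marks = {name: high_water_mark(conn, q) for name, (conn, q) in systems.items()}
    target = max(marks.values())
    behind = {name: target - mark for name, mark in marks.items() if mark < target}
    # In a real system you would now advance each lagging sequence / identity
    # column to target + 1 (syntax is engine-specific), not just report it.
    return target, behind

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE orders (ordernumber INTEGER)")
billing.executemany("INSERT INTO orders VALUES (?)", [(1110790,), (1110791,)])
shipping = sqlite3.connect(":memory:")
shipping.execute("CREATE TABLE shipments (ordernumber INTEGER)")
shipping.executemany("INSERT INTO shipments VALUES (?)", [(1111124,), (1111125,)])

target, behind = reconcile_order_numbers({
    "billing":  (billing,  "SELECT MAX(ordernumber) FROM orders"),
    "shipping": (shipping, "SELECT MAX(ordernumber) FROM shipments"),
})
print(target, behind)   # 1111125 {'billing': 334}
```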
Ransomware Mitigation Strategies
AI as an answer? In a word, NO. There may be consulting firms willing to pocket millions in professional services and develop a slide presentation for executives stating an organization has covered its risks from ransomware. However, no honest firm can state confidently that an organization will a) AVOID a ransomware attack or b) RECOVER from it without affecting its mission and incurring exorbitant expenses in recovery. Why?
First, existing security software generalizes PRIOR attack schemes into heuristics that are applied to current telemetry to identify system behaviors characteristic of known threats. These systems to date cannot PREDICT a novel combination of NEW infection paths and behaviors to stop a new attack in its tracks before an organization becomes "victim zero" for that attack.
Second, the ultimate meta version of the first argument is the claim that artificial intelligence will be capable of overcoming the limitation of analyzing PRIOR attacks by automating discovery and prevention. Certainly, security firms will begin touting "AI" capabilities in pitches to executives, trying to convince them that AI-based intrusion detection systems will solve the rear-view mirror problem. By definition, applying AI to this problem requires ceding control of an organization's systems TO an AI system. By definition, if the AI system is worth its licensing fees, it has "understanding" of security vulnerabilities and protective measures that humans do NOT have or are incapable of explaining. But if the AI system's "understanding" is beyond human understanding, that same AI system could be compromised in ways which facilitate its exploitation in an attack, and the victim might have even less knowledge of the initial "infection phase" or ways of recovering if the AI is in fact complicit.
Is this far fetched? ABSOLUTELY NOT. The SolarWinds breach that occurred in late 2020 had nothing to do with AI technology. However, it was an example of a supply-chain based breach. As more network monitoring and security systems are integrated with external data to provide more information to identify new threats, the instances of those systems within companies are trusted to update more of their software AUTOMATICALLY, in REAL TIME, from the vendor's systems. If an attacker wants to attack Companies A, B, C and D but knows they all use AI-based vendor X for intrusion detection, the attacker can instead work to compromise vendor X's system to deliver their "shim" directly to A, B, C and D without raising an eyebrow. Administrators at A, B, C and D can diligently verify they are running the latest software from X, they can verify the digital signatures on that software match those on the vendor's support site and they'll have zero clue the software from X is compromised. In the case of the SolarWinds attack, the shim had been embedded within a SolarWinds patch that had been installed at many customers for MONTHS prior to it waking up to perform its task.
Air-gapped systems? The value of air-gapped systems is complicated. Creating completely isolated environments to run back-up instances of key applications or house the infrastructure managing backups is recommended by many security experts. Conceptually, these environments have zero electronic connectivity (Ethernet or wifi) to "regular" systems, hence the term air gap. This separation is required because of the long potential interval between "infection time" and "onset time." An attacker who is patient or is coordinating a very complicated, large scale attack may stage the infection phase over MONTHS. (Again, in the SolarWinds case, some victims found their systems had been compromised for three to six months.) Once a victim realizes their systems are corrupt, they need a minimum base of servers, networking, file storage and firewall hardware they "know" is not compromised to use to push clean operating system images, application images and database backups back to scrubbed machines. If this "recovery LAN" was exposed to the regular network that is infected, then the "recovery LAN" gear could be infected and the victim has no safe starting point from which to rebuild.
Any organization implementing an air-gapped environment must be realistic about the costs involved to build and maintain it. An air-gapped environment can be pitched in two ways. One is strictly as a "seed corn" environment -- one sized only to handle the servers and storage required to house backups of databases, backups of server images (easier to do with modern containerization) and the routers, switches and firewalls required to use the "air gap" environment as the starting point to venture back out into the rest of the data center as racks are quarantined, wiped, reloaded and put back online. The second way some organizations think of an air-gapped environment is as another form of disaster recovery site -- one housing enough servers, storage, networking and firewall resources to operate the most critical applications in a degraded, non-redundant state. Duplicating hardware for even this reduced set of capability is very expensive.
More importantly, regardless of which flavor (seed corn or mini DR) is pursued, this extra air-gap environment is a perpetual expense for hardware, licensing and personnel going forward. The likelihood of most senior management teams agreeing to this large uptick in expense and consistently funding it in future years is near zero. Part of the reason most systems in large corporations exhibit the flaws already described is that the "business owners" who wanted the applications are only willing to support funding when the applications are new and delivering something new of value to "the business." Once that new capability becomes the status quo, extracting funding from those "business users" to MAINTAIN and modernize it when no new functionality is delivered involves rules of logic and persuasion which only function in the Somebody Else's Problem Field. (I will give the reader a moment to search that reference...)
Active / Archive Data Segmentation
One important strategy to minimize operational impacts of a ransomware attack and minimize recovery windows is to more explicitly and aggressively segment "active" data needed to process current actions for users / customers from "archival" data used to provide historical context for support or retention for regulatory compliance. An application serving ten million customers over the last five years may have a terabyte of data in total but the space occupied to reflect CURRENT customers and CURRENT pending / active orders might only be five percent of that terabyte. In this scenario, continuing to keep ALL of that data in a single database instance or (worse) in single database TABLES for each object means that if the database is lost or corrupted, a terabyte of data will need to be read from backup and written to a new empty recovery database before ANY of the records are available. The same would be true if the database were lost due to innocent internal failures like the loss of a SAN housing the raw data.
Applications for large numbers of customers can be designed to archive data rows no longer considered "current" to parallel sets of tables in the same database or (preferably) in a separate database so the current data required to handle a new order or support an existing customer with a new problem can be restored in a much shorter period of time. Data considered archival can be treated as read-only and housed on database instances with no insert / update capabilities behind web services that are also read only. Network permissions on those servers can be further restricted to limit any "attack surface" leveraged by ransomware.
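A minimal sketch of that archiving step is shown below, again using SQLite only so it runs as-is; the schema, cutoff policy and table names are assumptions. In a real deployment the archive target would be a separate, read-only database instance rather than a second table in the same file.

```python
# Minimal sketch of active/archive segmentation: rows whose orders closed
# before a cutoff move to an archive table (in practice a separate, read-only
# database), so a full restore of the ACTIVE set stays small and fast.
import sqlite3

ARCHIVE_CUTOFF = "2023-01-01"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_active  (ordernumber INTEGER PRIMARY KEY,
                                 customer_id INTEGER, closed_on TEXT);
    CREATE TABLE orders_archive (ordernumber INTEGER PRIMARY KEY,
                                 customer_id INTEGER, closed_on TEXT);
""")
conn.executemany("INSERT INTO orders_active VALUES (?, ?, ?)", [
    (1111120, 42, "2021-06-30"),   # old, belongs in the archive
    (1111125, 99, None),           # still open, stays active
])

with conn:  # single transaction: copy, then delete
    conn.execute("""
        INSERT INTO orders_archive
        SELECT * FROM orders_active
        WHERE closed_on IS NOT NULL AND closed_on < ?""", (ARCHIVE_CUTOFF,))
    conn.execute("""
        DELETE FROM orders_active
        WHERE closed_on IS NOT NULL AND closed_on < ?""", (ARCHIVE_CUTOFF,))

print(conn.execute("SELECT COUNT(*) FROM orders_active").fetchone()[0])   # 1
print(conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0])  # 1
```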
From experience, virtually no home-grown or consultant-designed system in Corporate America takes this concept into account. This results in systems that require each "full backup" to clone the ENTIRE database, even when ninety percent of the rows haven't changed in two years. More importantly, when the system is DOWN awaiting recovery, a restoration that should have taken 1-2 hours might easily take 1-2 DAYS due to the I/O bottlenecks of moving a terabyte of data from backup media to the server and SAN. If there are six systems all involved in a firm's "mission critical" function, these delays mean you have ZERO use of the function until ALL of the systems have completed their database restoration.
Designing for Synchronicity Issues
No application I encountered in thirty-plus years in Corporate America was designed with a coherent view of overall "state" as indicated by key values across the databases involved with the system. Most systems were organically designed to start from id=0 and grow from there as transactions arrived. They were NOT designed to allow startup by looking at a different set of tables that identified a new starting point for ID values of key objects. As a result, recovery scenarios like those above, where not all objects could be recovered to the same EXACT point in time, create potentially fatal data integrity problems.
Going forward, software architects need to explicitly design systems with a clear "dashboard" of the application's referential integrity: ID values of all key objects should be tracked continuously as "high water marks," and those high water marks should be adjustable upwards prior to restoring service, to avoid conflicts if some databases had to revert to older points in time with older ID values.
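Below is one possible shape for such a "dashboard," sketched under the assumption that the marks are written somewhere the production systems cannot overwrite them (ideally the air-gapped recovery environment described earlier); the object names, file format and safety margin are hypothetical.

```python
# Minimal sketch of a "high water mark dashboard": a scheduled job records the
# newest ID of each key object to a separate store so that, after a restore to
# an older point in time, operators know how far to advance each sequence
# before reopening the application.
import json
import time

# In production these would be live queries against each system of record.
CURRENT_MARKS = {
    "billing.ordernumber":  1111125,
    "shipping.shipmentid":  2220431,
    "support.ticketnumber": 8850092,
}

def record_marks(marks, path="high_water_marks.json"):
    snapshot = {"recorded_at": time.time(), "marks": marks}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2)

def restart_floor(path="high_water_marks.json", safety_margin=1000):
    """Smallest safe starting ID per key object after a restore: the last
    recorded mark plus a margin for activity since the final snapshot."""
    with open(path, encoding="utf-8") as fh:
        snapshot = json.load(fh)
    return {name: mark + safety_margin for name, mark in snapshot["marks"].items()}

record_marks(CURRENT_MARKS)
print(restart_floor())   # e.g. {'billing.ordernumber': 1112125, ...}
```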
Ideally, for an entirely new system with no integrations to legacy systems, use of any sequential or incrementing ID value in the system's data model should be avoided. Doing so would allow a system to be resurrected after infection / corruption and begin creating new clean records immediately, without needing to "pick up where it left off" from a point it can no longer accurately determine. This is a very difficult design habit to break since many other business metrics rely on ID values being sequential to provide a quick gauge of activity levels.
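A minimal sketch of the non-sequential approach, assuming nothing beyond the Python standard library; the Order dataclass is illustrative, not a real schema.

```python
# Minimal sketch of the non-sequential key idea: orders keyed by random UUIDs
# can be created immediately after a restore with no risk of colliding with
# IDs assigned before the attack, because no "next value" has to be known.
import uuid
from dataclasses import dataclass, field

@dataclass
class Order:
    customer_id: int
    # uuid4 is random; collisions are practically impossible, so the restored
    # system never needs to re-discover where the old sequence left off.
    order_id: uuid.UUID = field(default_factory=uuid.uuid4)

o1 = Order(customer_id=42)
o2 = Order(customer_id=42)
print(o1.order_id != o2.order_id)   # True -- independent of any prior state
```

One common workaround for the lost "quick gauge" is a separate, non-key counter or simple row timestamps, which keep the activity metric without re-coupling record identity to a sequence that must survive a restore intact.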
Integrated Training for Architects, Developers and Administrators
The roles of architects, software developers and administrators are highly specialized and often rigidly separated within large organizations. It is safe to say most architects are unfamiliar with the nuts and bolts of managing and restoring database backups across legacy SQL technologies, distributed stores and caches like Cassandra, Redis or memcached, or "big table" based systems like Hadoop and Bigtable. Most developers are not exposed to the entire data model of the application and instead only see small parts in a vacuum without context. Most administrators (database admins and server admins) know operational procedures inside and out but aren't involved at design or coding time to provide insight on model improvements or spot potential problems coming from design.
Because of these behavioral and organizational silos, very few applications are being designed with the thought of gracefully surviving a massive loss of data and restoring service within hours rather than days or weeks. Allowing for more coordination between these areas of expertise takes additional time, and these teams are already typically understaffed for the types of designs common ten years ago, much less the complexities involved with "modern" frameworks. One way to allot time for these types of discussions is to at least add time to new system projects for analysis of disaster recovery, failover from one environment to another and data model considerations. However, past experience again indicates that attempts to delay the launch of a new app for "ticky dot" stuff are akin to building a dam then asking EVERYONE to wait while engineers sanity test the turbines at the bottom. Even though a misunderstood design or a build flaw MIGHT cause the dam to implode once the water reaches the top, there will always be executives demanding that the floodgates be closed to declare the lake ready for boating as soon as possible. Even if a problem might be found that proves IMPOSSIBLE to solve with a full reservoir.
The More Sobering Takeaway from Ransomware
As mentioned at the beginning, there are few meaningful technical distinctions between the mechanisms malware and ransomware exploit to target victims. The primary difference lies in the motivation of the attacker. As described in the mechanics of these attacks, an INFECTION of a ransomware victim does not always lead to an actual loss of data by that victim. It is possible far more victims have been infected but were never "triggered" because either the attackers didn't see a revenue opportunity that was big enough or they saw a victim so large in size or economic / political importance that the attackers didn't want to attract the law enforcement focus that would result. The economics of ransomware work best on relatively small, technically backward, politically unconnected victims who can pay six or seven figure ransoms and want to stay out of the news.
Ransomware creators have likely internalized another aspect of financial analysis in their strategy. The cost of creating any SINGLE application capable of operating at Fortune 500 scale is typically at least $10 million, and the quality of such applications is NOT good, whether measured by usability, functionality, operational costs or security. The cost of integrating MULTIPLE systems capable of operating at Fortune 500 scale to accomplish some function as a unit can readily approach $30-50 million, and since the quality of the pieces is typically poor, the quality of the integrated system is typically worse.
Leadership is typically convinced that if it costs serious dollars to get crap, it will cost even more to FIX crap and even more to design a system that ISN'T crap from the start. Since current systems for the most part "work" and the company isn't appearing on the front page of the Wall Street Journal, leaders adopt a "Get Shorty" mindset regarding spending money to fix or avoid flaws in their enterprise systems that will only arise once in a blue moon. "What's my motivation?"
Well, as long as they don't get hit, there is no motivation. If they DO get hit but the ransom is only (say) $1 million, they cast themselves as sophisticated, rational risk takers and say "I avoided spending $10 million to fix the flaw, I got burned but it only cost me $1 million and a couple of days of disruption? I'm a genius." Frankly, that is the mindset the attackers are betting on. If they started charging $20 or $30 million to return a victim's data, those victims would definitely be rethinking their IT strategy, vulnerabilities would decline somewhat, and fewer companies would pay.
As stated before, however, that mindset of rationalized complacency does NOTHING to protect an organization if the attacker actually wants to damage the company. This is a sobering point because the same attack mechanisms COULD be used at any point for far wider economic, social or military damage. These more drastic outcomes are NOT being avoided because the money spent on existing intrusion detection and mitigation tools is "working" or because large corporations are simply ready for these failures and are curing them quickly / silently as they arise. These more drastic outcomes are not yet happening solely because those with the expertise to initiate them are still choosing to limit the scope of their attacks and in many cases are in it for the money. Businesses and governments are NOT prepared to fend off or recover from the type of damage that can result if these same capabilities are leveraged more widely for sheer destruction. In a data-driven world, mass destruction of data can directly cause mass destruction and disruption at a scale not previously contemplated. Organizations that state with confidence they are ready for what comes their way aren't grasping the full picture and are likely in denial about the survivability of their systems.