Wednesday, July 24, 2024

This Week in Computing News

It has not been a fun week for users and owners of computers. First the news, then the observations.


CrowdStrike

The basics are by now well understood. A security monitoring application called Falcon, sold by CrowdStrike to at least half of the Fortune 1000, attempted to push an update with new virus signatures from its centralized control system to millions of its clients' machines and wound up "bricking" every one of those clients. Recovery required a human administrator with a special USB disk image to reboot each machine into a "safe" mode, delete one corrupt file, then reboot the machine again to restore normal operation.

The problem was caused by a flawed software testing and deployment process. The corrupt file was populated with all-zero values rather than the actual signature data expected by the Falcon code. The Falcon code runs as a kernel driver within the Windows operating system's "inner sanctum" of trust (ring 0), so if ANYTHING goes wrong with code running at kernel level, the system has no choice but to "blue screen" to prevent further corruption of user data or the operating system.

Due to the criticality of code running in the kernel, Microsoft requires any vendor delivering code that runs in the kernel to undergo explicit testing and certification to obtain a digital signature, which is checked when that code is loaded into the kernel to prevent corrupted or unexpected code from running. The Falcon CODE itself WAS signed. But the file that crashed the Falcon code was considered DATA and wasn't individually signed via the Microsoft certification process. The corrupt DATA in that file caused the signed Falcon code to crash anyway, triggering the "blue screen" lockup.
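To make that gap concrete, here is a minimal sketch of what treating the DATA file with the same suspicion as the CODE might look like: verify a vendor signature over the content file, and refuse anything empty or all zeros, before any parsing happens. This is a conceptual user-mode Python illustration only; the Ed25519 key, detached signature file, and file names are assumptions, not CrowdStrike's or Microsoft's actual mechanism.

    # Conceptual sketch (requires the third-party "cryptography" package).
    # Hypothetical layout: the vendor ships content.bin plus a detached signature
    # made with a key whose public half is baked into the agent.
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
    from cryptography.exceptions import InvalidSignature

    def content_file_is_trusted(data: bytes, signature: bytes, public_key_bytes: bytes) -> bool:
        """Return True only if the content bytes verify against the vendor's signature."""
        if not data or all(b == 0 for b in data):
            return False  # reject empty or all-zero files outright
        try:
            Ed25519PublicKey.from_public_bytes(public_key_bytes).verify(signature, data)
            return True
        except InvalidSignature:
            return False

    # Usage (hypothetical file names):
    #   data = open("content.bin", "rb").read()
    #   sig = open("content.bin.sig", "rb").read()
    #   if not content_file_is_trusted(data, sig, VENDOR_PUBLIC_KEY):
    #       keep the previous known-good file and refuse to load the update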

This failure bricked an estimated 8.5 million machines before CrowdStrike stopped its push. That points to a variety of problems for both CrowdStrike and its corporate customers.

First, CrowdStrike clearly has glaring flaws in its quality control and testing processes at development time. There is no way a "build" of code + data that produced this behavior should have escaped detection in testing, and the build automation that combined the code and data into a final release should have blocked it.
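As a thought experiment, even a crude automated gate in the release pipeline would have caught a file of all zeros. A minimal Python sketch, with a made-up magic header and size floor standing in for whatever format checks the real channel files would need:

    import sys

    EXPECTED_MAGIC = b"CHNL"   # hypothetical header; the real file format is not public
    MIN_SIZE_BYTES = 1024      # hypothetical floor for a plausible signature payload

    def release_gate(path: str) -> list[str]:
        """Return a list of reasons to block the release; an empty list means pass."""
        problems = []
        data = open(path, "rb").read()
        if len(data) < MIN_SIZE_BYTES:
            problems.append(f"file too small ({len(data)} bytes)")
        if data and all(b == 0 for b in data):
            problems.append("file is entirely zero bytes")
        if not data.startswith(EXPECTED_MAGIC):
            problems.append("missing expected header")
        return problems

    if __name__ == "__main__":
        issues = release_gate(sys.argv[1])
        if issues:
            print("BLOCK RELEASE:", "; ".join(issues))
            sys.exit(1)
        print("gate passed")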

Second, CrowdStrike's operations and monitoring systems, which collect data from all of those millions of client machines, clearly lack any logic that puts a short circuit in place for unexpected events. If you operate a software product that CONTINUOUSLY collects status and forensic data from MILLIONS of endpoints and also PUSHES updates to those MILLIONS of machines, then any time you push a release and the recipient "disappears" off the network and never comes back with a new status update, you have a problem. When EVERY client you've pushed an update to has disappeared and never reached back out to the mother ship, you should have logic that immediately halts all outbound software pushes. It seems CrowdStrike had to rely on customers physically contacting them to stop this software push.
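Here is a sketch of that kind of short circuit. The thresholds, heartbeat model, and data structures below are assumptions for illustration, not a description of CrowdStrike's actual telemetry pipeline.

    from dataclasses import dataclass, field
    import time

    @dataclass
    class RolloutCircuitBreaker:
        silence_window_s: float = 600.0      # how long an endpoint may stay silent after an update
        max_silent_fraction: float = 0.02    # halt if more than 2% of updated endpoints go dark
        pushed_at: dict = field(default_factory=dict)   # endpoint_id -> push timestamp
        last_seen: dict = field(default_factory=dict)   # endpoint_id -> last heartbeat timestamp

        def record_push(self, endpoint_id: str) -> None:
            self.pushed_at[endpoint_id] = time.time()

        def record_heartbeat(self, endpoint_id: str) -> None:
            self.last_seen[endpoint_id] = time.time()

        def should_halt(self) -> bool:
            """True when too many updated endpoints never phoned home after the push."""
            now = time.time()
            overdue = [
                e for e, pushed in self.pushed_at.items()
                if now - pushed > self.silence_window_s and self.last_seen.get(e, 0.0) < pushed
            ]
            if not self.pushed_at:
                return False
            return len(overdue) / len(self.pushed_at) > self.max_silent_fraction

The point of the design is that the halt decision is automatic and conservative: a small fraction of silent endpoints after a push stops everything, and a human has to decide to resume.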

Third, this type of "supply chain" vulnerability needs an operational and legal re-think, both by software companies using this operating model and by customers agreeing to pay for this type of service. In most IT and network departments, competent veterans are wary of instantly loading the latest release from ANY vendor. "Let the vendor do their R&D and beta testing on someone else's network." That's the normal mindset.

For SECURITY software, most IT and network administrators have been forced to override that default caution. BY DEFINITION, updates to security software are supposed to supply new "signatures" that allow your baseline anti-virus and malware detection software to detect and stop newly identified threats to your systems. The whole point of these cloud-based systems is that they can detect odd patterns of behavior seen across thousands of customers, correlate them to some new piece of malware, then release a "signature" that tells the client agent how to find that malware and neutralize it before it infects a system. You are SUPPOSED to accept those new signatures as soon as possible to minimize infection risk.
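For readers unfamiliar with how a "signature" actually gets used on the endpoint, here is a deliberately naive Python illustration of the idea: match file hashes or distinctive byte patterns against a downloaded signature list. Real products use far more sophisticated behavioral and heuristic detection; the digests and byte patterns below are placeholders, not real indicators.

    import hashlib

    # Placeholder signature data; a real feed would carry many thousands of entries.
    KNOWN_BAD_SHA256 = {
        "0000000000000000000000000000000000000000000000000000000000000000",
    }
    KNOWN_BAD_PATTERNS = [
        b"EXAMPLE_MALWARE_MARKER",
    ]

    def file_matches_signature(path: str) -> bool:
        """Naive check: flag a file whose hash or contents match a known signature."""
        data = open(path, "rb").read()
        if hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256:
            return True
        return any(pattern in data for pattern in KNOWN_BAD_PATTERNS)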

The problem is that those signature files are just data. And bad data read by flawed software, even if the software itself is unaltered and digitally signed for trustworthiness, can crash in unexpected ways and, in some cases, unrecoverable ways. Yet many large companies use products like CrowdStrike partly as a means of mitigating their legal risk to their own customers. If you operate a Fortune 500 firm WITHOUT these types of tools, then get hacked and shut down with ransomware, your stockholders will inevitably sue your board, CEO and CIO for incompetence and failure to protect the ability of the company's assets to produce income for stockholders. Your customers are also likely to sue you for failing to deliver the services a reasonable customer would expect, be it hospital care, electricity, internet service, etc.

Corporate IT leaders are clearly going to need to re-think the operational controls in place around their systems. Anti-virus and malware systems will always be required, but it may be that NO vendor will be selected and deployed unless the corporate customer controls how software updates are pushed out in "waves," ensuring a "brick" problem can be spotted after a few dozen machines rather than letting the vendor push to 100% of all covered machines without intervention.
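One way to picture customer-controlled waves: a rollout plan where each ring must soak for a period with an acceptable failure rate before the next ring becomes eligible. The ring sizes, soak times, and failure threshold below are illustrative assumptions, not any vendor's actual policy.

    # Hypothetical wave plan: a tiny canary ring first, broader rings only after a soak period.
    WAVES = [
        {"name": "canary", "fraction": 0.001, "soak_hours": 24},
        {"name": "ring1",  "fraction": 0.05,  "soak_hours": 24},
        {"name": "ring2",  "fraction": 0.25,  "soak_hours": 48},
        {"name": "broad",  "fraction": 1.00,  "soak_hours": 0},
    ]

    def next_wave_allowed(current_wave: int, hours_soaked: float, failure_rate: float,
                          max_failure_rate: float = 0.01) -> bool:
        """Gate promotion to the next wave on soak time and observed failure rate."""
        wave = WAVES[current_wave]
        return hours_soaked >= wave["soak_hours"] and failure_rate <= max_failure_rate

    # Example: the canary ring has soaked 24 hours with a 0.2% failure rate,
    # so promotion to ring1 would be allowed.
    assert next_wave_allowed(0, hours_soaked=24, failure_rate=0.002)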

In the case of CrowdStrike, this apparently isn't a "pattern of one." CrowdStrike released an update for its client running on Debian Linux machines in April 2024 that created a nearly identical brick problem. It did the same thing with an update for Rocky Linux machines in May 2024. But the pattern goes back further than that. Much further.

In 2010, McAfee pushed a similar "signature update" for its AV software that flagged a core Windows system file, leaving machines unable to communicate with any other system after reboot. McAfee's chief technology officer at the time? George Kurtz. The current CEO of CrowdStrike? The same George Kurtz.


Microsoft Patch Woes

Not to be confused with the CrowdStrike issue above, which INVOLVED Windows computers but was not directly caused by Microsoft, Microsoft itself DOES have a new problem with software updates it began pushing in July 2024. A patch for Windows 11 machines contained a bug that caused machines with BitLocker disk encryption enabled to prompt the user for their BitLocker recovery key (the "unencryption password") after rebooting. This affects both individual and corporate users. The problem is that many users may have enabled BitLocker without thinking about it much or understanding what they were doing at all. As such, they may have no idea how to get their recovery key.

The recovery key can be obtained by logging into the Microsoft cloud account that owns the operating system, finding the machine listed as a device, then drilling down into that machine to find the BitLocker key to supply at the prompt. It's not the end of the world, since recovery doesn't require a physical visit to the machine by an administrator and adoption of Windows 11 is still low for both consumer and corporate users. However, in a corporate setting, each fix will probably be a 3-5 minute phone call to an IT helpdesk, and productivity halts completely until each user gets their recovery key entered. For consumer users, it could take hours to understand the problem, find help online on exactly how to log into their online account, wade through the screens and find the magic recovery key.


Fatal Intel Hardware Design Flaw

After months of speculation by various experts outside Intel, Intel itself now seems to be confirming at least part of an emerging design flaw in its latest 13th and 14th generation "Raptor Lake" architecture CPU chips. The problem is fatal... fatal to the chip itself. In a nutshell, circuitry within the chip seems to be overdriving the voltages delivered to chip internals, causing physical damage that ultimately reduces the voltage and current delivered within the chip and triggers functional failures between the chip and other components like memory, bus controllers, expensive graphics cards, etc.

At a more detailed level, all modern chips have conductive structures called "vias" distributed across the entire area of the chip to deliver voltage and current to the billions of transistors making up the chip. To minimize power consumption, minimize heat generation and maximize processing power when needed, logic in the chip monitors workloads and makes small adjustments to the voltages delivered to each core, raising the voltages (and heat / power consumption) under heavy load and lowering voltages under lighter processing demands.

The problem with the new Intel chips appears to be that the microcode logic running within the chip to optimize these voltage levels based on load has a flaw that results in the chip running at HIGHER voltages than are either needed or expected by the overall design. In effect, these chips are essentially cooking themselves to death.
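A toy model helps show why a small bias in that logic matters. This is NOT Intel's microcode or its real voltage tables; the numbers are invented purely to illustrate how an uncorrected offset pushes voltage requests past a safe ceiling under heavy load while looking harmless at idle.

    # Toy model of load-based voltage scaling (illustrative numbers only).
    VMIN, VMAX_SAFE = 0.75, 1.40   # volts: idle floor and assumed safe ceiling

    def requested_voltage(load: float, bias: float = 0.0) -> float:
        """Map utilization (0..1) to a core voltage request; `bias` models the flaw."""
        v = VMIN + load * (VMAX_SAFE - VMIN) + bias
        return v  # a correct implementation would clamp: min(v, VMAX_SAFE)

    for load in (0.25, 0.75, 1.0):
        v = requested_voltage(load, bias=0.1)
        print(f"load={load:.2f} request={v:.2f}V over_safe_limit={v > VMAX_SAFE}")

In this sketch the request only exceeds the ceiling at full load, which is consistent with the failures showing up first on machines driven hard for long stretches.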

This isn't a code problem that causes faulty mathematical calculations in certain corner cases (like the Pentium FDIV issue in the 1990s) or a security vulnerability that allows other code to snoop on execution pipelines and steal data or alter instructions. As described by those outside Intel who have been investigating this issue, this problem will ultimately physically destroy the chip.

Among the first to find the problem were game developers, who began seeing their own applications crash. Games are among THE most intensive users of computing power, and it is common for game applications to run CPUs at nearly 100% continuously. This usage obviously causes the CPU chip to run "hotter," closer to its limits. However, game developers investigating their own code quickly found the failures began occurring even when NOT running their games. They then realized the machines they were using for testing began crashing on nearly any other task within about three or four months of heavy use.

Intel has begun commenting on this issue, though curiously not initially on its own website. At the moment, Intel is conveying that the failures are being caused by the microcode problem summarized above. This seems to indicate Intel believes corrections to that microcode via a firmware patch might eliminate the problem.

Others are not so sure. One fear mentioned online is that if the problem is "influenced" by the microcode flaw but not ENTIRELY due to that flaw, Intel might release a microcode patch that simply lowers core voltages far below previously designed levels as a means to keep the chips as cool as possible...

...while the warranty period elapses and the remaining design flaws continue driving the chip toward failure once it is out of warranty. For chips costing between $550 and $670 each, it's easy to question Intel's motives around possible fixes for this problem. Stepping up to provide a replacement CPU for every affected unit would cost not just the value of the new chip but also the consumer's time to swap the chip out, something many customers would not be comfortable doing on their own.

The release dates of the two chip series were October 2022 for the 13th gen and October 2023 for the 14th gen. Overall chip share is split roughly 78 / 13 / 8 between Intel, AMD and Apple. Intel's revenue from PC chips was about $40 billion for 2023, so even if these bleeding-edge chips are only 15% of Intel's volume, this problem poses a threat to roughly $6 billion in revenue.

In the interest of full disclosure, I just bought a brand new deeeee-luxe system in April 2024 for $2900 to replace a pair of older machines dating from 2010 and 2011. At the heart of that new machine? An Intel i9 14900KF processor, one affected by this design flaw. I don't run it 24x7 and am not a gamer (at all) but it is on probably 8-10 hours per day and is running three or four virtual machines as various Linux images. So far, I have not experienced any crashes of any kind. I also have zero financial position in Intel stock.


As mentioned at the outset, this has not been a fun week for computer users or computer owners. These stories point out that the level of market concentration in the hardware and software industries poses unique challenges. Businesses enjoy many economies of scale from having only one or two predominant desktop operating systems for employees to master to be productive. Businesses creating software benefit from only having to write their code for a very limited number of operating systems. And it would seem individual owners enjoy the price reductions on such complex products that are only possible with massive scale and reduced variety.

However, all three of these failures are examples of "mono-culture" problems: systems vulnerable to massive, expensive failures with little predictability beforehand. With technologies as complex as these, there's no magic wand to wave to prevent such issues. Yet it's not clear that current regulations, market incentives and legal norms can ensure equitable outcomes when multi-billion-dollar failures like these crop up.


WTH