Microsoft Sees 8.5M Systems Hit by Faulty CrowdStrike Update
Cybersecurity Vendor Reports 'A Significant Number Are Back Online and Operational'

Microsoft said 8.5 million Windows hosts were affected by the Friday outage caused by a faulty CrowdStrike software content update.
This represents "less than 1% of all Windows machines," said David Weston, Microsoft's vice president for enterprise and OS security, in a blog post.
Even so, this appears to be the largest IT outage in history. "While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services," Weston said. Cue massive disruptions: doctors' offices and hospitals, banks and stock exchanges, and stores were knocked offline; hundreds of thousands of travelers faced flight and railway cancellations; and public safety concerns, border-crossing delays and much more chaos followed.
As of Monday, many organizations said that while the disruptions are lessening, they're still working through the resulting backlogs (see: CrowdStrike, Microsoft Outage Uncovers Big Resiliency Issues).
Technical details CrowdStrike released Saturday said "the trigger" for the outage occurred Friday at 04:09 UTC, when, as part of regular operations, the company pushed an updated configuration file to the threat detection component of its customers' Falcon endpoint detection and response software. The file's name begins with C-00000291- and ends with .sys, but it is not a kernel driver; rather, it is a threat signature file for the company's sensor.
"Sensor configuration updates are an ongoing part of the protection mechanisms of the Falcon platform," the vendor said. "This configuration update triggered a logic error resulting in a system crash and blue screen - BSOD - on impacted systems," referring to the Windows crash error known as the "blue screen of death." The company quickly updated the faulty file to correct it and has been working to help affected organizations deal with the fallout.
"CrowdStrike continues to focus on restoring all systems as soon as possible," the company said in a Sunday post on LinkedIn. "Of the approximately 8.5 million Windows devices that were impacted, a significant number are back online and operational."
CrowdStrike added: "Systems running Linux or macOS do not use Channel File 291 and were not impacted."
Numerous IT administrators report working nonstop through the weekend to try to restore affected Windows systems. In best-case scenarios, systems only need to be rebooted for Windows to automatically excise the bad version of Channel File 291. For many Windows hosts, the fix is more involved.
The work is not glamorous: "it is thousands and thousands of IT professionals going, literally, system by system and deleting the problematic file," one admin said in a post to Reddit.
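To illustrate what that per-system step amounts to, here's a minimal Python sketch of the widely circulated workaround, assuming the same reported directory as above. It would need to run with administrator rights from Safe Mode or the Windows Recovery Environment on an affected machine, and it removes only files matching the Channel File 291 naming pattern.

```python
import pathlib

# Widely reported location of Falcon sensor channel files on Windows (assumption).
CHANNEL_DIR = pathlib.Path(r"C:\Windows\System32\drivers\CrowdStrike")


def remove_channel_291() -> list[str]:
    """Delete Channel File 291 variants; run as administrator from Safe Mode or WinRE."""
    removed = []
    for path in CHANNEL_DIR.glob("C-00000291*.sys"):
        path.unlink()  # remove the defective signature file
        removed.append(path.name)
    return removed


if __name__ == "__main__":
    names = remove_channel_291()
    print(f"Removed {len(names)} file(s): {', '.join(names) or 'none found'}")
```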
For its part, Microsoft said it has deployed hundreds of its "engineers and experts to work directly with customers to restore services," and that it's liaising with other cloud providers, including Google Cloud Platform and Amazon Web Services, to try to help speed recovery.
As recovery continues, the U.S. Cybersecurity and Infrastructure Security Agency has warned users to beware criminals' attempts to turn the outage to their advantage, including via social engineering attacks. "Cyber threat actors continue to leverage the outage to conduct malicious activity, including phishing attempts," CISA said (see: Fake Websites, Phishing Surface in Wake of CrowdStrike Outage).
Full Recovery Still to Come
Experts said fully recovering all affected systems may take weeks.
Some tools and workarounds do appear to be helping to speed the process for at least a subset of organizations that use CrowdStrike.
On Sunday, Microsoft - working with CrowdStrike - released an updated recovery tool that runs from a bootable USB drive and is designed to work with Windows clients and servers, as well as Hyper-V virtual machines. A signed version of the recovery tool can be downloaded from Microsoft's download center. Microsoft said admins will need "a BitLocker recovery key for each BitLocker-enabled impacted device on which the generated USB device will be used."
The tool offers two potential recovery options:
- Windows Preinstallation Environment: This lightweight Windows PE version of the operating system is designed for making repairs and doesn't require local administrator privileges, although "if the device uses BitLocker, you may need to manually enter the BitLocker recovery key before you can repair an affected system," Microsoft said.
- Safe Mode: "This option to recover from safe mode may enable recovery on BitLocker-enabled devices without requiring the entry of BitLocker recovery keys," Microsoft said. "You need access to an account with local administrator rights on the device." This will work with devices that use a Trusted Platform Module to protect the system, although a user may have to enter their TPM PIN code if one has been set. It will not work with any encrypted disks.
One shortcoming is that "if the device is encrypted this tool won't be able to bypass that," said cybersecurity expert Brian Honan in a post to social platform Mastodon.
Hence for systems configured to use full-disk encryption - as cybersecurity experts have long recommended doing - the recovery process is typically much more laborious and frustrating. Exact requirements can vary based on the type of IT management software being used, what type of full-disk encryption is in place and more.
For many users, a copy of their BitLocker recovery key will have been saved by default to their Microsoft account and can be accessed by logging into the account from a different device, typically their smartphone. In some cases, admins can also compile lists of BitLocker recovery keys. The next step involves having IT personnel boot each system in person using a USB key running the Microsoft utility, then entering the BitLocker recovery key on a per-system basis.
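As an illustration of that key-lookup step, the following Python sketch reads a hypothetical CSV export of device names and BitLocker recovery keys - for example, one an admin has pulled from Microsoft Entra ID or Active Directory - and prints the 48-digit key for a given hostname. The column names and file layout are illustrative assumptions, not a Microsoft export format.

```python
import csv
import sys


def lookup_recovery_key(export_path: str, hostname: str) -> str | None:
    """Return the BitLocker recovery key for a hostname from a CSV export.

    Assumes a hypothetical export with 'DeviceName' and 'RecoveryKey' columns.
    """
    with open(export_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["DeviceName"].strip().lower() == hostname.lower():
                return row["RecoveryKey"].strip()
    return None


if __name__ == "__main__":
    export_file, host = sys.argv[1], sys.argv[2]
    key = lookup_recovery_key(export_file, host)
    print(key if key else f"No recovery key found for {host}")
```

A field technician could run this from a laptop at each affected machine, passing the export file and the hostname on the command line, rather than searching a spreadsheet by hand.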
Some admins report success in finding ways to speed the recovery process inside their own environment - for example, by using a Preboot Execution Environment for network-connected systems and involving end users.
"I created a Script + Process for enabling end-user self-service of BitLockered machines still affected by this incident," one admin posted to Reddit. "This will allow you to send out instructions for your end-users to PXE boot and then sit for a minute while their PC automatically runs a task sequence that will unlock BitLocker + fix the issue on the OS volume and boot them back into a working OS."
Accelerated Technique Promised
Could other time-saving approaches be forthcoming?
On Sunday, CrowdStrike previewed the release of "a new technique to accelerate impacted system remediation" that it's been testing with customers. "We're in the process of operationalizing an opt-in to this technique. We're making progress by the minute."
CrowdStrike on Monday further updated its management software to give administrators the ability to identify all Windows hosts that still have the defective sensor configuration update. To access that view, it said, open the CrowdStrike dashboard and navigate to: Next-Gen SIEM > Log management > Dashboards.
Some admins have reported that the dashboard's various iterations haven't been returning reliable results, in part because it appears not to fully catalog all of the types of systems that may be involved.
"I sure wish this custom dashboard thing would stop changing," one admin posted to Reddit. "I'm being asked to report numbers on this situation and they go up, they go down, they go back up again."
The admin added: "It's difficult when this product has stained my reputation at my job (fair or not) and I cannot even produce consistent metrics."