Last week, an estimated 8.5 million Windows PCs were hit by a buggy CrowdStrike Falcon sensor software update that led to BSODs (Blue Screens of Death) on affected systems, with the error message "csagent.sys (PAGE_FAULT_IN_NONEPAGED_AREA)." The effect was felt across sectors, with airlines like Delta having to cancel hundreds of flights.
CrowdStrike Falcon sensor SOAR (Security Orchestration, Automation and Response) is an endpoint security solution from the firm, intended to prevent malware and other cyberattacks.
Realizing that recovering from such a massive outage was not going to be easy, Microsoft pointed towards its guidance about restoring business and enterprise systems to an earlier working point as a temporary workaround before publishing a recovery tool.
CrowdStrike also offered comprehensive workarounds for the issue, which we covered in a dedicated piece. Following that, the firm published a new "Remediation and Guidance Hub" support page explaining how to deal with the issue, so that IT and system admins could find everything they needed in one place.
CrowdStrike also announced that it was testing a new technique to accelerate the recovery of affected systems. The cybersecurity firm has now published a Preliminary Post Incident Review (PIR) of the global outage on its Remediation and Guidance Hub page, which details the incident.
In a nutshell, the problem was a Rapid Response Content update containing a buggy InterProcess Communication (IPC) Template Instance that was incorrectly passed during validation. That botched IPC content is essentially what led to the consequent mayhem:
What Happened?
On Friday, July 19, 2024 at 04:09 UTC, as part of regular operations, CrowdStrike released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques...
Systems in scope include Windows hosts running sensor version 7.11 and above that were online between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC and received the update. Mac and Linux hosts were not impacted.
The defect in the content update was reverted on Friday, July 19, 2024 at 05:27 UTC. Systems coming online after this time, or that did not connect during the window, were not impacted.
What Went Wrong and Why?
CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed. The issue on Friday involved a Rapid Response Content update with an undetected error.
[..]
What Happened on July 19, 2024?
On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data. Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.
When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).
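To illustrate the kind of failure the review describes, here is a minimal Python sketch of a bounds check applied both before deployment and at the point of use. This is not CrowdStrike's code, and it does not reflect how the Falcon sensor is actually built: the names TemplateInstance, validate, interpret, and ContentError, as well as the sample field names, are all hypothetical and exist only for this example. The idea is simply that content referring to data that was never supplied should be rejected with a recoverable error instead of being read blindly.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


class ContentError(Exception):
    """Raised when content refers to an input field that does not exist."""


@dataclass(frozen=True)
class TemplateInstance:
    # Hypothetical shape: the positions of the input fields this entry inspects.
    name: str
    field_indices: Tuple[int, ...]


def validate(instance: TemplateInstance, field_count: int) -> None:
    """Pre-deployment check: every referenced index must fit the expected input size."""
    for index in instance.field_indices:
        if not 0 <= index < field_count:
            raise ContentError(
                f"{instance.name}: field {index} is outside 0..{field_count - 1}"
            )


def interpret(instance: TemplateInstance, inputs: Sequence[str]) -> List[str]:
    """Runtime read that re-checks bounds instead of trusting the validator."""
    values = []
    for index in instance.field_indices:
        if not 0 <= index < len(inputs):
            # Fail with a recoverable error rather than reading past the data.
            raise ContentError(f"{instance.name}: field {index} not supplied")
        values.append(inputs[index])
    return values


if __name__ == "__main__":
    inputs = ["pipe_name", "process_id", "user"]  # made-up sample fields
    good = TemplateInstance("good", (0, 2))
    bad = TemplateInstance("bad", (0, 5))  # index 5 does not exist
    print(interpret(good, inputs))
    try:
        validate(bad, len(inputs))
    except ContentError as err:
        print("rejected:", err)
```

In a high-level language, a missed check like this surfaces as an exception the program can catch. In low-level system code, an out-of-bounds read has no such safety net, which is why the review describes the result as an operating system crash rather than a failed update.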
You can read more about it on CrowdStrike's official guidance page.