Over the past 10 days, CrowdStrike and Microsoft have been working around the clock to help customers affected by the massive Windows BSOD issue caused by a faulty CrowdStrike update. Along with providing ways to fix the issue, CrowdStrike has already published its Preliminary Post Incident Review for this outage. According to that report, the BSOD was caused by a memory safety issue: the company's CSagent driver performed an out-of-bounds memory read, which triggered an access violation.
Yesterday, Microsoft published its detailed technical analysis of the outage caused by the CrowdStrike driver. Microsoft's analysis confirmed CrowdStrike's findings that the crash was due to an out-of-bounds read, a memory safety error, in CrowdStrike's CSagent.sys driver. The CSagent.sys module is registered on a Windows PC as a file system filter driver so that it receives notifications about file operations, including the creation or modification of a file. This allows security products, including CrowdStrike, to scan any new file saved to disk.
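To make the class of bug concrete, here is a minimal Python sketch of a bounds-checked read. It is an illustration only, not CrowdStrike's or Microsoft's code: the function name `read_field` and the sample data are hypothetical. In a kernel driver written in C or C++, reading past the end of a buffer has no automatic check and can crash the whole system, so the check has to be written explicitly, roughly like this:

```python
from typing import Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None if the index is outside the supplied data.

    Hypothetical helper for illustration; not taken from any real driver.
    """
    # Validate the index against the number of fields actually supplied
    # before touching the data, instead of trusting the caller.
    if 0 <= index < len(fields):
        return fields[index]
    return None


supplied = ["alpha", "beta", "gamma"]
print(read_field(supplied, 1))   # index inside the supplied data
print(read_field(supplied, 5))   # index past the end is rejected, not read
```

The point of the sketch is the order of operations: the length is checked first, and an out-of-range request is turned into a handled result rather than a memory access.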
When the incident happened, Microsoft drew a lot of criticism for allowing third-party software developers kernel-level access. In the blog post, Microsoft explained why it offers kernel-level access for security products:
- Kernel drivers allow for system-wide visibility and the capability to load early in the boot process to detect threats like bootkits and rootkits, which can load before user-mode applications.
- Microsoft offers features such as system event callbacks for process and thread creation, file filter drivers, and more.
- Kernel drivers offer better performance for cases like high-throughput network activity.
- Security solutions want to ensure that their software cannot be disabled by malware, targeted attacks, or malicious insiders, even when those attackers have admin-level privileges. Windows offers Early Launch Antimalware (ELAM) early in the boot process for this reason.
However, kernel drivers come with a tradeoff: they run at the most trusted level of Windows, where a fault can crash the entire system rather than a single application. Microsoft is also working to move complex Windows core services, such as font file parsing, from kernel mode to user mode.
Microsoft recommends security solution providers balance needs like visibility and tamper resistance with the risk of operating within kernel mode. For example, they can use minimal sensors that run in kernel mode for data collection and enforcement, limiting exposure to availability issues. The rest of the features, like managing updates, parsing content, and other operations, can occur isolated within user mode.
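The isolation idea can be sketched by analogy in ordinary user-mode code. The Python example below is a hypothetical illustration, not how any security product is built: a small parent process stands in for the minimal sensor, and content parsing runs in a separate worker process. The names `PARSER_SOURCE` and `parse_isolated`, and the rule that input starting with "BAD" makes the parser fail, are assumptions made up for the demo.

```python
import subprocess
import sys
from typing import Optional

# Hypothetical stand-in for a complex content parser. It fails on input
# that starts with "BAD" to simulate a parsing bug.
PARSER_SOURCE = """
import sys
data = sys.stdin.read()
if data.startswith("BAD"):
    raise SystemExit(3)
print(len(data))
"""


def parse_isolated(content: str, timeout: float = 10.0) -> Optional[int]:
    """Parse content in a separate process; return None if the parser fails."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", PARSER_SOURCE],
            input=content,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung parser is treated as a failure of the worker only.
        return None
    if result.returncode != 0:
        # The worker failed, but the calling process keeps running.
        return None
    return int(result.stdout.strip())


print(parse_isolated("hello"))      # parsed by the worker process
print(parse_isolated("BAD input"))  # worker fails; caller survives
```

The design choice this illustrates is containment: a failure in the complex parsing step ends one worker process and is reported to the caller, instead of taking down the component that must stay available.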
In the blog post, Microsoft also explained the built-in security features of the Windows OS. These capabilities offer layers of protection against malware and exploitation attempts in Windows. Microsoft will work with the anti-malware ecosystem through the Microsoft Virus Initiative (MVI) to take advantage of these built-in features and further increase both security and reliability.
Microsoft has planned the following for now:
- Providing safe rollout guidance, best practices, and technologies to make it safer to perform updates to security products.
- Reducing the need for kernel drivers to access important security data.
- Providing enhanced isolation and anti-tampering capabilities with technologies like the recently announced VBS enclaves.
- Enabling zero-trust approaches like high-integrity attestation, which provides a method to determine the security state of the machine based on the health of Windows native security features.
While over 97% of Windows PCs affected by this issue are back online as of July 25, Microsoft is now looking ahead to prevent such issues in the future. John Cable, Vice President of Windows Program Management at Microsoft, recently published a blog post on this CrowdStrike issue where he mentioned that Windows must prioritize change and innovation in the area of end-to-end resilience, which is exactly what customers will expect from Microsoft.
Source: Microsoft