CrowdStrike Incident explained for the average Joe.



By now, you’ve probably heard the news about the CrowdStrike software update that took down computer systems worldwide.

CrowdStrike is a US-based company whose software helps protect computer systems from being hacked.

The version of software affected ran on Microsoft Windows, but Microsoft did not cause the crash; a CrowdStrike update did.

To understand what happened, we must first look at how Windows is designed.

Think of Windows as the earth. On the surface of the earth is where everything lives: your apps, your files, and most of what you do on your computer.

Then down deep in the center of the Earth, we have "The core". In Windows, it's called “The Kernel.”

The kernel is a very low-level part of Windows that is responsible for its overall operation.

Drivers are an example of software that runs inside the kernel; they allow, for example, your computer to output a display to your screen.

Very few applications run inside the kernel. Because the kernel is such a low-level part of the system, any software bug there can crash the entire system.

If you are a security provider like CrowdStrike or an antivirus company, you need to work beneath all the other software to protect the system from hackers and viruses. You need to be in the kernel.

As a piece of software running in "the core" of the operating system, you want to be careful. This is why CrowdStrike didn't update that piece of software very much. They left well enough alone.

Instead, they would release daily updates outside the kernel, much like how your antivirus downloads daily definition updates telling it what new viruses to look for.

The problem on Friday is that CrowdStrike released one of those daily updates that, for some reason, was empty. The file was all 0s, and the software they had been running in the kernel had a bug:

When it tried to process a file with all 0s, it crashed, and because it was running in the core, when Windows tried to boot, it would crash, too.
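CrowdStrike hasn't published the parser's source, so as a purely illustrative analogy (hypothetical names, Python instead of kernel-mode C), here is the shape of that kind of bug: a parser that blindly trusts a value read from the file, so a file of all 0s crashes it.

```python
def parse_channel_file(data: bytes) -> int:
    # Toy analogy, NOT CrowdStrike's real code or file format: read a
    # "record size" from the file header and trust it without checking.
    record_size = int.from_bytes(data[:4], "little")
    # With a file of all 0s, record_size is 0 and this division crashes.
    # In user space that kills one program; in the kernel, it takes the
    # whole operating system down with it.
    return len(data) // record_size
```

Call it with a well-formed buffer and it works; call it with `bytes(16)` (all zeros) and the division by zero raises an unhandled exception.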

To fix a computer with the issue, you had to find that file with all 0s and delete it.
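The widely circulated workaround was to boot the machine into Safe Mode or the recovery environment and delete the bad channel file (the `C-00000291*.sys` pattern) from the CrowdStrike driver folder, `C:\Windows\System32\drivers\CrowdStrike`. A hedged sketch of that cleanup step in Python (the function name is illustrative; in practice this was typically done by hand or with a small script):

```python
import glob
import os

def remove_bad_channel_files(driver_dir: str) -> list[str]:
    """Delete channel files matching the faulty C-00000291*.sys pattern.

    On a real machine driver_dir would be
    C:\\Windows\\System32\\drivers\\CrowdStrike, reached from Safe Mode
    or a recovery environment, since Windows itself won't boot.
    """
    removed = []
    for path in glob.glob(os.path.join(driver_dir, "C-00000291*.sys")):
        os.remove(path)
        removed.append(path)
    return removed
```

Only files matching the faulty pattern are touched; other channel files in the same folder are left alone.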

While that sounds easy and is pretty simple in most cases, it requires someone to be physically in front of the computer.

This is because a LOT of remote access software requires Windows to be running, and in this case, Windows won't boot.

Now, imagine computer systems scattered across the country that are typically managed remotely, sometimes from hundreds of miles away, or an organization with this software on 10,000 to 200,000 computers. All taken down.

The scenario above played out worldwide on roughly 8.5 million computers, crashing each one.

I hope this gives you more insight into how this update caused so much trouble.


On 22/07/2024 at 11:14, SarK0Y said:

no proper staging to roll things out... yeah, it was funny :)

Yes it was hilarious. 🙄


"This is why Crowdstrike didn’t update that piece of software very much. They left well enough alone."

...Only several times per day depending on the channel update policy set. The busted channel file (C-00000291-00000000-00000032.sys) was the cause of the outage, the later revision (> C-00000291-00000000-00000033.sys) works as expected.


On 22/07/2024 at 19:45, binaryzero said:

"This is why Crowdstrike didn’t update that piece of software very much. They left well enough alone."

...Only several times per day depending on the channel update policy set. The busted channel file (C-00000291-00000000-00000032.sys) was the cause of the outage, the later revision (> C-00000291-00000000-00000033.sys) works as expected.


I said they didn't update the kernel code much. The definition files are executed by the driver in kernel mode, but the code that would have been sent to Microsoft to get certified hasn't been updated much.


The sensor driver (csagent.sys, the signed file Dave is referring to) wasn't updated; the definition file, when parsed, is what caused the machine to crash.


On 22/07/2024 at 19:58, binaryzero said:

The sensor driver (csagent.sys, the signed file Dave is referring to) wasn't updated; the definition file, when parsed, is what caused the machine to crash.

More or less, that's what I said in the written piece above. The goal was not to get too technical.


On 23/07/2024 at 03:45, binaryzero said:

"This is why Crowdstrike didn’t update that piece of software very much. They left well enough alone."

...Only several times per day depending on the channel update policy set. The busted channel file (C-00000291-00000000-00000032.sys) was the cause of the outage, the later revision (> C-00000291-00000000-00000033.sys) works as expected.

The very question: how was the faulty version rolled out? :)


I'm shocked at the lack of bounds checking when parsing something that is essentially executed. 
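For readers wondering what "bounds checking" means here: validating every value read from the file before acting on it, so a corrupt file is rejected with an error instead of crashing the system. A hypothetical sketch (toy format, not CrowdStrike's real one):

```python
def parse_channel_file_safely(data: bytes) -> int:
    # Hypothetical defensive parser: every value read from the file is
    # validated before use, so a corrupt or all-zeros file is rejected
    # cleanly rather than causing a crash.
    HEADER_LEN = 4
    if len(data) < HEADER_LEN:
        raise ValueError("file too short to contain a header")
    record_size = int.from_bytes(data[:HEADER_LEN], "little")
    if record_size == 0:
        raise ValueError("record size of zero: corrupt file, refusing to parse")
    if record_size > len(data) - HEADER_LEN:
        raise ValueError("record size exceeds file length")
    return (len(data) - HEADER_LEN) // record_size
```

In kernel code the equivalent is returning an error status and skipping the bad definition file, so Windows keeps booting even when an update is garbage.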


Crowdstrike are not going to survive the legal fallout of this monumental screw up.  No EULA is going to save them from their incompetence.  This outage caused BILLIONS in damage and I won't be surprised if we find out people died because of the failure of various 911 systems around the world, not to mention surgery cancellations and all sorts of other stuff, all because some dipstick didn't QA their update properly. 


On 29/07/2024 at 06:00, FloatingFatMan said:

Crowdstrike are not going to survive the legal fallout of this monumental screw up.  No EULA is going to save them from their incompetence.  This outage caused BILLIONS in damage and I won't be surprised if we find out people died because of the failure of various 911 systems around the world, not to mention surgery cancellations and all sorts of other stuff, all because some dipstick didn't QA their update properly. 

 

eh they'll be fine, they have outs


On 29/07/2024 at 12:41, neufuse said:

eh they'll be fine, they have outs

All they have are their terms of use, and they're not going to stand up against the wealth of international legal hell that's heading their way.


On 29/07/2024 at 04:03, micko68 said:

Not true unless the crash had already happened. Pushing out a script to delete the files using an RMM fixed the issue before it occurred.

Didn't get to do this myself (currently an unemployed bum) but previously placed I worked did.

The outages that occurred from this are because the crash had already happened. Machines that were offline didn’t receive the update. 


On 29/07/2024 at 07:43, FloatingFatMan said:

All they have are their terms of use, and they're not going to stand up against the wealth of international legal hell that's heading their way.

Eh, come back in a year and we will see, but I bet they will be fine; at worst, they merge with someone else.

