The Crowdstrike software update

leethalweapon · Jul 19, 2024

I was sitting at work at a radio station in Melbourne, Australia, when the CrowdStrike crash happened. Our entire network of stations around Australia was affected, some worse than others. How bad was it in America?

awsherrill · Jul 19, 2024

We use LogMeIn to remotely manage various PCs around our plants (southeast US). Apparently their servers were affected, because I was unable to remote in this morning to resolve an issue at one of our stations. Their service seems to be mostly restored now.

We had some other issues with things like FTP services not working right, which were possibly due to the CrowdStrike problems. But nobody got knocked off the air and programming ran more or less normally, at least at our outlets.

leethalweapon · Jul 19, 2024

Radio stations across world disrupted

Kelly A · Jul 19, 2024

Crowdstrike pushed some updates that blocked disk access to Microsoft Azure cloud platforms. It's been fixed.

landtuna · Jul 19, 2024

leethalweapon said:
I was sitting at work at a radio station in Melbourne, Australia, when the CrowdStrike crash happened. Our entire network of stations around Australia was affected, some worse than others. How bad was it in America?

National news tonight reported major disruptions with airlines, hospitals and a host of retailers. They described it as the worst outage in recent history.

Will_H_69_9 · Jul 19, 2024

I know this impacted the local news stations in Dallas this morning, as WFAA, KXAS & KTVT were doing without their graphics because their computers were brought down by Crowdstrike's botched update. I think KDFW was unaffected by the time I checked in at almost 7 AM.

TheBigA · Jul 19, 2024

My flight was cancelled.

leethalweapon · Jul 19, 2024

Microsoft IT outage: Australian airlines, banks and supermarkets begin return to normal operations

From the Guardian

davect · Jul 20, 2024

I work for a TV station. Fortunately, our on-air automation doesn’t touch the internet so that was fine. However, almost everything else was affected. We couldn’t air bugs or ID’s. We couldn’t go on the air with our 4:30am news because the news and production computers weren’t back up. Around 5am news had enough gear running to get something on the air. By 5:30am everything was more or less back to normal. Engineers had been working on it since 1am.

Kelly A · Jul 20, 2024

davect said:
I work for a TV station. Fortunately, our on-air automation doesn’t touch the internet so that was fine. However, almost everything else was affected. We couldn’t air bugs or ID’s. We couldn’t go on the air with our 4:30am news because the news and production computers weren’t back up. Around 5am news had enough gear running to get something on the air. By 5:30am everything was more or less back to normal. Engineers had been working on it since 1am.

That's interesting. Considering what was affected was related to Microsoft cloud services (Azure), how were the local station graphics systems, especially those involving local bug-inserters (Evertz, Chyron, etc.) affected at your station?
From what I understand, most negatively affected were applications that ran in the Azure cloud, or Microsoft Office 365 services that also live in the cloud.

Airlines were affected insofar as some use Azure cloud services for their various databases.

TheBigA · Jul 20, 2024

Kelly A said:
Airlines were affected insofar as some use Azure cloud services for their various databases.

Plus anytime you throw just a little delay into their system, it multiplies geometrically.

Kelly A · Jul 20, 2024

TheBigA said:
Plus anytime you throw just a little delay into their system, it multiplies geometrically.

Exactly. We've seen it before from Southworst and American when something IT hiccups during a high volume of travel. Domino effect.

Mark Roberts · Jul 20, 2024

First of all, anyone running automated updates on a mission-critical system should have their head examined. For radio and TV stations, this means playout systems and the like should be treated the same way utilities treat their control systems or the way banks and brokerages treat their customer-facing systems. Yes, automation is more efficient but, especially with Windows, every environment has its own peculiarities due to the number of hardware and software configurations that are possible. The consequence is that adequate testing requires some time and effort. It costs money, but losing your customer-facing systems (among others) costs more. Otherwise, you're counting on your vendors to perform adequate testing. Did Crowdstrike do that before pushing yesterday's problem-laden code? One suspects they did not. The pressure in so many software development environments is to ship, and ship fast.

It's probably not widely realized that the systems that control power grids, natural gas distribution, etc., are running on Windows. But the management of those systems prioritizes availability. No EDR systems such as Crowdstrike are used, precisely for the reasons we all learned yesterday. To be more precise, EDR mucks about with kernel drivers, a big no-no in operational technology, where specialized vendor software is tightly integrated with the OS. Updates are made on a very slow cycle (6 months in the case of one major utility I worked for), scheduled well in advance, and always with the involvement of the vendors of the control software that sits on top of the OS, including extensive testing and review. So the problem isn't with Windows (mostly) - it's with the management of the systems.

My parenthetical of "mostly" relates to that kernel driver. Microsoft has to cryptographically* sign those drivers before they can be installed. This means that Microsoft also had a role in this debacle. Microsoft lately has come under fire for talking a good game about security without actually backing it up; this will just add to it, for it's apparent that Microsoft did very little testing or even checking of its own. (* = for avoidance of doubt, this has nothing to do with cryptocurrency)

Still, as with fraud prevention, vulnerability detection, etc., the costs of doing solid, careful engineering, risk management, and quality control are outweighed by the revenue to be gained by just pushing things out there and seeing what happens. This is where liability for saddling the technological ecosystem with inadequately engineered and tested software needs strengthening. One thing lawyers are really useful for is getting people to do things that they ought to do but otherwise wouldn't do because of cost...because the lawyers would cost them even more.

Kelly A · Jul 20, 2024

Mark Roberts said:
First of all, anyone running automated updates on a mission-critical system should have their head examined.

That's kind of a general statement that doesn't fit all instances. From what I read in the datacenter newsletters, CrowdStrike regularly pushes security patches in the background to reduce the possibility of being discovered, exploited or circumvented by bad guys. It kind of makes sense, when you think about it.
As it relates to this particular bug, drive access, both hardware and virtual, was blocked specifically on Windows.

Mark Roberts said:
For radio and TV stations, this means playout systems and the like should be treated the same way utilities treat their control systems or the way banks and brokerages treat their customer-facing systems. Yes, automation is more efficient but, especially with Windows, every environment has its own peculiarities due to the number of hardware and software configurations that are possible. The consequence is that adequate testing requires some time and effort. It costs money, but losing your customer-facing systems (among others) costs more. Otherwise, you're counting on your vendors to perform adequate testing. Did Crowdstrike do that before pushing yesterday's problem-laden code? One suspects they did not. The pressure in so many software development environments is to ship, and ship fast.

Yeah, CrowdStrike was the one who publicly and privately had to eat a giant s*it sandwich. Which is a shame, because CS is one of the better cybersecurity organizations.

Mark Roberts said:
It's probably not widely realized that the systems that control power grids, natural gas distribution, etc., are running on Windows. But the management of those systems prioritizes availability. No EDR systems such as Crowdstrike are used, precisely for the reasons we all learned yesterday. To be more precise, EDR mucks about with kernel drivers, a big no-no in operational technology, where specialized vendor software is tightly integrated with the OS. Updates are made on a very slow cycle (6 months in the case of one major utility I worked for), scheduled well in advance, and always with the involvement of the vendors of the control software that sits on top of the OS, including extensive testing and review. So the problem isn't with Windows (mostly) - it's with the management of the systems.

But what about easter eggs or malware that attach to drivers and kernels? Happens all the time.

Mark Roberts said:
My parenthetical of "mostly" relates to that kernel driver. Microsoft has to cryptographically* sign those drivers before they can be installed. This means that Microsoft also had a role in this debacle. Microsoft lately has come under fire for talking a good game about security without actually backing it up; this will just add to it, for it's apparent that Microsoft did very little testing or even checking of its own. (* = for avoidance of doubt, this has nothing to do with cryptocurrency)

I think we've all been burned either by Microsoft security patches or driver signing. I got called to a TV station where the chief had let security patches install overnight without first waiting and checking MS TechNet for anyone having problems. They called me in because the patches deleted the entire AD database for the entire station. Yep, Active Directory defaulted every IP to generic 192.168 IPs. Once I got everything rebuilt, I installed a NAS that backed up AD every 24 hrs in case they did it again.

Mark Roberts said:
Still, as with fraud prevention, vulnerability detection, etc., the costs of doing solid, careful engineering, risk management, and quality control are outweighed by the revenue to be gained by just pushing things out there and seeing what happens. This is where liability for saddling the technological ecosystem with inadequately engineered and tested software needs strengthening.

Yeah I know, blah blah, the outrage. Bottom line is: no matter whether it's Solarwinds, Microsoft, Cisco, or CrowdStrike, code and patches are written by humans on a deadline and installed in a way that is convenient to the customer. At least until something wrong is discovered minutes or hours later.

Mark Roberts said:
One thing lawyers are really useful for is getting people to do things that they ought to do but otherwise wouldn't do because of cost...because the lawyers would cost them even more.

Oh I'll bet that customer lawyers are talking to MS and CS lawyers over this recent incident.

Mark Roberts · Jul 20, 2024

Mark Roberts said:
First of all, anyone running automated updates on a mission-critical system should have their head examined. For radio and TV stations, this means playout systems and the like should be treated the same way utilities treat their control systems or the way banks and brokerages treat their customer-facing systems.

Kelly A said:
That's kind of a general statement that doesn't fit all instances.

Of course it doesn't fit all circumstances. What's required is an assessment of whether the "risks" (really, vulnerabilities) that a third-party platform you have no control over aims to prevent or remediate are greater than those caused by the functioning - or, in this case, the malfunctioning - of that platform. Based on what we saw yesterday, I think some organizations actually did this. For example, Amazon sent me a message saying some deliveries might be a day late due to the Crowdstrike outage, yet I was still able to place a couple of orders yesterday. (I guess they're not using AWS for everything!) The issue comes about in smaller organizations, and/or ones that are less diligent about considering the risks of third-party services, who then look for a solution that can be applied across the board.

Kelly A said:
From what I read in the datacenter newsletters, CrowdStrike regularly pushes security patches in the background to reduce the possibility of being discovered, exploited or circumvented by bad guys. It kind of makes sense, when you think about it.

There's a saying: "security through obscurity" which has the implication that obscurity isn't enough to protect you. Sure, you don't want to tell more than you have to, but for an enterprise that has systems where availability needs to be the top priority, that balance is going to be different. One size doesn't fit all, despite some security pros' (or CISO or CIO) efforts to try to do that. Tools aren't the only solution. They can help, but it's an issue of managing the environment.

Kelly A said:
As it relates to this particular bug, drive access, both hardware and virtual, was blocked specifically on Windows.

Right, it was a kernel driver. The fix, as it stands now, is to boot into safe mode and then manually delete the offending driver. A "devmod remove" probably isn't enough, since the behavior with the file present is to get stuck in a loop, where you can't even get to the command prompt. In any event, I feel for the technical staff who are going to burn up their overnights and weekends for the foreseeable future. I don't think they get paid enough to put up with this nonsense.
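
For anyone facing a stack of machines, the delete step above can be sketched roughly as follows. This is only an illustration, not a vetted recovery tool: the directory and file pattern are assumptions taken from CrowdStrike's public remediation guidance rather than from this thread, the function name `remove_matching_files` is made up for the example, and it presumes you have already booted into safe mode or a recovery environment where a Python interpreter happens to be available. It defaults to a dry run so nothing is deleted until you ask.

```python
from pathlib import Path
from typing import List

# Assumptions from CrowdStrike's public guidance, not from this thread:
# the offending file lives in this directory and matches this pattern.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_matching_files(directory: Path,
                          pattern: str = PATTERN,
                          dry_run: bool = True) -> List[Path]:
    """Return the files in `directory` that match `pattern`.

    With dry_run=True (the default) nothing is deleted, so the list
    can be reviewed first. With dry_run=False the matches are removed.
    """
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()  # delete only the files that matched
    return matches
```

A cautious run would call `remove_matching_files(DEFAULT_DIR)` first to see what matches, then repeat with `dry_run=False` and reboot normally.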

Kelly A said:
Yeah, CrowdStrike was the one who publicly and privately had to eat a giant s*it sandwich. Which is a shame, because CS is one of the better cybersecurity organizations.

They were. Then again, I wouldn't want to overstate the long-term outcomes when reputational risk becomes real. Risk managers overstate that (I know I did once upon a time) but the reality is that the stink of the sandwich en merde wears off fairly quickly and top management will go back to watching the stock price closely.

Kelly A said:
But what about easter eggs or malware that attach to drivers and kernels? Happens all the time.

In the operational technology world (e.g. Schneider Electric, Rockwell) there are significant interfaces with physical devices (sensors, PLCs, and so on). The OS and control-software versions are tightly controlled. They're also running in environments that, at most, will have heavily mediated network access* and with direct external access only to the vendor. Yes, there is trust of a third party just as there is with Crowdstrike, but the incentives are different. In the utility field, you simply do not screw up, or you're dead. General IT is more forgiving. Maybe it shouldn't be.

(* = though I have seen some pretty terrible Citrix implementations, complete with allowing network-drive mounts to a supposedly protected network. Yikes!)

Kelly A said:
Yeah I know, blah blah, the outrage. Bottom line is: no matter whether it's Solarwinds, Microsoft, Cisco, or CrowdStrike, code and patches are written by humans on a deadline and installed in a way that is convenient to the customer. At least until something wrong is discovered minutes or hours later.

Oh I'll bet that customer lawyers are talking to MS and CS lawyers over this recent incident.

In some cases, such conversations will be at the CEO level.

Can't wait for the congressional kabuki, um, hearings.

Kelly A · Jul 20, 2024

Mark Roberts said:
Can't wait for the congressional kabuki, um, hearings.

Much of that will depend on how many Congresspersons or their staff missed their flights. And of course, what human being can we publicly burn at the stake? And this of course won't be the last time some public-facing technology issue will happen. Chances are a much larger incident will happen someday that will actually cripple the entire public Internet for longer than anyone would want.
The question from some politician in a hearing will be the same: 'How can we make sure something like this never happens again?'
Yeah, right.

TheBigA · Jul 20, 2024

Kelly A said:
Much of that will depend on how many Congresspersons or their staff missed their flights.

It was the Friday after the RNC, and there were lots of repubs in Milwaukee. Haven't seen any stories about the situation there.

They all love to attack big tech, but there really isn't a famous face they can blame this on.

Personally I have spent a layover day in Milwaukee and it can be a really nice place.

Kelly A · Jul 20, 2024

TheBigA said:
It was the Friday after the RNC, and there were lots of repubs in Milwaukee. Haven't seen any stories about the situation there.

FlexJet and NetJets were very busy the past couple weeks.

TheBigA said:
They all love to attack big tech, but there really isn't a famous face they can blame this on.

They aren't attacking tech now that Elon and others have not only announced their endorsements of Trump but are also donating a ton.

TheBigA said:
Personally I have spent a layover day in Milwaukee and it can be a really nice place.

I can think of a lot of airports worse than Mitchell. There are likely a lot of available nice hotel rooms at a decent rate now that the RNC is over.

Mark Roberts · Jul 20, 2024

Kelly A said:
The question from some politician in a hearing will be the same: 'How can we make sure something like this never happens again?'

Well, you don't. But try talking probability or risk to most people. The language risk-management geeks use doesn't help, but look how long it took humankind to develop any sort of concept of probability (the 17th century).

Congressional hearings are rarely very useful for finding out anything that isn't already known.

boombox4 · Jul 21, 2024

A question for you tech-savvy guys: Does this incident say anything about the movement of everything -- or the apparent push to move everything -- to "the Cloud"? Do station clusters still have their own, local data storage, or is that increasingly on the "cloud", too?