Change Management: Lessons from the CrowdStrike Outage

Last week, businesses worldwide were significantly impacted by a major outage caused by a CrowdStrike update, leading to widespread disruptions and the notorious Blue Screen of Death (BSOD) on Windows computers. This event highlighted the intricate dependencies between software updates, cybersecurity measures, and downstream users. While CrowdStrike and Microsoft have shouldered much of the blame, the incident underscores a critical lesson for companies: the importance of effective change management and the need to adopt a cautious approach to non-emergency patching.

The Outage: A Perfect Storm

The incident began when an update to CrowdStrike’s Falcon Sensor, a widely deployed endpoint security tool, caused Windows systems around the world to crash. From a technical standpoint, this occurred because the Falcon Sensor relies on a WHQL-certified driver that runs in the Windows kernel (Ring 0), where an unhandled fault takes down the entire operating system rather than a single application. The problematic update, a content file interpreted by that driver, contained a defect that CrowdStrike’s error-catching mechanism failed to handle, causing the BSOD. Because the driver is registered as boot-required, affected machines crashed again on every restart, so the most practical fix was to boot each system into Safe Mode and delete the offending update file. This posed a significant problem for remotely managed physical Windows machines, which needed either user intervention or on-site assistance. Even virtual machines required a technician to touch each instance. CrowdStrike acknowledged the issue and promptly rolled back the update. However, the damage had already been done, affecting numerous industries, from airlines to healthcare services. Microsoft also recognized the problem, with CEO Satya Nadella assuring customers of ongoing efforts to resolve the issue.

The Blame Game: A Complex Web of Responsibility

While CrowdStrike’s update was the immediate cause of the disruption, attributing blame solely to them overlooks the broader issue of change management. In a zero-trust environment, vendors must be scrutinized too. Microsoft patches have historically caused issues of their own, which is why a long-standing best practice is to delay non-emergency updates for at least a week and watch for problems reported by other organizations. Responsibility here is shared: CrowdStrike should not have released an inadequately tested update, Microsoft should not have accepted the update without a testing period on machines they manage, and organizations should have implemented controlled rollouts and delayed updates.

Change Management: A Critical Necessity

Effective change management is essential in preventing widespread disruptions. Companies must ensure that updates, especially those involving critical systems, undergo rigorous testing before being rolled out. This involves a few key strategies:

Staggered Rollouts: Deploying updates to a small group of systems first, then widening the rollout in stages, helps identify potential issues before they affect the entire network.
Comprehensive Testing: Rigorous testing protocols, including simulations of real-world scenarios, can help uncover problems that might not be apparent in a controlled environment.
Clear Communication: Open lines of communication between software providers, cybersecurity firms, and end-users ensure that everyone is aware of potential issues and how to address them quickly.
Robust Business Continuity and Disaster Recovery Plan: A well-defined BCP and disaster recovery plan offers a playbook for when an emergency strikes.
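
To make the staggered rollout idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how CrowdStrike, Microsoft, or any particular patch management product works. The names `Ring`, `RolloutResult`, and `staged_rollout` are hypothetical, and the `deploy` and `is_healthy` callbacks stand in for whatever deployment tooling and health checks an organization already has.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Ring:
    """A hypothetical deployment ring: a named group of hosts updated together."""
    name: str
    hosts: Sequence[str]


@dataclass
class RolloutResult:
    completed: List[str] = field(default_factory=list)  # rings that passed the health gate
    halted_at: Optional[str] = None                      # ring that tripped the gate, if any


def staged_rollout(
    rings: Sequence[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> RolloutResult:
    """Deploy ring by ring and stop before the next ring if too many hosts fail.

    `deploy` and `is_healthy` are assumed to be supplied by the caller.
    """
    result = RolloutResult()
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            deploy(host)
            if not is_healthy(host):
                failures += 1
        # An empty ring cannot fail, so treat its failure rate as zero.
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > max_failure_rate:
            result.halted_at = ring.name
            break  # later, larger rings never receive the update
        result.completed.append(ring.name)
    return result
```

Ordering the rings from smallest and least critical (a canary group) to largest means a bad update is caught while the blast radius is still small. A production version would likely also wait for a soak period between rings rather than checking health immediately after deployment.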

A Cautious Approach to Patching

One of the most crucial takeaways from the CrowdStrike and Microsoft outage is the need for companies to delay non-emergency patching. Here’s why:

Time for Testing: A delay of at least a week provides ample time to identify and resolve any issues that might arise from the update. This is especially important for patches that affect critical systems.
Risk Mitigation: By holding off on immediate patching, companies can observe the experiences of early adopters and make informed decisions based on their feedback.
Emergency Preparedness: In the case of emergency patches, having a well-defined protocol for rapid deployment ensures that critical updates are applied swiftly without compromising system integrity.
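
The patching policy described above can be expressed as a small rule. The following Python sketch assumes the one-week minimum delay suggested in this article. The function names and the `emergency` flag are hypothetical, and deciding what counts as an emergency patch remains a judgment for each organization.

```python
from datetime import date, timedelta

# Assumption: the one-week minimum observation window suggested in this article.
NON_EMERGENCY_DELAY = timedelta(days=7)


def earliest_deploy_date(
    released: date,
    emergency: bool,
    delay: timedelta = NON_EMERGENCY_DELAY,
) -> date:
    """Return the first day a patch may be deployed under the delay policy."""
    # Emergency patches skip the observation window entirely.
    return released if emergency else released + delay


def is_deployable(
    released: date,
    emergency: bool,
    today: date,
    delay: timedelta = NON_EMERGENCY_DELAY,
) -> bool:
    """True once the observation window has passed, or at once for emergencies."""
    return today >= earliest_deploy_date(released, emergency, delay)
```

For example, a routine patch released on a Friday would be held until the following Friday, while the same patch flagged as an emergency would be eligible on its release day. Keeping the delay as a parameter lets critical systems use a longer window than ordinary workstations.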

Takeaway

The CrowdStrike and Microsoft outage serves as a stark reminder of the complexities involved in change management and the ripple effects of software updates. While it’s easy to point fingers, the incident highlights the collective responsibility of software providers, cybersecurity firms, and end-users. By adopting a cautious approach to non-emergency patching and implementing robust change management practices, companies can mitigate risks and ensure smoother transitions during updates. In an interconnected world, such proactive measures are not just advisable; they are essential.

How 365 IT Support Can Help

365 IT Support specializes in IT management, cybersecurity, security posture development, risk mitigation, and change management. Our team can help your company navigate the complexities of software updates and cybersecurity measures through:

Proactive IT Management: We ensure your IT infrastructure is robust and resilient, minimizing the risk of outages.
Enhanced Security Posture: Our comprehensive security solutions protect your systems from emerging threats.
Risk Mitigation Strategies: We implement best practices to identify and mitigate potential risks before they impact your operations.
Effective Change Management: Our structured approach to change management ensures smooth transitions during updates, with minimal disruption to your business.

By partnering with 365 IT Support, you can ensure your company is prepared for any IT challenges, keeping your operations running smoothly and securely.