What Microsoft’s Blue Screen of Death Teaches Us: Reliability through Test Automation
July 24, 2024
Our dependence on technology became evident on July 19, 2024, when around 70% of the Fortune 100 companies, along with major airlines, banks, healthcare providers, media outlets, and emergency services, were staring at the Blue Screen of Death (BSOD). This software outage is the largest in history; the closest precedent was the WannaCry ransomware attack of May 12, 2017, which impacted more than 200,000 computers across 150 countries.
Microsoft said that the outage was caused by an update from CrowdStrike, a well-known cybersecurity company. The scale makes this CrowdStrike outage historic: roughly 8.5 million laptops and workstations crashed at the same time, and the global impact was huge. Airlines issued handwritten boarding passes to keep passengers moving, and 911 emergency services were badly affected.
Can you imagine this happening in today’s world of generative AI, NLP, and intelligent AI agents? Yet such things do happen in the software industry, and the wiser response is to learn from the adversity rather than dwell on the failure itself.
In this article, we will discuss why this outage occurred and what we can learn from it to prevent such events from happening again.
What is BSOD?
BSOD stands for “Blue Screen of Death.” It is an error screen displayed on Windows computers following a fatal system error, also known as a system crash. This error indicates that the operating system has encountered a critical issue that it cannot recover from without rebooting.
Common Causes of BSOD
- Hardware Problems: Issues with computer hardware, such as failing RAM, hard drives, or overheating.
- Driver Issues: Corrupt or incompatible drivers can lead to BSOD.
- Software Conflicts: Incompatible software or system files can cause crashes.
- Malware or Virus: Malicious software can cause severe system errors.
- System File Corruption: Corrupted system files or registry issues can lead to BSOD.
Why Did the CrowdStrike Outage Happen?
CrowdStrike is a top cybersecurity company that helps protect businesses from online threats, using advanced technology to detect and stop cyberattacks in real time. Notably, in January 2024 it was recognized as the only Customers’ Choice in the 2024 Gartner Peer Insights™ Voice of the Customer for Vulnerability Assessment.
CrowdStrike’s product Falcon is an “Endpoint Detection and Response” (EDR) platform that monitors the endpoints (computers, PCs, tablets, etc.) on which it is installed for signs of a threat, such as malware or a cyberattack. When it detects malicious activity or software, it helps lock down the threat. Because it is endpoint security software, it needs to run with low-level access to protect the computer.
For instance, if Falcon identifies that a monitored computer is communicating with a potential attacker, it can terminate that communication immediately. To do this, Falcon is deeply integrated with the core software of the computers on which it operates. So when the faulty update arrived, instead of just crashing itself, Falcon locked up and crashed the whole computer, i.e., Microsoft Windows, producing the Blue Screen error (also known as a BSoD, black screen error, or STOP code error). This error appears when a serious problem forces Windows to shut down or restart unexpectedly; the screen might say, “Windows has been shut down to prevent damage to your computer”, or a similar message.
In addition, cybersecurity software generally receives frequent updates so that it stays current and can identify new vulnerabilities. Vendors therefore keep robust, powerful quality assurance processes in place, because the impact of a bad update can be overwhelming, and they test a huge number of software combinations to avoid crashes.
However, as we can see, CrowdStrike’s quality assurance process failed in this particular incident: a single channel configuration file update crashed every machine it reached.
What Happened to the Falcon Sensor?
CrowdStrike uses a small piece of software called a sensor, installed on computers, which collects data and looks for signs of attacks. On 19th July 2024, CrowdStrike released a sensor configuration update to Windows systems. These configuration files are known as Channel Files and reside in C:\Windows\System32\drivers\CrowdStrike\.
Updates to channel files happen several times a day and are an ongoing part of the protection mechanisms of the Falcon platform. This time, a faulty update containing a logic error slipped past a bug in CrowdStrike’s Content Validator (the company’s own internal content-testing software) and caused Windows to crash.
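CrowdStrike’s actual Content Validator is internal and its checks are not public, but the general idea can be illustrated. The minimal sketch below assumes a made-up channel-file format (comma-separated records with a fixed field count and a size bound); the file name channel_validator.py, the schema constants, and the validate_channel_file function are all hypothetical, not CrowdStrike’s code.

```python
# channel_validator.py -- hypothetical sketch of a content-validation step for configuration updates.
# The record format, field count, and size bound are illustrative assumptions, not CrowdStrike's schema.

EXPECTED_FIELD_COUNT = 5      # assumed: every record carries exactly 5 comma-separated fields
MAX_FIELD_LENGTH = 256        # assumed: upper bound on any single field


def validate_channel_file(lines: list[str]) -> list[str]:
    """Return a list of validation errors; an empty list means the content is safe to ship."""
    errors = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(",")
        if len(fields) != EXPECTED_FIELD_COUNT:
            errors.append(f"line {line_no}: expected {EXPECTED_FIELD_COUNT} fields, got {len(fields)}")
            continue
        if any(len(field) > MAX_FIELD_LENGTH for field in fields):
            errors.append(f"line {line_no}: field exceeds {MAX_FIELD_LENGTH} characters")
    return errors


if __name__ == "__main__":
    sample = ["rule1,proc,match,alert,high", "rule2,proc,match,alert"]  # second record is malformed
    problems = validate_channel_file(sample)
    if problems:
        raise SystemExit("Blocking release:\n" + "\n".join(problems))
```

The point is not the specific checks but that malformed content gets rejected before it ever leaves the build system.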
CrowdStrike CEO George Kurtz mentioned, “CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated, and a fix has been deployed.”
What are the Workarounds?
Workarounds for the crash are available, but they require manual intervention: someone has to type commands at the keyboard and resolve the issue on each machine, which makes recovery time-consuming. The workaround involves rebooting Windows devices into Safe Mode or the Windows Recovery Environment and manually removing the problematic file. CrowdStrike has provided manual remediation steps for individual hosts and for public cloud environments, including AWS and Azure, which can be used to bring machines back into working order. However, this process is particularly challenging for virtual machines such as Windows 365 Cloud PCs, which require additional steps due to their virtual nature.
Microsoft has also released a custom WinPE recovery tool to find and remove the faulty CrowdStrike update.
Role of Test Automation: What are the Learnings?
This incident was not a cyberattack. Even so, it is a huge event because of its impact on the ‘Availability’ leg of the CIA (Confidentiality, Integrity, Availability) triad. So it is time to think about the lessons we should all learn from it, because everyone’s life was touched by this incident in some way or another.
We can clearly see that there was a testing lapse on CrowdStrike’s side: the configuration file update was not properly tested, yet it was deployed to all machines at once (an incremental rollout would have saved the day!).
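To make the incremental-rollout point concrete, here is a minimal sketch of a ring-based (canary) rollout. The ring names, percentages, and the gate_passed() health check are assumptions for illustration, not a description of CrowdStrike’s actual release process.

```python
# staged_rollout.py -- illustrative ring-based (canary) rollout instead of pushing to every host at once.
# Ring sizes and the gate_passed() check are assumptions for this sketch.
import time

ROLLOUT_RINGS = [
    ("internal fleet", 0.001),    # dogfood on the vendor's own machines first
    ("canary customers", 0.01),   # roughly 1% of hosts
    ("early adopters", 0.10),
    ("general availability", 1.0),
]


def deploy_to_fraction(update: str, fraction: float) -> None:
    print(f"pushing {update} to {fraction:.1%} of hosts")


def gate_passed() -> bool:
    # Placeholder: in practice, check crash telemetry, boot loops, support tickets, etc.
    return True


def rollout(update: str, soak_s: int = 0) -> bool:
    for ring_name, fraction in ROLLOUT_RINGS:
        deploy_to_fraction(update, fraction)
        time.sleep(soak_s)            # let the ring soak before widening the blast radius
        if not gate_passed():
            print(f"halting rollout of {update} at ring '{ring_name}'")
            return False
    return True


if __name__ == "__main__":
    rollout("example-content-update", soak_s=0)
```

Had the bad update been stopped at the first ring, the blast radius would have been a fraction of a percent of hosts rather than the whole fleet.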
When the potential impact is of this magnitude, the following testing practices are a must:
1. Utilizing CI/CD Pipelines:
- Automated Testing in CI/CD: Integrating automated testing into CI/CD pipelines ensures that every code change, including configuration updates, is tested rigorously before deployment. This includes unit tests, integration tests, and system tests to validate that changes do not introduce errors (a minimal sketch follows this item). Here are the top 7 CI/CD tools for 2024.
- Early Detection: Automated tests can detect issues early in the development cycle, reducing the risk of deploying faulty updates. For instance, a comprehensive set of automated tests covering various scenarios and edge cases could have caught the logic error that led to the BSOD. Here is an article about the impact of late bug detection, and one example the whole world has witnessed.
Here are the Top 7 Automation Testing Tools to Consider.
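As one illustration, a CI pipeline can be made to fail the build whenever a content change does not pass its tests. The sketch below assumes the hypothetical channel_validator.py from earlier; a pipeline stage that runs pytest and treats a non-zero exit code as a failed build would block deployment of a change like the one that caused the BSOD.

```python
# test_channel_validator.py -- run by the CI pipeline on every change (e.g., `pytest -q`).
# Assumes the hypothetical channel_validator.py sketch shown earlier in this article.
from channel_validator import validate_channel_file


def test_well_formed_file_passes():
    lines = ["rule1,proc,match,alert,high", "rule2,net,match,block,low"]
    assert validate_channel_file(lines) == []


def test_missing_field_is_rejected():
    lines = ["rule1,proc,match,alert"]  # only 4 fields instead of the expected 5
    errors = validate_channel_file(lines)
    assert errors and "expected 5 fields" in errors[0]


def test_oversized_field_is_rejected():
    lines = ["rule1,proc," + "x" * 300 + ",alert,high"]  # one field far beyond the size bound
    assert validate_channel_file(lines)
```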
2. Mandatory Regression Testing:
- Automated Regression Tests: Implement automated regression tests to ensure that new updates do not break existing functionality. These tests run a suite of previously validated tests against the new code to ensure stability (see the sketch after this item).
- Preventing Recurrence: By continuously running these tests, any regression in functionality, such as the one caused by the problematic Channel File 291, would have been identified and rectified before release.
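One common way to automate this is golden-file regression testing: archive content that is known to be good (and content that once caused incidents) and assert that every new build still classifies it the same way. The directory layout and file extension below are assumptions for illustration, building on the hypothetical validator sketch above.

```python
# test_regression.py -- golden-file regression suite; the directory layout is hypothetical.
import pathlib

import pytest

from channel_validator import validate_channel_file

KNOWN_GOOD = pathlib.Path("regression/known_good")   # previously shipped, healthy updates
KNOWN_BAD = pathlib.Path("regression/known_bad")     # updates that caused incidents in the past


@pytest.mark.parametrize("path", sorted(KNOWN_GOOD.glob("*.cfg")))
def test_known_good_content_still_passes(path):
    assert validate_channel_file(path.read_text().splitlines()) == []


@pytest.mark.parametrize("path", sorted(KNOWN_BAD.glob("*.cfg")))
def test_known_bad_content_is_still_rejected(path):
    assert validate_channel_file(path.read_text().splitlines())
```

Once a file like the problematic Channel File 291 is added to the known-bad corpus, a regression of that kind can never ship silently again.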
3. Simulated Real-World Scenarios:
- Test in Diverse Environments: Automated testing can simulate a wide range of real-world scenarios, including different hardware configurations, operating systems, and network conditions. This comprehensive testing environment can identify potential issues that might not be evident in a limited manual testing setup. Know more about Managing Your Test Environment.
- Stress and Load Testing: Automated tools can simulate heavy loads and stress conditions to ensure the software remains stable under extreme conditions. This type of testing could have helped reveal how the update behaves under various system loads and conditions (a fuzz-style stress sketch follows this item).
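A lab of real and virtual machines is needed for a true environment matrix, but even a simple fuzz-style stress test adds coverage: hammer the (hypothetical) validator with thousands of randomized, partly malformed records in parallel and require that it never crashes.

```python
# stress_test.py -- minimal fuzz/stress sketch against the hypothetical validator.
import random
import string
from concurrent.futures import ThreadPoolExecutor

from channel_validator import validate_channel_file


def random_record() -> str:
    # Deliberately mix in malformed records: wrong field counts and oversized fields.
    field_count = random.choice([3, 4, 5, 6])
    fields = ["".join(random.choices(string.ascii_lowercase, k=random.randint(1, 400)))
              for _ in range(field_count)]
    return ",".join(fields)


def worker(_: int) -> None:
    lines = [random_record() for _ in range(100)]
    validate_channel_file(lines)   # may report errors, but must never raise or hang


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(worker, range(1_000)))
    print("validator survived 100,000 randomized records")
```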
4. Automated Rollback Mechanisms:
- Instant Rollback: Before you deploy changes, have a rollback plan ready. Automated deployment processes can include version control and automated rollback mechanisms that revert to the previous stable state if an update fails, as in the sketch below. This minimizes downtime and prevents widespread issues like the BSOD from affecting a large number of systems. Read How to Do Version Controlling in Test Automation.
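A minimal sketch of a deploy-then-verify-then-revert loop; the deploy(), health_check(), and rollback() bodies are placeholders for whatever deployment and telemetry tooling is actually in use.

```python
# deploy_with_rollback.py -- illustrative deploy/verify/rollback flow with placeholder tooling.
import time


def deploy(version: str) -> None:
    print(f"deploying {version} to the target group")   # placeholder for real deployment tooling


def health_check() -> bool:
    return True   # placeholder: query crash telemetry, boot success, service status, etc.


def rollback(version: str) -> None:
    print(f"rolling back to last known-good version {version}")


def release(new_version: str, last_good: str, checks: int = 5, interval_s: int = 60) -> bool:
    deploy(new_version)
    for _ in range(checks):
        time.sleep(interval_s)           # wait, then re-inspect the fleet
        if not health_check():
            rollback(last_good)          # automatic revert, no human in the loop
            return False
    return True


if __name__ == "__main__":
    ok = release("1.0.1", "1.0.0", checks=3, interval_s=1)
    print("release kept" if ok else "release reverted")
```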
5. Automated Monitoring and Alerts:
- Real-time Monitoring: Automated monitoring tools can detect abnormal behaviors in real-time and alert the development team to potential issues. These tools can identify problematic updates before they reach a critical point, allowing for quick intervention.
It is critical to monitor changes from the infrastructure layer up to the application layer and to catch issues before they reach the end user. Advanced monitoring tools, anomaly detection systems, and proactive alerting mechanisms can help here (a minimal monitoring sketch follows this item). Read Understanding Test Monitoring and Test Control.
- Log Analysis and Diagnostics: Automated log analysis can help diagnose issues more effectively, providing insights into the root cause of problems like the one experienced with the BSOD update. Here is a test log tutorial.
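As a sketch, a rollout monitor might watch the crash rate reported back from updated hosts and page the team (or halt the rollout) the moment it crosses a threshold. The fetch_crash_reports() and send_alert() functions below are placeholders for real telemetry and incident-paging systems.

```python
# rollout_monitor.py -- illustrative crash-rate monitor with placeholder telemetry and paging.
import time

CRASH_RATE_THRESHOLD = 0.01   # assumed: alert if more than 1% of updated hosts crash


def fetch_crash_reports(window_s: int) -> tuple[int, int]:
    """Return (crashed_hosts, updated_hosts) observed in the last `window_s` seconds."""
    return (0, 1_000)   # placeholder data


def send_alert(message: str) -> None:
    print(f"ALERT: {message}")   # placeholder for paging / incident tooling


def monitor_rollout(poll_s: int = 60, max_polls: int = 30) -> bool:
    for _ in range(max_polls):
        crashed, updated = fetch_crash_reports(window_s=poll_s)
        if updated and crashed / updated > CRASH_RATE_THRESHOLD:
            send_alert(f"crash rate {crashed}/{updated} exceeds threshold; halt the rollout")
            return False
        time.sleep(poll_s)
    return True


if __name__ == "__main__":
    print("rollout healthy" if monitor_rollout(poll_s=1, max_polls=3) else "rollout halted")
```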
6. Release/Change Management:
- Test and Release: All application changes should be documented, tested, and approved before they are released. This makes it easy to identify which change caused an issue, what the change was, and to roll back faster to regain stability (see the release-gate sketch after this item). Learn these Continuous Integration and Testing: Best Practices.
- Change Management Processes: Establish a clear and well-defined change management process: before applying any software update, conduct a thorough impact assessment and confirm that system stability will be maintained.
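Even the process side can be enforced in code as a simple release gate: a change does not ship unless it carries documentation, test evidence, an approval, and a rollback plan. The ChangeRecord fields below are illustrative, not a prescribed standard.

```python
# release_gate.py -- illustrative gate that blocks a release lacking required change metadata.
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    change_id: str
    description: str
    test_report_url: str      # evidence the change passed the automated suites
    approved_by: str          # sign-off from the change-management process
    rollback_plan: str        # how to revert if the release misbehaves


def release_allowed(change: ChangeRecord) -> bool:
    required = [change.description, change.test_report_url, change.approved_by, change.rollback_plan]
    return all(field.strip() for field in required)


if __name__ == "__main__":
    change = ChangeRecord("CHG-1042", "Update detection content", "", "", "")
    if not release_allowed(change):
        raise SystemExit(f"{change.change_id} blocked: missing test evidence, approval, or rollback plan")
```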
7. Thoughtful Vendor Engagement:
- Third-party Vendors: Any fixes provided by vendors must be thoroughly tested in a controlled environment to ensure they do not introduce new issues or vulnerabilities. The operating system vendor should also consider how much low-level access third-party software should have, and even where such access is granted, the OS needs to be able to roll back or reject a faulty third-party deployment.
- Cloud Service Providers: Check the security, compliance, and reliability of cloud service providers thoroughly. Look at their track record, certifications, and independent audits before deciding. Here is a detailed Cloud Testing Guide.
Conclusion
This BSOD incident truly highlights the importance of robust software testing measures and the huge impact of technology in our daily lives. As Microsoft cybersecurity executive David Weston said, “It’s also a reminder of how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist.”
Cyber-security experts and agencies worldwide have issued warnings about an increase in opportunistic hacking attempts linked to this IT disruption. The incident also makes clear how interconnected global cloud providers, software platforms, security software vendors, third-party vendors, and customers have become.
This incident reminds us to maintain robust testing processes and governance practices with clear guidelines, protocols, and accountability frameworks. The bottom line is that organizations need better disaster recovery plans, compliance adherence, and risk management to maintain business continuity.