AI Cybersecurity Exploits: The Security Flaw Traditional Testing Failed to Catch

We often assume that a strong password and 2FA are enough to protect our accounts. After all, who would think it’s possible for someone to hack an account protected by both?
But somewhere in the background, an AI model is analyzing the source code of the admin tool our office team uses every day. This isn’t a human hacker or a foreign agent poring over the code. It’s a program that understands code logic, much like a senior developer reviewing a pull request. The program finds a small loophole that ordinary scanners miss: a mistake introduced when the code was written, which could allow an attacker to bypass two-factor authentication.
A recent Google Threat Intelligence report noted that a criminal group used AI to discover a zero-day vulnerability and attempted to compromise a system. It is the first reported case of its kind. Their target was a popular open-source web admin tool. Although the group planned to use the flaw in a major attack campaign, Google discovered it before the attack could begin and notified the vendor, and the bug was fixed.
What if, next time, no one catches it before the attack begins? That is a question we need to take seriously.
The Incident in Detail
The incident was widely reported in the media, so it is worth understanding the exact mechanics behind it.
A cybercriminal group fed the source code and documentation of a popular open-source admin tool to a large language model (LLM), effectively building an LLM-driven vulnerability discovery pipeline. The AI found a small flaw in the application’s authentication logic: if certain requests arrived in a specific order, the system would trust them without enforcing 2FA. It was a semantic logic bug of the kind that conventional security scanners do not detect. The code worked exactly as the developers had written it, but the AI found an edge case they hadn’t considered.
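To make the mechanism concrete, here is a minimal Python sketch of how an order-dependent 2FA bypass can look. It is not the code of the affected tool, whose internals are not described here. The `Session` class, the "lost device" recovery flow, and every function name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical per-visitor session state for an admin tool."""
    password_ok: bool = False
    second_factor_ok: bool = False
    recovery_started: bool = False


def start_recovery(session: Session) -> None:
    # Hypothetical "lost my 2FA device" page: only honored before login.
    if not session.password_ok:
        session.recovery_started = True


def login(session: Session, password_matches: bool) -> None:
    # Assumed flaw, part 1: login never clears an earlier recovery flag.
    session.password_ok = password_matches


def verify_second_factor(session: Session, code_matches: bool) -> None:
    session.second_factor_ok = session.password_ok and code_matches


def can_open_admin_panel(session: Session) -> bool:
    # Assumed flaw, part 2: the recovery flag is trusted as if the
    # second factor had already been checked.
    return session.password_ok and (
        session.second_factor_ok or session.recovery_started
    )


def good_login(session: Session) -> None:
    login(session, True)


def good_code(session: Session) -> None:
    verify_second_factor(session, True)


def run(steps) -> bool:
    """Apply the steps in order to a fresh session and report admin access."""
    session = Session()
    for step in steps:
        step(session)
    return can_open_admin_panel(session)


print(run([good_login, good_code]))       # True: the intended flow
print(run([good_login]))                  # False: 2FA is enforced
print(run([good_login, start_recovery]))  # False: recovery ignored after login
print(run([start_recovery, good_login]))  # True: 2FA bypassed
```

In this sketch the intended flow and the obvious negative check both behave correctly, which is why single-step tests would pass. Only the sequence that starts recovery before logging in reaches the admin panel without a second factor.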
Next, they asked the AI to write a script that exploited the loophole. Google’s Threat Intelligence Group (GTIG) assessed with high confidence that the script was generated by an AI model rather than written by a human. The giveaway was not poor quality but the comments it contained. LLMs often add extensive explanatory comments to make code easier for learners to follow, and this script was full of them. It also included a hallucinated CVSS score for a non-existent CVE identifier. The overall structure of the code also closely resembled what LLMs typically produce.
Human attackers rarely leave that kind of commentary in their code, and this script read like a tutorial. In any case, before the group could use it, Google informed the vendor and the security flaw was fixed.
How This Is Different From Previous AI Security Concerns
AI-assisted attacks are nothing new. Groups like North Korea’s APT45 have been active for a long time, and they use AI to recursively analyze a wide range of vulnerabilities and validate proof-of-concept exploits. Chinese actors have shown significant interest in using AI to discover vulnerabilities, and Russian groups have used AI to create fake code that helps them avoid detection by scanners.
But in those cases, AI was used to speed up tasks that humans would normally do.
This time was different. The attackers leveraged a zero-day vulnerability, that is, one that no defender knew about until the attackers found it. The AI didn’t just help exploit an existing bug. It identified a loophole that conventional tools could not detect, which shifts the discussion toward real-world AI cybersecurity exploits.
Ryan Dewhurst, Head of Threat Intelligence at watchTowr, put it this way:
“AI is already accelerating vulnerability discovery, reducing the effort needed to identify, validate, and weaponize flaws. This is today’s reality: discovery, weaponization, and exploitation are faster. We’re not heading toward compressed timelines. We’ve been watching the timelines compress for years. There is no mercy from attackers, and defenders don’t get to opt out.”
The Flaw That Testing Should Have Caught
This is the part of the open-source admin tool incident that QA teams should pay the most attention to.
The 2FA bypass is not a buffer overflow or a common SQL injection, and it is not the type of vulnerability that a fuzzer or SAST tool typically detects. Google’s GTIG described it as “a logic error that appears functionally correct but remains strategically broken from a security perspective.” It is, however, something that QA teams could potentially have identified.
Logic bugs like this occur when the code does what it was written to do, but not what security requires. Because the code looks correct at first glance, a regular code review is unlikely to catch the problem. The main flow works correctly, and the unit tests pass. Such bugs only surface when requests are tested in a specific order that the developers didn’t anticipate.
SQL injection and buffer overflow vulnerabilities can usually be detected, at least to some extent, through static analysis or functional tests. Logic bugs and hardcoded trust assumptions, however, are not typically covered by such scanners or functional tests.
Therefore, the most important approaches here are adversarial testing, boundary condition testing, and authentication path regression testing. These are the methods most likely to reveal such logic bugs. However, many teams skip these types of tests because of time pressure and project deadlines.
What Every QA Team Should Learn from This AI-Discovered Security Flaw
Software teams are moving faster than ever, and AI helps them release products more quickly. That is good, but it carries a big risk: logic-level bugs in the code. At the same time, the window between a release and the discovery of its vulnerabilities by attackers has become significantly shorter, which leaves the door open to mass exploitation campaigns.
According to the 2026 CrowdStrike Global Threat Report, AI-enabled adversary activity increased by 89% compared to the previous year, and zero-day attacks increased by 42%. These numbers reflect the environment in which we are currently releasing software, one shaped by increasingly complex adversarial AI workflows.
Here are some important lessons QA teams should learn from this:
- Test authentication from an attacker’s perspective: If your 2FA logic has any special exemptions or conditions, test them thoroughly. It’s not enough to confirm that those conditions work. Also test how the system reacts when someone tries to manipulate it by changing the order of the requests.
- Don’t blame the security team for logic errors: The bug in this case is not the security team’s fault. It was a design mistake that occurred when the code was written, and no one tested it thoroughly. If such tests are included early in development and QA (shift-left testing), we can fix these issues before hackers find them.
- Test authentication regressions after every release: When developers make a small change to the code, there is a chance that it could introduce an unintended side effect in the authentication flow. Therefore, automated regression testing of authentication paths after each deployment is no longer optional.
- Increase test coverage for edge-case interaction sequences: Single-step functional tests alone cannot catch such logic bugs. We need to have multi-step tests that show what happens when a real user or a hacker sends multiple requests one after the other.
- Always keep dependencies and patches updated: The bug we mentioned above was in an open-source tool. Companies that use old versions of such admin tools without updating them are at the greatest risk. Automated regression testing against new patches should be part of the release cycle.
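The attacker’s-perspective and multi-step checks above can be sketched as a small Python test that enumerates request orderings. This is an illustration under stated assumptions, not a drop-in suite: `ToySession`, its methods, and the `trust_recovery_flag` switch are hypothetical stand-ins for a real system under test. The idea is to try every short sequence of steps that does not include a valid 2FA code and flag any sequence that still reaches the admin panel.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


class ToySession:
    """Hypothetical stand-in for the system under test."""

    def __init__(self, trust_recovery_flag: bool) -> None:
        # trust_recovery_flag=True simulates a flawed release.
        self.trust_recovery_flag = trust_recovery_flag
        self.password_ok = False
        self.second_factor_ok = False
        self.recovery_started = False

    def login(self) -> None:
        self.password_ok = True

    def start_recovery(self) -> None:
        # Only honored before login in this toy model.
        if not self.password_ok:
            self.recovery_started = True

    def logout(self) -> None:
        self.password_ok = False
        self.second_factor_ok = False

    def admin_allowed(self) -> bool:
        second_step = self.second_factor_ok or (
            self.trust_recovery_flag and self.recovery_started
        )
        return self.password_ok and second_step


# Every step an attacker can send WITHOUT knowing a valid 2FA code.
STEPS_WITHOUT_2FA: Dict[str, Callable[[ToySession], None]] = {
    "login": ToySession.login,
    "start_recovery": ToySession.start_recovery,
    "logout": ToySession.logout,
}


def find_bypasses(
    trust_recovery_flag: bool, max_len: int = 3
) -> List[Tuple[str, ...]]:
    """Return every step sequence that reaches the admin panel without 2FA."""
    bypasses = []
    for length in range(1, max_len + 1):
        for sequence in product(STEPS_WITHOUT_2FA, repeat=length):
            session = ToySession(trust_recovery_flag)
            for name in sequence:
                STEPS_WITHOUT_2FA[name](session)
            if session.admin_allowed():
                bypasses.append(sequence)
    return bypasses


print(find_bypasses(trust_recovery_flag=False))    # []
print(find_bypasses(trust_recovery_flag=True)[0])  # ('start_recovery', 'login')
```

Run against a real application, the same loop would drive HTTP requests instead of method calls, and the assertion "no sequence without a valid code reaches the admin panel" could be run after every deployment as an authentication regression check. The number of sequences grows quickly with `max_len`, so a small step list and a short maximum length are a reasonable starting point.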
AI Is Finding Bugs Faster. Can Your Tests Keep Up?
Google’s GTIG report doesn’t cover just this one incident. It describes a much broader set of activity: an Android malware family tracked as PromptSpy that misuses Gemini to interpret on-screen elements on the compromised device, a sign of the shift toward automated malware development; foreign hacking groups that create fake code to evade detection; and threat actors who try to bypass the guardrails of premium AI tiers and run automated account-pooling pipelines to get around usage limits on models. In addition, threat actors are training AI models on GitHub-hosted datasets containing more than 5,000 documented security vulnerabilities.
All of this shows how the working relationship between cybercrime groups and AI tooling is evolving. So don’t think of this contest as something that is just around the corner. It’s already happening.
So, what can QA teams and developers do? Don’t rely solely on purchasing new security tools. Instead, test the code you’re releasing today for vulnerabilities. In the case above, properly testing the authentication edge cases during development could have surfaced the bug much earlier. That opportunity is still in your hands.
Many professional teams now rely on AI-based testing platforms to address such problems more quickly. These platforms make it easier to test even large processes, without the overhead of traditional tests that require extensive coding and maintenance.
For example, modern automation tools like testRigor can help here. Test cases can be written in plain English, and the tool helps test authentication flows and multi-step processes, where logic flaws often hide. In an era of increasing attacks and rapidly evolving threats, increasing test coverage is one of the most effective ways to secure our systems.




