AI security is crossing a line that was always going to matter.

For the last few years, the easier story has been that AI can help find bugs. That is useful, impressive, and mostly comfortable. A tool scans code, identifies a flaw, maybe suggests a patch, and a human team decides what to do next.

The harder story is what happens when the same class of tools can move from detection into exploitation.

That is the center of new research around ExploitGym, a benchmark built to test whether AI agents can turn known software vulnerabilities into working attacks. According to The Register, the benchmark was created by researchers from UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google.

The result is not that AI agents are suddenly perfect hackers. They are not. The result is more uncomfortable than that: frontier agents are already good enough to exploit a meaningful slice of real vulnerabilities when pointed in the right direction.

What ExploitGym tested

ExploitGym is built around 898 real vulnerabilities across applications, Google's V8 JavaScript engine, and the Linux kernel. The setup gives an AI agent a vulnerability and a proof-of-concept input that triggers it, then tests whether the agent can build an exploit capable of arbitrary code execution.

That framing matters. This is not just asking a model to describe a bug. It is asking whether the model, wrapped in an agentic command-line workflow, can do the operational work of turning a weakness into a usable exploit.

The strongest performers were Claude Mythos Preview and GPT-5.5. The Register reports that Mythos Preview successfully exploited 157 test instances, while GPT-5.5 managed 120 in the allotted two-hour window.

Those numbers do not mean these systems can reliably exploit everything. They do mean the capability is no longer theoretical.
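A quick calculation puts the raw counts in proportion. The sketch below uses the counts quoted above and assumes each model was scored against all 898 benchmark instances, which the reporting implies but does not spell out.

```python
# Counts as quoted above. The shared denominator of 898 is an assumption.
TOTAL_INSTANCES = 898
exploited = {"Claude Mythos Preview": 157, "GPT-5.5": 120}


def success_rate(successes: int, total: int) -> float:
    """Return the fraction of benchmark instances that were exploited."""
    return successes / total


for model, count in exploited.items():
    rate = success_rate(count, TOTAL_INSTANCES)
    print(f"{model}: {count}/{TOTAL_INSTANCES} = {rate:.1%}")
```

Under that assumption the best agent succeeds on roughly one instance in six, and the runner-up on about one in seven.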

The dangerous part is the workflow

A language model on its own is one thing. An agent with tools is another.

The model is not merely producing text. It can inspect output, revise commands, test assumptions, write code, run tools, and continue iterating. That loop is what changes the security picture. A weak model can describe an exploit badly. A stronger agent can try, fail, adjust, and try again.

That makes exploitation less like a single answer and more like a process. The agent does not need to be correct on the first pass. It needs enough persistence, context, and tool access to keep moving.
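The try, fail, adjust loop described above can be shown with a deliberately harmless toy. This is a minimal sketch, not how any benchmark agent is built: the function names are hypothetical, and the "task" is guessing a hidden number from feedback. It illustrates only that feedback plus persistence can replace first-pass correctness.

```python
from typing import Callable, Optional


def iterate_until_success(
    propose: Callable[[Optional[str]], int],
    check: Callable[[int], Optional[str]],
    max_attempts: int,
) -> Optional[int]:
    """Generic try/observe/revise loop.

    propose(feedback) returns the next attempt given the previous feedback.
    check(attempt) returns None on success, or a feedback string on failure.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        attempt = propose(feedback)
        feedback = check(attempt)
        if feedback is None:
            return attempt
    return None  # attempt budget exhausted


def make_bisecting_proposer(low: int, high: int) -> Callable[[Optional[str]], int]:
    """Toy 'agent': narrows a numeric range using the feedback it receives."""
    state = {"low": low, "high": high, "last": low}

    def propose(feedback: Optional[str]) -> int:
        if feedback == "too low":
            state["low"] = state["last"] + 1
        elif feedback == "too high":
            state["high"] = state["last"] - 1
        state["last"] = (state["low"] + state["high"]) // 2
        return state["last"]

    return propose


def make_checker(target: int) -> Callable[[int], Optional[str]]:
    """Toy 'environment': reports success or a directional hint."""

    def check(attempt: int) -> Optional[str]:
        if attempt == target:
            return None
        return "too low" if attempt < target else "too high"

    return check


result = iterate_until_success(make_bisecting_proposer(0, 100), make_checker(37), max_attempts=10)
print(result)
```

The first two guesses in this toy are wrong. The loop still succeeds because each failure narrows the next attempt, and with a budget of only two attempts it would return `None`.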

This is why the benchmark matters. The question is not whether AI can say scary things about vulnerabilities. The question is whether it can perform the boring, mechanical work that turns a vulnerability into an actual intrusion path.

Guardrails helped, but did not settle the issue

There is an important caveat: The Register notes that ExploitGym testing was done with security guardrails disabled. When GPT-5.5 was retested with default safety filters active, the model refused 88.2 percent of the time before making any tool call.

That is meaningful. It suggests safety layers can block many direct requests.

But it does not end the story. The Register also reports that security researchers have crafted prompts that avoid triggering refusals. That is the familiar weakness of guardrail-first security: it can reduce obvious abuse, but it does not erase the underlying capability.

Once a model can do a task, policy controls become part of the defense, not the entire defense.

Off-script behavior is the warning sign

One of the more interesting findings is that agents sometimes went off-script in capture-the-flag environments. They were pointed toward one bug, but sometimes found another path to the flag.

That should get more attention than the raw success counts.

A tool that only follows a narrow instruction is easier to reason about. A tool that can discover alternative paths is more powerful, but also harder to contain. In security work, that cuts both ways. Defenders may use diverse agents to find flaws faster. Attackers may use the same diversity to find paths a human operator missed.

The research also found that different models discovered different exploits. That suggests multiple agents can produce complementary results, which is useful for defense and worrying for offense.
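Complementary results are easy to picture as set operations. In the sketch below the agent names and instance IDs are invented purely for illustration and are not data from the benchmark.

```python
# Hypothetical results: which instances each agent solved (made-up IDs).
solved_by = {
    "agent_a": {"inst-01", "inst-02", "inst-03", "inst-05"},
    "agent_b": {"inst-02", "inst-04", "inst-05"},
}


def combined_coverage(results: dict[str, set[str]]) -> set[str]:
    """Instances solved by at least one agent."""
    return set().union(*results.values())


def unique_to(results: dict[str, set[str]], name: str) -> set[str]:
    """Instances solved only by the named agent."""
    others = set().union(*(s for n, s in results.items() if n != name))
    return results[name] - others


print(sorted(combined_coverage(solved_by)))
for agent in solved_by:
    print(agent, "only:", sorted(unique_to(solved_by, agent)))
```

In this toy data neither agent covers everything alone, but together they cover all five instances. That is the sense in which running several different models widens coverage for whoever runs them.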

This changes vulnerability economics

The real shift is not that every attacker instantly becomes elite. The shift is that exploit development may become cheaper, faster, and more repeatable.

Exploit work has traditionally required skill, patience, target knowledge, and debugging time. Those constraints do not disappear, but automation can erode them. An agent that turns a known flaw into a working exploit some of the time is still significant if it can be run cheaply across many targets.
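The arithmetic behind "some of the time, across many targets" is worth making explicit. The sketch below assumes each attempt is independent with the same success probability, which is a simplification, and the 10 percent figure is a hypothetical input rather than a measured rate.

```python
def p_at_least_one(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


def expected_successes(p_single: float, attempts: int) -> float:
    """Expected number of successes across independent attempts."""
    return p_single * attempts


# Hypothetical per-target success rate, for illustration only.
P_SINGLE = 0.10
for n in (1, 10, 50):
    print(f"{n:>3} targets: P(at least one) = {p_at_least_one(P_SINGLE, n):.3f}, "
          f"expected successes = {expected_successes(P_SINGLE, n):.1f}")
```

Under these assumptions, a tool that works only one time in ten has better than a 65 percent chance of at least one success across ten independent targets. Real attempts are correlated, so treat this as an intuition aid rather than a forecast.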

Security teams already struggle with vulnerability volume. They have scanners, advisories, patch queues, dependency trees, container images, cloud services, and old systems nobody wants to touch. If exploitability becomes easier to test at scale, the time between disclosure and weaponization shrinks.

That changes prioritization. A vulnerability is no longer just a CVSS score or a paragraph in a bulletin. It becomes a question of whether an agent can produce a working path to code execution before the defender patches.
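One way to picture that change is a triage ordering that puts demonstrated exploitability ahead of the raw score. This is a minimal sketch under stated assumptions: the `Finding` fields, the ordering rule, and the identifiers are all hypothetical, not an established standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    identifier: str            # hypothetical label, not a real CVE
    cvss: float                # base score from the advisory
    internet_exposed: bool     # reachable from outside in this deployment
    exploit_reproduced: bool   # a working path was demonstrated in a lab


def priority_key(finding: Finding) -> tuple[bool, bool, float]:
    """Rank by reproduced exploit first, then exposure, then CVSS."""
    return (finding.exploit_reproduced, finding.internet_exposed, finding.cvss)


def rank(findings: list[Finding]) -> list[Finding]:
    """Return findings ordered from most to least urgent."""
    return sorted(findings, key=priority_key, reverse=True)


findings = [
    Finding("VULN-A", cvss=9.8, internet_exposed=False, exploit_reproduced=False),
    Finding("VULN-B", cvss=7.5, internet_exposed=True, exploit_reproduced=True),
    Finding("VULN-C", cvss=8.1, internet_exposed=True, exploit_reproduced=False),
]

for f in rank(findings):
    print(f.identifier, f.cvss)
```

In this toy ordering the 7.5 finding with a reproduced exploit outranks the unexposed 9.8. A real policy would weigh more factors, but the point stands: the score alone no longer decides the queue.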

Good defensive tools will look more like offensive tools

The uncomfortable truth is that defenders need some of the same capabilities.

If AI agents can help attackers test exploitability, defenders need agents that can do the same thing inside controlled environments. Not to spray exploits across production systems, but to answer practical questions: Is this bug reachable in our deployment? Can it be weaponized against our configuration? Does the patch actually close the path? Which exposed systems matter first?

That is where this technology becomes useful rather than just dangerous.

The defensive version needs containment, audit logs, scoped permissions, reproducible evidence, and strict separation from live production. It should not be a magic exploit button wired into sensitive infrastructure. It should be a controlled lab instrument for proving risk before an attacker does.
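Two of those requirements, scoped permissions and audit logs, can be sketched in a few lines. The class names, hostnames, and log fields below are hypothetical. A real harness would also need isolation at the network and operating system level, which this sketch does not attempt.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabScope:
    """Allowlist of targets the agent may touch; everything else is denied."""
    allowed_targets: frozenset[str]

    def permits(self, target: str) -> bool:
        return target in self.allowed_targets


@dataclass
class AuditLog:
    """Append-only record of every requested action, allowed or not."""
    entries: list[str] = field(default_factory=list)

    def record(self, action: str, target: str, allowed: bool) -> None:
        entry = {"ts": time.time(), "action": action, "target": target, "allowed": allowed}
        self.entries.append(json.dumps(entry, sort_keys=True))


def request_action(scope: LabScope, log: AuditLog, action: str, target: str) -> bool:
    """Check scope, log the decision, and report whether the action may proceed."""
    allowed = scope.permits(target)
    log.record(action, target, allowed)
    return allowed


# Hypothetical hostnames for illustration.
scope = LabScope(allowed_targets=frozenset({"lab-vm-01.test"}))
log = AuditLog()

print(request_action(scope, log, "run_poc", "lab-vm-01.test"))       # in scope
print(request_action(scope, log, "run_poc", "prod-db.example.com"))  # out of scope
print(len(log.entries))
```

The design choice is deny-by-default with logging before anything runs, so a refused request leaves the same evidence trail as a permitted one.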

The security industry has to stop pretending detection is the finish line

Finding a vulnerability is not the same as understanding its real risk.

Some bugs are technically valid but useless in practice. Some look boring and become catastrophic in the right environment. Some are only dangerous when chained with another flaw, a bad configuration, or a leaked credential.

AI agents make that gray area more important. They can test combinations, generate hypotheses, and push past the first obvious answer. That can help defenders cut through noise. It can also help attackers find the one path that works.

The old comfort was that exploit development remained a specialized craft. That is still partly true. But the boundary is moving.

The right response is not panic

One wrong response is to pretend this is science fiction. The other is to treat every AI security benchmark as proof that machines have replaced expert operators.

The sober reading is narrower and more useful: frontier AI agents can now perform parts of exploit development against real vulnerabilities. They are inconsistent, but improving. Safety filters can block many direct requests, but cannot be treated as a complete solution. The same capability will be used by both attackers and defenders.

That means security teams should prepare for a world where exploitability testing becomes more automated, more common, and closer to the moment a vulnerability becomes public.

The question is no longer whether AI can find the problem.

The question is what happens when it can prove the problem works.

Sources