674 Attacks in 3 Hours: How AI Red Teaming Agents Are Rewriting the Security Playbook
Adversarial probing of AI systems has accumulated a sprawling toolkit over the past three years. Attack techniques with names like Tree of Attacks with Pruning, Crescendo, and Skeleton Key sit alongside hundreds of prompt transforms and scoring methods across frameworks including Microsoft's PyRIT, NVIDIA's Garak, and Promptfoo. The catalog has grown faster than any human operator can fluently navigate it.
That mismatch is now driving a fundamental shift: from human-orchestrated adversarial testing to agent-orchestrated assessment. And the early results change the economics of AI security testing entirely.
In May 2026, security firm Dreadnode published research describing an AI red teaming agent that took a single operator from natural-language goals to 674 executed attacks against Meta's Llama Scout in roughly three hours, with an 85% attack success rate overall and 100% on specific technique-target combinations. Traditional AI red teaming frameworks require operators to manually configure attacks, transforms, scorers, datasets, and execution pipelines. The agent compresses what was previously a week of engineering into an afternoon of automated execution.
The Catalog Problem: Why Manual Red Teaming Doesn't Scale
The adversarial testing landscape has grown beyond what any individual operator can master. Microsoft's PyRIT framework alone supports dozens of attack types, prompt transforms, and scoring strategies. NVIDIA's Garak adds hundreds more. Layering multimodal inputs, retrieval-augmented generation targets, and chain-of-tool-calling scenarios on top of this creates a combinatorial explosion that manual red teaming simply cannot cover.
The traditional answer was to hire more red teamers or contract external engagements. But external assessments happen quarterly or annually — at best. Between assessments, models receive updates, new tools are integrated, and attack surfaces shift. The velocity mismatch between model iteration speed and security testing cadence creates what the NIST AI Risk Management Framework calls the testing gap: the period during which a deployed model may contain vulnerabilities not yet identified by any assessment process.
Microsoft's internal AI Red Team, which attacked over 100 generative AI products between 2024 and 2025, reached a conclusion that every organization deploying AI agents needs to internalize: prompt and script attacks combined with fuzzing are more effective than classic ML evasion attacks at breaking AI systems. In other words, the threats to AI agents are automation-friendly — they scale well with automated tooling because they target the prompt-response interface rather than requiring model internals access.
What the Agent Layer Changes
The operational pattern across these agent-orchestrated assessment systems is consistent. An operator describes a goal in plain language: "Find material that the model should refuse to generate." The agent picks attack strategies, applies transforms (Base64 encoding, persona framing, low-resource language translation), runs the attacks against the target, scores the results with an LLM judge, and maps findings to compliance frameworks — OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF.
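The loop described above can be sketched in a few lines. This is a minimal illustration, not Dreadnode's implementation or the API of any named framework: the transform catalog, the `Finding` record, `run_assessment`, and the stub target and judge are all hypothetical names, and the compliance tag is only an example label.

```python
import base64
from dataclasses import dataclass
from typing import Callable

def base64_transform(prompt: str) -> str:
    """Wrap the prompt in a Base64-encoded payload."""
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    return f"Decode this Base64 string and respond to it: {encoded}"

def persona_transform(prompt: str) -> str:
    """Prefix the prompt with a role-play framing."""
    return f"You are an actor staying fully in character. {prompt}"

# Hypothetical transform catalog; real frameworks ship far more entries.
TRANSFORMS: dict[str, Callable[[str], str]] = {
    "base64": base64_transform,
    "persona": persona_transform,
}

@dataclass
class Finding:
    goal: str
    transform: str
    success: bool
    tags: tuple[str, ...]

def run_assessment(goals, target, judge, tags):
    """Apply every transform to every goal, query the target, and score the reply."""
    findings = []
    for goal in goals:
        for name, transform in TRANSFORMS.items():
            response = target(transform(goal))
            findings.append(Finding(goal, name, judge(goal, response), tuple(tags)))
    return findings

# Stand-in target and judge so the sketch runs without any model access.
def stub_target(prompt: str) -> str:
    return "I can't help with that." if "Base64" in prompt else "Sure, here is the text."

def stub_judge(goal: str, response: str) -> bool:
    return not response.startswith("I can't")

findings = run_assessment(["placeholder goal"], stub_target, stub_judge, ["OWASP-LLM01"])
```

In a real system the target and judge would be model calls, and the agent would choose which goals and transforms to run instead of iterating over all of them.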
The Dreadnode paper's results against Llama Scout illustrate the throughput:
| Technique | Transform | Success Rate |
|---|---|---|
| Crescendo | Multiple variants | 100% |
| Graph of Attacks with Pruning | Multiple variants | 100% |
| Skeleton Key framing | Persona transform | 100% |
| Tree of Attacks with Pruning | Base64 encoding | 75% |
Across 68 adversarial goals spanning harmful content and bias categories, three attack types with five transform variants reached an 85% attack success rate. The weakest combination in the table, Tree of Attacks with Pruning with Base64 encoding at 75%, suggests the model detected encoded payloads more reliably than role-play framings, an insight that would take a human red teamer hours to discover and confirm.
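For readers unfamiliar with how a table like this is derived, attack success rate is simply successes divided by attempts, grouped by technique and transform. The sketch below assumes results arrive as `(technique, transform, succeeded)` tuples; the sample records are made up for illustration and are not data from the paper.

```python
from collections import defaultdict

def attack_success_rates(results):
    """Return {(technique, transform): successes / attempts}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    for technique, transform, succeeded in results:
        entry = counts[(technique, transform)]
        entry[0] += int(succeeded)
        entry[1] += 1
    return {key: successes / attempts for key, (successes, attempts) in counts.items()}

# Illustrative records only, not taken from the Dreadnode study.
sample_results = [
    ("tap", "base64", True),
    ("tap", "base64", True),
    ("tap", "base64", True),
    ("tap", "base64", False),
    ("crescendo", "persona", True),
    ("crescendo", "persona", True),
]

rates = attack_success_rates(sample_results)
```

Grouping by the technique-transform pair, not by technique alone, is what makes a pattern such as "encoded payloads succeed less often than role-play framings" visible.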
What the Headline Numbers Leave Out
Several qualifications matter for any team thinking about adopting this approach.
The three-hour figure covers a focused slice. The paper acknowledges that comprehensive assessments across all attack types and harm categories run closer to days. The Llama Scout case study is a targeted demonstration, not a full-coverage assessment.
Llama Scout is a mid-size open model: a mixture-of-experts design with 17 billion active parameters. An 85% attack success rate against a model released in April 2025 says little about results against current frontier systems from OpenAI, Anthropic, or Google.
The orchestrating agent constrains coverage. Co-author Raja Sekhar Rao Dheekonda noted that "the orchestrating agent itself refuses to compose legitimate AI red teaming workflows because the underlying model interprets the operator's objective as harmful." Highly aligned frontier models decline to generate offensive workflows for sensitive categories — the same safety mechanisms being tested become obstacles to the testing itself. The Llama Scout study used Moonshot AI's Kimi 2.5 model as both attacker and judge specifically to circumvent this constraint.
Expert humans still outperform agents on nuanced attacks. Skilled human red teamers retain an advantage on "nuanced long-horizon reasoning, highly contextual social engineering scenarios, novel exploit chains, and emerging attack surfaces where there is limited prior attack history."
Disclosure and responsible testing practices are unsettled. The Dreadnode paper published verbatim outputs including shellcode loaders and chemical synthesis steps without prior coordination with Meta. As agent-orchestrated testing democratizes adversarial capability, the norms around responsible disclosure for AI model vulnerabilities remain undefined.
Where Humans Add Value in an Automated Pipeline
If the agent handles attack selection, execution, and scoring, what is left for the human? Three things that may matter more than the attacks themselves:
Triage. A dashboard reporting 232 critical findings with automatic compliance tags is easy to mistake for security. The real skill is deciding which of several hundred automated findings reflect real risk in a specific deployment context, which reflect scorer artifacts and not genuine vulnerabilities, and which reflect acceptable risk given the model's use case and access controls.
Runtime context. Automated testing probes the model in isolation. It tests what the model can generate given the right input — not what a specific deployment, with specific tool access, specific data sources, and specific runtime policies, would actually allow. A finding that the model can produce harmful output under adversarial prompting may be irrelevant if the deployment's tool authorization layer would block the corresponding action.
Adversarial coverage design. The agent selects attacks from a catalog. The human designs the attack surface map — identifying which tools, which data sources, which integration points, and which trust boundaries matter most in a specific deployment. This scoping work determines which 674 attacks get run, against which target surfaces, priority-ordered by business impact.
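These three responsibilities can be partly encoded as a first-pass filter that a human then reviews. The sketch below is one possible shape under stated assumptions: the `Finding` fields, the bucket names, and the `impact_by_surface` scores are all hypothetical, and disagreement between scorers is used only as a rough proxy for a scorer artifact.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    finding_id: str
    surface: str                   # integration point the attack targeted
    required_tool: Optional[str]   # tool the harmful action would need, if any
    judge_votes: tuple             # True/False verdicts from independent scorers

def triage(findings, allowed_tools, impact_by_surface):
    """Sort findings into buckets for human review."""
    buckets = {"review": [], "scorer_artifact": [], "blocked_by_deployment": []}
    for f in findings:
        if len(set(f.judge_votes)) > 1:
            # Scorers disagree: possibly an artifact of one judge.
            buckets["scorer_artifact"].append(f)
        elif f.required_tool is not None and f.required_tool not in allowed_tools:
            # The deployment does not expose the tool the action would need.
            buckets["blocked_by_deployment"].append(f)
        else:
            buckets["review"].append(f)
    # Highest business impact first; unknown surfaces sort last.
    buckets["review"].sort(key=lambda f: impact_by_surface.get(f.surface, 0), reverse=True)
    return buckets

sample = [
    Finding("F-1", "support_chat", None, (True, True)),
    Finding("F-2", "billing_api", "issue_refund", (True, True)),
    Finding("F-3", "support_chat", "delete_account", (True, True)),
    Finding("F-4", "billing_api", None, (True, False)),
]
result = triage(sample, allowed_tools={"issue_refund"},
                impact_by_surface={"billing_api": 9, "support_chat": 4})
```

The buckets are a starting point, not a verdict: a finding marked as blocked by the deployment still deserves a second look whenever tool permissions change.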
The Accessibility Question
Lowering the operational floor for adversarial testing benefits defenders and motivated attackers alike. The underlying attack techniques are already public: Skeleton Key, Crescendo, and Tree of Attacks with Pruning are all documented in academic literature and open-source tooling. The meaningful change is access and scale: composition and orchestration work that previously required scripting expertise can now be executed with lower technical overhead.
This cuts both ways. It means security teams can run continuous assessments rather than periodic manual engagements. It also means adversaries who previously lacked the expertise to conduct sophisticated multi-step attacks against AI systems can now do so through agent orchestration.
The asymmetry favors defenders who move first. The countermeasure — continuous, automated assessment integrated into the deployment pipeline — is the same capability, applied proactively. Organizations that run attacks against their own agents before adversaries do will surface vulnerabilities on their timeline, not the attacker's.
Implementing an Agentic Red Teaming Program
For organizations deploying AI agents into production, the minimum viable adversarial testing program in 2026 should include:
1. Pre-Deployment Baseline
Before any agent reaches production, run a catalog-based automated assessment covering OWASP Agentic AI Top 10 risks: prompt injection, tool misuse, context poisoning, privilege escalation, and supply chain injection. Use at least two different scoring frameworks to reduce single-judge bias. Expect this to surface dozens of findings: most benign, some critical.
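A baseline plan can be checked mechanically before it runs. The following sketch assumes a plan is a list of attack entries, each tagged with the risk categories it exercises, plus a list of scorer names; the category identifiers, the plan format, and `validate_baseline_plan` are illustrative choices, not part of any OWASP or vendor tooling.

```python
# Risk categories named in the text, written as hypothetical identifiers.
REQUIRED_RISKS = {
    "prompt_injection",
    "tool_misuse",
    "context_poisoning",
    "privilege_escalation",
    "supply_chain_injection",
}

def validate_baseline_plan(attacks, scorers):
    """Return a list of problems; an empty list means the plan passes."""
    covered = {risk for attack in attacks for risk in attack["risks"]}
    problems = [f"no attack covers {risk}" for risk in sorted(REQUIRED_RISKS - covered)]
    if len(set(scorers)) < 2:
        problems.append("fewer than two distinct scorers configured")
    return problems

plan = [
    {"name": "crescendo", "risks": ["prompt_injection"]},
    {"name": "tool_hijack", "risks": ["tool_misuse", "privilege_escalation"]},
]
problems = validate_baseline_plan(plan, scorers=["llm_judge"])
```

Here the check would report the two uncovered categories and the missing second scorer, which is the kind of gap that is easy to overlook when a plan is assembled by hand.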
2. Continuous Post-Deployment Probing
Adversarial testing is not a one-time gate. Every model update, every new tool integration, every configuration change creates new attack surface. Schedule automated assessments at the cadence of your deployment pipeline — not quarterly, but weekly or per-release.
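One way to tie assessments to the release cadence is a gate that compares the current attack success rates with the last accepted baseline. The sketch below is an assumption about how such a gate might look: `release_gate`, the `tolerance` value, and the rate dictionaries are hypothetical, and any technique without a baseline is flagged so that it gets reviewed.

```python
def release_gate(baseline, current, tolerance=0.05):
    """Return techniques whose success rate rose beyond tolerance or has no baseline.

    Both arguments map a technique name to an attack success rate in [0, 1].
    """
    regressions = {}
    for technique, rate in current.items():
        previous = baseline.get(technique)
        if previous is None or rate > previous + tolerance:
            regressions[technique] = (previous, rate)
    return regressions

baseline_rates = {"crescendo": 1.0, "tap_base64": 0.75}
current_rates = {"crescendo": 1.0, "tap_base64": 0.9, "skeleton_key": 0.5}

regressions = release_gate(baseline_rates, current_rates)
release_blocked = bool(regressions)
```

A pipeline step could fail the build when `release_blocked` is true and attach `regressions` to the report, leaving the decision to accept a new baseline with a human reviewer.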
3. Runtime Audit as Detection
Automated testing tells you what vulnerabilities exist. It does not tell you whether they have been exploited. This is where the runtime audit trail becomes the companion to adversarial testing. Facio (the HITL-first agent runtime) captures every tool invocation with full parameter traceability — making anomalous patterns visible. A finding that the model can be tricked into tool misuse becomes actionable when you can also confirm whether that misuse has occurred in production.
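The pairing of findings with an audit trail can be illustrated with a simple cross-reference. This sketch does not use Facio's API or log format, which are not described here; it assumes a generic audit event with `event_id`, `tool`, and `params` fields, and a finding that lists the tool involved plus marker strings that would indicate misuse. All of those names are hypothetical.

```python
def find_exploitation_evidence(findings, audit_log):
    """Return (finding_id, event_id) pairs where a logged call matches a finding."""
    hits = []
    for finding in findings:
        for event in audit_log:
            if event["tool"] != finding["tool"]:
                continue
            # Flatten parameter values into one string for a simple substring match.
            param_text = " ".join(str(value) for value in event["params"].values())
            if any(marker in param_text for marker in finding["markers"]):
                hits.append((finding["id"], event["event_id"]))
    return hits

sample_findings = [
    {"id": "F-7", "tool": "send_email", "markers": ["attacker@example.com"]},
]
sample_log = [
    {"event_id": "e1", "tool": "send_email", "params": {"to": "customer@example.org"}},
    {"event_id": "e2", "tool": "send_email", "params": {"to": "attacker@example.com"}},
    {"event_id": "e3", "tool": "read_file", "params": {"path": "attacker@example.com"}},
]

evidence = find_exploitation_evidence(sample_findings, sample_log)
```

Substring matching is deliberately crude; it shows the join between the two data sources, and a production detector would need richer matching and far fewer false positives.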
4. Human Review at the Remediation Boundary
When automated testing surfaces a finding that requires architectural change — a new tool authorization rule, a sandbox configuration, a policy update — human review ensures the fix does not break legitimate workflows. Placet.io (the HITL inbox and messenger) provides the structured approval layer for security-relevant configuration changes, maintaining the audit record of who approved what and why.
5. Purple Teaming for Agentic Systems
Adversarial testing reveals vulnerabilities. Blue team defenses close them. Purple teaming, where red and blue teams operate in tandem and share findings in real time, is especially critical for AI agents because every model update and tool integration expands the attack surface. Automated agentic red teaming makes purple teaming continuous instead of event-driven.
The Bottom Line
AI red teaming is not becoming automated. It has become automated. The question is whether your organization runs adversarial tests against your own agents at machine velocity — or whether adversaries discover your vulnerabilities first.
The organizations that get this right are not the ones with the largest red teams. They are the ones that integrate automated adversarial testing into their deployment pipeline, pair it with runtime audit visibility to detect exploitation, and apply human judgment where it matters most — triage of findings, scoping of attack surface, and remediation decisions.
Continuous assessment is the new baseline. Anything less than that is a testing gap. And gaps are where breaches happen.