Red teaming LLMs exposes a harsh truth about the AI security arms race

Unrelenting, persistent attacks make frontier models fail, and the patterns of failure vary by model and developer. Red teaming shows that it isn't the sophisticated, complex attack that brings a model down; it's an attacker automating continuous, randomized attempts until the model inevitably fails.

That’s the harsh truth that AI app and platform builders need to plan for as they build each new release of their products. Betting an entire build-out on a frontier model that red teams can break through persistence alone is like building a house on sand. Even with red teaming, frontier LLMs, including those with open weights, are lagging behind adversarial and weaponized AI.

The arms race has already started

Cybercrime costs reached $9.5 trillion in 2024 and forecasts exceed $10.5 trillion for 2025. LLM vulnerabilities contribute to that trajectory. A financial services firm deploying a customer-facing LLM without adversarial testing saw it leak internal FAQ content within weeks. Remediation cost $3 million and triggered regulatory scrutiny. One enterprise software company had its entire salary database leaked after executives used an LLM for financial modeling, VentureBeat has learned.

The UK AISI/Gray Swan challenge ran 1.8 million attacks across 22 models. Every model broke. No current frontier system resists determined, well-resourced attacks.

Builders face a choice. Integrate security testing now, or explain breaches later. The tools exist — PyRIT, DeepTeam, Garak, OWASP frameworks. What remains is execution.

Organizations that treat LLM security as a feature rather than a foundation will learn the difference the hard way. The arms race rewards those who refuse to wait.

Red teaming reflects how nascent frontier models are

The gap between offensive capability and defensive readiness has never been wider. "If you've got adversaries breaking out in two minutes, and it takes you a day to ingest data and another day to run a search, how can you possibly hope to keep up?" Elia Zaitsev, CTO of CrowdStrike, told VentureBeat back in January. Zaitsev also implied that adversarial AI is progressing so quickly that the traditional tools AI builders trust to power their applications can be weaponized in stealth, jeopardizing product initiatives in the process.

Red teaming results to this point present a paradox, especially for AI builders who need a stable base platform to build on: the testing proves that every frontier model fails under sustained pressure.

One of my favorite things to do immediately after a new model comes out is to read the system card. It’s fascinating to see how well these documents reflect the red teaming, security, and reliability mentality of every model provider shipping today.

Earlier this month, I looked at how Anthropic’s and OpenAI’s red teaming practices reveal how differently the two companies approach enterprise AI. That’s important for builders to know, because getting locked into a platform that isn’t compatible with the building team’s priorities can be a massive waste of time.

Attack surfaces are moving targets, further challenging red teams

Builders need to understand how fluid the attack surfaces are that red teams attempt to cover, even though those teams have incomplete knowledge of the many threats their models will face.

A good place to start is with one of the best-known frameworks. OWASP's 2025 Top 10 for LLM Applications reads like a cautionary tale for any business building AI apps on top of existing LLMs. Prompt injection sits at No. 1 for the second consecutive year. Sensitive information disclosure jumped from sixth to second place. Supply chain vulnerabilities climbed from fifth to third. These rankings reflect production incidents, not theoretical risks.

Five new vulnerability categories appeared in the 2025 list: excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. Each represents a failure mode unique to generative AI systems. No one building AI apps can afford to ignore these categories without risking shipping vulnerabilities that security teams never detected, or worse, lost track of as threat surfaces shifted.

"AI is fundamentally changing everything, and cybersecurity is at the heart of it. We're no longer dealing with human-scale threats; these attacks are occurring at machine scale," Jeetu Patel, Cisco's President and Chief Product Officer, emphasized to VentureBeat at RSAC 2025. Patel noted that AI-driven models are non-deterministic: "They won't give you the same answer every single time, introducing unprecedented risks."

"We recognized that adversaries are increasingly leveraging AI to accelerate attacks. With Charlotte AI, we're giving defenders an equal footing, amplifying their efficiency and ensuring they can keep pace with attackers in real-time," Zaitsev told VentureBeat.

How and why model providers validate security differently

Each frontier model provider wants to prove the security, robustness, and reliability of their system by devising a unique and differentiated red teaming process that is often explained in their system cards.

From the system cards, it doesn’t take long to see how each provider’s approach to red teaming reflects its stance on security validation, versioning compatibility (or the lack of it), persistence testing, and its willingness to torture-test models with unrelenting attacks until they break.

In many ways, red teaming of frontier models is a lot like quality assurance on a commercial jet assembly line. Anthropic’s mentality is comparable to the well-known tests Airbus, Boeing, Gulfstream, and others do. Often called the Wing Bend Test or Ultimate Load Test, the goal of these tests is to push a wing’s strength to the breaking point to ensure the most significant safety margins possible.

Be sure to read Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 55-page GPT-5 system card to see firsthand how different their measurement philosophies are. Anthropic relies on multi-attempt attack success rates from 200-attempt reinforcement learning campaigns. OpenAI reports single-attempt jailbreak resistance.

Gray Swan's Shade platform ran a series of adaptive adversarial campaigns against Claude models. The attack success rate (ASR) metrics quantify just how quickly models can degrade and break under pressure. Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at 10 attempts, and 63.0% at 100 attempts. In computer use with extended thinking, it held at 0% ASR even after 200 attempts, becoming the first model to saturate the benchmark. Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use. The gap between model tiers within the same family is now larger than some cross-vendor comparisons.
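Those numbers make more sense next to a naive baseline. The sketch below is my own back-of-the-envelope model, not Gray Swan's methodology: it assumes every attempt succeeds independently with the single-attempt probability and asks how quickly failure compounds. The measured figures diverge from it, which is exactly why multi-attempt ASR has to be measured empirically rather than extrapolated.

```python
# Crude independence model: if each attempt succeeded with probability p,
# at least one of k attempts would land with probability 1 - (1 - p)**k.
# Real adaptive campaigns behave differently (Gray Swan measured 33.6% at 10
# attempts and 63.0% at 100 for Opus 4.5 in coding), so treat this only as a
# way to build intuition about persistence, not as a predictor.
def naive_multi_attempt_asr(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

for k in (1, 10, 100, 200):
    print(f"{k:>3} attempts -> {naive_multi_attempt_asr(0.047, k):.1%}")
```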

On Gray Swan's prompt injection benchmark, Claude Opus 4.5 shows the strongest resistance with an attack success rate of 4.7%, compared with 12.5% for Gemini 3 Pro and 21.9% for GPT-5.1.

For OpenAI's o1, the system card reported 6% ASR for harmful text and 5% for malicious code based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% within two weeks of patching.

The report confirms the pattern: With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.

Models attempt to game tests during red teaming — adding to the paradox

Models also scheme against their own oversight when they predict they will be shut down. In Apollo Research's evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it thought it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time.

In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal. Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to less than 10% internally.

Evaluating Anthropic versus OpenAI red teaming results

| Dimension | Claude Opus 4.5 | GPT-5 | o1 |
| --- | --- | --- | --- |
| System card length | 153 pages | 55 pages | ~40 pages |
| Attack methodology | 200-attempt RL campaigns | Single-attempt + patching | Single-attempt |
| ASR @ 1 attempt | 4.7% (coding) | 89% raw (pre-patch) | 5-6% |
| ASR @ 200 attempts | 0% (saturates benchmark) | Not published | Not published |
| Prompt injection defense | 96-99.4% prevention | ~20% ASR (Fujitsu) | Higher than GPT-4o |
| Interpretability | ~10M neural features | CoT monitoring | CoT monitoring |
| Evaluation awareness | <10% | Identifies exact eval | 37% alignment faking |
| Scheming behavior | Orders of magnitude lower | Not published | 5% oversight off; 2% self-exfil |

Sources: Opus 4.5 system card, GPT-5 system card, o1 system card, Gray Swan, METR, Apollo Research

When models attempt to game a red teaming exercise because they anticipate being shut down, AI builders need to know the sequence of reasoning that produces that behavior. No one wants a model that resists shutdown in an emergency, or while it is commanding a given production process or workflow.

Defensive tools struggle against adaptive attackers

"Threat actors using AI as an attack vector has been accelerated, and they are so far in front of us as defenders, and we need to get on a bandwagon as defenders to start utilizing AI," Mike Riemer, Field CISO at Ivanti, told VentureBeat.

Riemer pointed to patch reverse-engineering as a concrete example of the speed gap: "They're able to reverse engineer a patch within 72 hours. So if I release a patch and a customer doesn't patch within 72 hours of that release, they're open to exploit because that's how fast they can now do it," he noted in a recent VentureBeat interview.

An October 2025 paper from researchers — including representatives from OpenAI, Anthropic, and Google DeepMind — examined 12 published defenses against prompt injection and jailbreaking. Using adaptive attacks that iteratively refined their approach, the researchers bypassed defenses with attack success rates above 90% for most. The majority of defenses had initially been reported to have near-zero attack success rates.

The gap between reported defense performance and real-world resilience stems from evaluation methodology. Defense authors test against fixed attack sets. Adaptive attackers iterate aggressively, a common theme across nearly every successful attempt to compromise a model.
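In the abstract, an adaptive attack is little more than a greedy search loop: mutate the strongest candidate so far, score it, keep whatever scores higher. The sketch below is a conceptual illustration only, not the method from the October 2025 paper; mutate() and judge() are placeholders.

```python
# Toy greedy-search loop illustrating adaptive attacks in the abstract.
# mutate() and judge() are placeholders, not any published technique.
from typing import Callable

def adaptive_search(seed_prompt: str,
                    mutate: Callable[[str], str],
                    judge: Callable[[str], float],
                    budget: int = 200) -> str:
    best, best_score = seed_prompt, judge(seed_prompt)
    for _ in range(budget):
        candidate = mutate(best)
        score = judge(candidate)  # e.g., how far the response strays from a refusal
        if score > best_score:
            best, best_score = candidate, score
    return best
```

A fixed benchmark can't capture that feedback loop, which is why defenses that score near zero against static attack sets collapse under adaptive pressure.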

Builders shouldn’t rely on frontier model builders' claims without also conducting their own testing.

Open-source frameworks have emerged to address the testing gap. DeepTeam, released in November 2025, applies jailbreaking and prompt injection techniques to probe LLM systems before deployment. Garak from Nvidia focuses on vulnerability scanning. MLCommons published safety benchmarks. The tooling ecosystem is maturing, but builder adoption lags behind attacker sophistication.

What AI builders need to do now

"An AI agent is like giving an intern full access to your network. You gotta put some guardrails around the intern." George Kurtz, CEO and founder of CrowdStrike, observed at FalCon 2025. That quote typifies the current state of frontier AI models as well.

Meta's Agents Rule of Two, published October 2025, reinforces this principle: Guardrails must live outside the LLM. File-type firewalls, human approvals, and kill switches for tool calls cannot depend on model behavior alone. Builders who embed security logic inside prompts have already lost.
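As a minimal sketch of what "outside the LLM" can mean in practice (illustrative names only, not Meta's reference code), the gate below enforces an allow-list, a human-approval requirement for high-impact actions, and an operator-controlled kill switch before any tool call executes:

```python
# Minimal sketch (hypothetical names): the guardrails live outside the model,
# so a manipulated LLM still cannot execute a high-impact tool call on its own.
from typing import Callable, Dict

ALLOWED_TOOLS: Dict[str, Callable[..., str]] = {
    "search_docs": lambda query: f"results for {query!r}",
    "delete_record": lambda record_id: f"deleted {record_id}",
}
HIGH_IMPACT = {"delete_record"}   # always requires a human in the loop
KILL_SWITCH = False               # flipped by operators, never by the model

def execute_tool_call(name: str, approved_by_human: bool = False, **kwargs) -> str:
    if KILL_SWITCH:
        raise RuntimeError("tool execution disabled by operator kill switch")
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"{name} is not on the tool allow-list")
    if name in HIGH_IMPACT and not approved_by_human:
        raise PermissionError(f"{name} requires explicit human approval")
    return ALLOWED_TOOLS[name](**kwargs)
```

Because none of these checks depend on the model's own judgment, a jailbroken prompt can change what the model asks for, but not what actually runs.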

"Business and technology leaders can't afford to sacrifice safety for speed when embracing AI. The security challenges AI introduces are new and complex, with vulnerabilities spanning models, applications, and supply chains. We have to think differently," Patel told VentureBeat previously.

  • Input validation remains the first line of defense. Enforce strict schemas that define exactly what inputs your LLM endpoints accept. Reject unexpected characters, escape sequences, and encoding variations. Apply rate limits per user and per session. Create structured interfaces or prompt templates that limit free-form text injection into sensitive contexts.

  • Output validation from any LLM or frontier model is a must-have. LLM-generated content passed to downstream systems without sanitization creates classic injection risks: XSS, SQL injection, SSRF, and remote code execution. Treat the model as an untrusted user. Follow OWASP ASVS guidelines for input validation and sanitization.

  • Always separate instructions from data. Use different input fields for system instructions and dynamic user content. Prevent user-provided content from being embedded directly into control prompts. This architectural decision prevents entire classes of injection attacks; a minimal sketch of this pattern, combined with basic input and output validation, follows this list.

  • Think of regular red teaming as the muscle memory you always needed; it’s that essential. The OWASP Gen AI Red Teaming Guide provides structured methodologies for identifying model-level and system-level vulnerabilities. Quarterly adversarial testing should become standard practice for any team shipping LLM-powered features.

  • Control agent permissions ruthlessly. For LLM-powered agents that can take actions, minimize extensions and their functionality. Avoid open-ended extensions. Execute extensions in the user's context with their permissions. Require user approval for high-impact actions. The principle of least privilege applies to AI agents just as it applies to human users.

  • Supply chain scrutiny cannot wait. Vet data and model sources. Maintain a software bill of materials for AI components using tools like OWASP CycloneDX or ML-BOM. Run custom evaluations when selecting third-party models rather than relying solely on public benchmarks.
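Tying the first three items together, here is a minimal sketch, with illustrative names and not a production-ready defense: user input is checked against a strict allow-list schema, kept in a field separate from the system instructions, and the model's reply is escaped before it reaches a browser.

```python
# Minimal sketch: schema-checked input, instructions separated from user data,
# and output treated as untrusted. Names and the regex are illustrative only.
import html
import re

SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."
USER_INPUT_PATTERN = re.compile(r"[\w\s.,?!'-]{1,500}")  # strict allow-list schema

def build_messages(user_text: str) -> list[dict]:
    if not USER_INPUT_PATTERN.fullmatch(user_text):
        raise ValueError("input rejected: unexpected characters or length")
    # Instructions and user data travel in separate fields; user text is never
    # concatenated into the system prompt.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

def render_reply(model_output: str) -> str:
    # Treat the model as an untrusted user: escape before inserting into HTML.
    return html.escape(model_output)
```

None of this replaces model-level hardening; it simply ensures that when the model does fail under persistent attack, the failure doesn't propagate into the rest of the stack.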


