
Inside the State of the IART: The Forces Shaping Frontier AI

Written by Alfonso Feu Desongles | Dec 15, 2025

Frontier AI evolves so quickly that any snapshot risks becoming outdated within weeks. Weekly model releases, new benchmarks, and rapid architectural innovations give the impression of constant upheaval. But beneath this noise, a more stable pattern is emerging: one defined not just by the familiar notion of the “state of the art,” but by the dimensions that truly determine practical AI capability.

These dimensions – and what matters for anyone building with AI – are a set of deeper trends: reinforcement-trained reasoning, agent-centric architectures, breakthroughs in interpretability, and the shifting global dynamics of AI development. These are the forces that shape real-world capability, regardless of which model dominates a leaderboard this month. 

This is why I'm going to discuss the “State of the IART”, which I put forward not simply as a pun, but as a proposal to frame a shift in perspective. Here, I'll map the current frontier, highlight where genuine innovation is happening, and show how these changes translate into practical value for industry.

 

The Frontier in Motion

At this moment, the leading frontier models form a surprisingly tight cluster. OpenAI’s GPT-5, Anthropic’s Claude Opus 4.1 and Sonnet 4, Google’s Gemini 2.5 Pro and Flash Image, xAI’s Grok 4, Meta’s Llama 4, DeepSeek R1 and the Qwen family, and a wave of Chinese models – including Zhipu GLM-4.5, Moonshot Kimi K2, Baidu ERNIE, Tencent Hunyuan, and Baichuan – each represent a different strategic approach, but many achieve similar levels of general-purpose competence.

At the moment of writing, key models include:

Model / Lab | Core Strengths | Notes
GPT-5 (OpenAI) | Broad general intelligence; strong math/science; tool use | Multiple “thinking” variants available via OpenAI & Azure
Claude Opus 4.1 (Anthropic) | Long-context stability; rigorous alignment; document analysis | Favoured in safety-critical or regulated contexts
Gemini 2.5 Pro / Flash Image (Google) | Leading multimodality; advanced video (Veo 3) | Strong enterprise integration via Vertex AI
Grok 4 (xAI) | High reasoning benchmarks (AIME, GPQA) | Tight integration with X ecosystem
Llama 4 (Meta) | Open-weight; private deployment | Strong developer ecosystem
DeepSeek R1 + Qwen/Qwen3 (China) | Reinforcement-trained reasoning; fast iteration | Distilled variants widely adopted
GLM-4.5, Kimi K2, ERNIE, Hunyuan (China) | Rapid progress in multilingual & research tasks | Part of a growing Chinese frontier cluster
Mistral Large 2; Cohere Command R+ | European-hosted; retrieval-heavy workloads | Often chosen for governance or data locality

 

A striking pattern emerges when you look across these models. The United States and China now dominate the top tier, with Meta’s open-weight research and Mistral’s European presence standing as notable alternatives. Despite different training philosophies – closed US models, aggressively open Chinese releases, and hybrid European approaches – the frontier is tightening. Capability gaps shrink more quickly than ever, but the strategic differences in openness, alignment priorities, and deployment environments are growing sharper.

The US–China Race: Open Weights, Imitation, Infrastructure

Taken together, these models form a frontier that is no longer defined purely by technical capability, but by the geopolitical forces behind them, particularly the accelerating divergence between US and Chinese approaches to scale, openness, and deployment.

What distinguishes the US–China competition is as much philosophical as technical. Chinese labs have embraced open-weight releases faster than most analysts expected. DeepSeek R1 is the clearest example: published openly, then distilled into Qwen and Llama backbones. This gives other models part of R1’s reinforcement-trained reasoning ability without retraining from scratch. It is synthetic fine-tuning, one model teaching another.

Ironically, some observers believe DeepSeek itself benefited from synthetic outputs of US models, creating an interesting loop:

closed US models → synthetic data → open Chinese models → global derivatives.
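
To make the mechanics concrete, here is a minimal sketch of how this kind of synthetic fine-tuning data gets produced: a stronger “teacher” model writes out reasoning traces, which become supervised training examples for a smaller “student”. The `ask_teacher` helper below is a hypothetical stand-in for whatever model API you use, not any lab’s actual pipeline.

```python
import json

def ask_teacher(prompt: str) -> str:
    # Placeholder: replace with a call to whichever strong model acts as the teacher.
    return "Step 1: subtract 5 from both sides -> 3x = 15. Step 2: divide by 3 -> x = 5. Final answer: 5"

def build_distillation_set(problems: list[str], out_path: str) -> None:
    """Turn teacher reasoning traces into supervised fine-tuning examples for a student model."""
    with open(out_path, "w", encoding="utf-8") as f:
        for problem in problems:
            trace = ask_teacher(
                "Solve step by step, then state the final answer.\n\n" + problem
            )
            # Each JSONL line is one SFT example: the student learns to imitate the trace.
            f.write(json.dumps({"prompt": problem, "completion": trace}) + "\n")

if __name__ == "__main__":
    build_distillation_set(
        ["If 3x + 5 = 20, what is x?"],  # toy problem for illustration
        "distillation_data.jsonl",
    )
```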

Meanwhile, despite its name, OpenAI keeps its flagship models closed. Chinese labs such as DeepSeek, Alibaba (Qwen), and Baichuan publish weight checkpoints continuously; among the major US labs, Meta is the notable exception.

Infrastructure differences amplify this divide. China is deploying enormous data centres with fewer permitting barriers, plus national compute-sharing initiatives. In the US, progress is slower due to fragmented regulation and energy constraints. Europe is far behind both. Scaling may become the decisive factor.

Are We Hitting an AI Plateau or Pausing to Breathe?

Yet this geopolitical race exposes a deeper question: even with massive compute and rapid iteration, what fuel remains to keep pushing these models forward? With models scaling rapidly, researchers have begun to ask whether the internet itself is running out of high-quality text for training. Epoch and other groups predict scarcity by the late 2020s or early 2030s.

As a result, the industry is being forced to rethink what continued progress looks like.

In practice, progress continues through:

  • Reinforcement learning and process supervision

  • Synthetic data generated by stronger models

  • Multimodal training, reducing dependence on text

  • Retrieval loops that gather fresh domain-specific data

  • Active learning pipelines

Reinforcement learning and process supervision allow models to improve without depending on ever-larger datasets. Synthetic data – generated by frontier models themselves – has become a powerful new training resource. Multimodal training lessens reliance on text entirely, while retrieval and active learning pipelines help models draw selectively from fresh, domain-specific sources.
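
To make one of these concrete, here is a minimal, self-contained sketch of an active-learning selection step: given a confidence score for each example, pick the least confident ones for labelling or fresh retrieval. The confidence function below is a stand-in; in practice it would come from your model’s calibrated probabilities or agreement across samples.

```python
# Minimal active-learning selection: label (or retrieve context for) the examples
# the current model is least confident about. Pure Python; no framework assumed.

def least_confident(examples, confidence_fn, budget):
    """Return the `budget` examples with the lowest model confidence."""
    scored = [(confidence_fn(x), x) for x in examples]
    scored.sort(key=lambda pair: pair[0])          # lowest confidence first
    return [x for _, x in scored[:budget]]

if __name__ == "__main__":
    pool = ["invoice #A12 dispute", "routine password reset", "novel fraud pattern?"]
    # Stand-in confidence function; replace with your model's calibrated scores.
    fake_confidence = {"invoice #A12 dispute": 0.92,
                       "routine password reset": 0.99,
                       "novel fraud pattern?": 0.41}.get
    print(least_confident(pool, fake_confidence, budget=2))
```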

These shifts in how models learn are already reflected in the way we evaluate them. Without recognising that connection, the next developments can seem puzzling. With it, the picture becomes clearer.

Benchmarks: Saturation but Still Some Signal

Some benchmarks have effectively maxed out, while others continue to reveal meaningful gaps. Classic benchmarks like MMLU are saturating; harder successors such as MMLU-Pro restore some discriminative power, but the most informative signal now comes from specialised or real-world evaluations.

GPQA Diamond, which measures graduate-level physics and chemistry reasoning, continues to expose meaningful gaps. AIME remains one of the clearest indicators of mathematical depth. SWE-bench Verified, which tests models on real GitHub issues, shows how quickly agentic systems are improving: accuracy jumped dramatically in just a year.
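
Under the hood, most text benchmarks reduce to the same scoring loop: send each question to the model under test and compare its answer to a reference. A toy exact-match sketch, with a hypothetical `ask_model` stub in place of a real model call, looks like this:

```python
# Toy exact-match evaluation loop, the core of most text benchmarks.
# `ask_model` is a stand-in for however you query the model under test.

def ask_model(question: str) -> str:
    # Placeholder: replace with a real model call.
    return "42"

def exact_match_accuracy(dataset):
    """dataset: list of (question, reference_answer) pairs."""
    correct = sum(
        ask_model(q).strip().lower() == ref.strip().lower() for q, ref in dataset
    )
    return correct / len(dataset)

if __name__ == "__main__":
    toy_set = [("What is 6 x 7?", "42"), ("Capital of France?", "Paris")]
    print(f"accuracy = {exact_match_accuracy(toy_set):.2f}")  # 0.50 with the stub above
```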

However, Humanity’s Last Exam (HLE) stands out as an attempt at a truly saturation-resistant benchmark. It mixes hard problems from many university disciplines with multimodal questions involving diagrams and tables. The important point: frontier models collapse here. While humans with real expertise score around 90%, the best models stay near 30%.

For me, this shows something clear: LLMs are progressing fast in coding, tools, and agent systems, but deep academic reasoning is still far from human-level. This uneven pattern of progress – rapid gains in some areas, stubborn ceilings in others – has reshaped how researchers talk about AGI and where real capability lies today.

From AGI Dreams to Practical Super-Capability

Benchmark behaviour underscores a key reality: progress is not uniform. Some abilities plateau quickly; others advance rapidly. This inconsistency has pushed the AGI conversation into more grounded territory. Instead of debating when AI might display “general” intelligence, the field is focusing on areas where systems are already becoming super-capable.

Two trends reinforce each other:

  1. Reinforcement-trained reasoning (o-series, DeepSeek R1) – Better planning, better code, fewer mistakes.

  2. AI optimising AI (AlphaEvolve, GPU kernel optimisation, tool integration) – Systems improving themselves.

The second trend deserves emphasis: AI optimising AI means systems that can improve their own GPU kernels or orchestrate complex workflows. Google’s playful “banana” branding around Flash Image may look like a gimmick, but beneath the surface it represents real improvements in controllable, stable generation across modalities.

But raw capability alone doesn’t explain the most important shift underway. Increasingly, the biggest transformations aren’t coming from the models themselves, but from the systems built around them. In other words: agents.

The New Trend: Agentic Systems

If last year was about prompting, this year is about orchestration. Frontier labs now share a new philosophy: the focus is no longer the model alone, but the agent system wrapped around it.

Modern agents behave less like chatbots and more like autonomous workers.

The key ingredients are:

  • Tool execution – running code, shell commands, API calls, browsers

  • Planning graphs – multi-step reasoning, task decomposition, dependency graphs

  • Self-reflection loops – rewriting their own plans, checking assumptions

  • Verification systems – self-consistency, compiler-like passes, test generation

  • Multimodal memory – not just text, but images, diagrams, embeddings

  • Long-running sessions – keeping state for hours or days

This is the foundation of systems like:

OpenAI Codex and Codex-Max

The successor to Codex not only generates code but executes it, generates unit tests, and maintains an internal abstract syntax tree (AST) of the project to track changes. This AST acts as the agent’s “internal map” of the codebase, helping it avoid drift and maintain stable context.

Anthropic Claude Code

Claude Code uses a highly stable internal planning loop. It maintains “working context” by compressing progress into structured summaries that behave like internal memory slots. Claude is exceptional at avoiding drift thanks to its Constitutional self-critique and precise code-diff reasoning.

Google Antigravity

Antigravity is Google’s most ambitious experiment: an environment where the model executes code, maintains working memory, and checks itself in cycles. Its context management is based on implicit working memory graphs, not raw tokens. This lets it keep projects in its head for hours without losing coherence.
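
Stripped of vendor specifics, these systems share the same skeleton: a loop that plans the next action, calls a tool, records the observation, and reflects before acting again. The sketch below is a deliberately minimal, framework-free illustration of that loop; `call_llm` and the tool registry are hypothetical placeholders, not any product’s API.

```python
import json

def call_llm(messages):
    # Placeholder: return a JSON "decision" from whichever model you use.
    # A real implementation would send `messages` to a chat endpoint here.
    return json.dumps({"action": "finish", "output": "done"})

TOOLS = {
    "run_python": lambda code: f"(pretend we executed: {code!r})",
    "search_docs": lambda query: f"(pretend results for: {query!r})",
}

def run_agent(goal: str, max_steps: int = 8) -> str:
    history = [{"role": "user", "content": f"Goal: {goal}"}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(history))            # 1. plan the next action
        if decision["action"] == "finish":
            return decision["output"]                       # 4. stop when the goal is met
        observation = TOOLS[decision["action"]](decision.get("input", ""))   # 2. execute a tool
        history.append({"role": "tool", "content": observation})            # 3. reflect on the result
    return "step budget exhausted"

if __name__ == "__main__":
    print(run_agent("summarise the open GitHub issues"))
```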

How Agents Maintain Context Without Drifting

Modern agents use several techniques (a minimal sketch of the compress-and-snapshot pattern follows the list):

  • Internal AST representation for code (Codex, Claude Code, Gemini Agents)

  • State compression into “workspace embeddings” updated each step

  • Self-consistency passes (multi-sample verification)

  • Compiler-style checking after each action

  • Deterministic execution logs to maintain alignment with ground truth

  • Project state snapshots written and reloaded during long sessions
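
The compression and snapshot ideas are the easiest to picture. In the sketch below, a hypothetical `summarise` helper stands in for a model call: once the raw step history grows past a budget, older steps are folded into a compact summary that is carried forward, and the whole state can be written to disk and reloaded in a later session.

```python
# Sketch of context compaction + state snapshots for a long-running agent.
# `summarise` is a stand-in for a model call that condenses old steps.

import json

def summarise(steps):
    # Placeholder: a real agent would ask the model to compress these steps.
    return f"[{len([s for s in steps if s])} items condensed into a summary]"

class AgentState:
    def __init__(self, keep_recent: int = 20):
        self.summary = ""          # compressed memory of everything old
        self.recent = []           # verbatim record of the latest steps
        self.keep_recent = keep_recent

    def record(self, step: str) -> None:
        self.recent.append(step)
        if len(self.recent) > self.keep_recent:
            old, self.recent = self.recent[:-5], self.recent[-5:]
            self.summary = summarise([self.summary, *old])  # fold old steps into the summary

    def snapshot(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"summary": self.summary, "recent": self.recent}, f)

    @classmethod
    def restore(cls, path: str) -> "AgentState":
        state = cls()
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        state.summary, state.recent = data["summary"], data["recent"]
        return state
```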

In 2023, an LLM was simply a next-token predictor. In 2025, an LLM is an autonomous worker with tools, memory, and goals. These developments push the field toward questions of safety and reliability. As agents take on more autonomy, the need to understand, not just observe, their internal behaviour becomes increasingly urgent, which leads naturally into interpretability.

Interpretability: Peeking Inside the Black Box

Interpretability has moved from academic curiosity to practical necessity. Anthropic has led the most visible breakthroughs in mechanistic interpretability. In their study Mapping the Mind of a Large Language Model, they showed how Claude encodes millions of “features”: patterns of neuron activation that correspond to recognisable concepts such as the Golden Gate Bridge. Using a compute-intensive dictionary-learning method (training sparse autoencoders on internal activations), they could extract these features and trace circuits that reveal how the model anticipates future tokens.

In another paper, Tracing Thoughts in Language Models, they demonstrated that Claude often plans ahead, for example, predicting rhyme words before they appear, something that looks like a microscope view into AI cognition. These findings suggest that emergent behaviours – sycophancy, malevolent tendencies, or biases – can be anticipated and potentially controlled by adjusting data exposure or by directly intervening in neural activations. More detailed work, such as Scaling Monosemanticity, pushes toward extracting interpretable features systematically at scale.
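
For intuition, the heart of the dictionary-learning approach can be caricatured in a few lines: train a sparse autoencoder on a model’s internal activations so that each learned direction (a “feature”) fires for only a narrow set of inputs. The sketch below is a toy PyTorch illustration with random data standing in for real activations, not Anthropic’s actual setup.

```python
# Toy sparse autoencoder over "activations" (random data here), illustrating
# the dictionary-learning idea behind interpretable features. Not a real setup.

import torch
import torch.nn as nn

d_model, n_features = 256, 2048          # activation width, dictionary size
acts = torch.randn(10_000, d_model)      # stand-in for captured residual-stream activations

encoder = nn.Linear(d_model, n_features)
decoder = nn.Linear(n_features, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (512,))]
    features = torch.relu(encoder(batch))          # sparse, non-negative feature activations
    recon = decoder(features)                      # reconstruct the original activation
    loss = ((recon - batch) ** 2).mean() + 1e-3 * features.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, the columns of decoder.weight are candidate "feature directions"
# you can inspect by finding the inputs that activate each one most strongly.
```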

At the same time, OpenAI has investigated one of the most pressing interpretability issues: hallucinations. In their recent paper Why Language Models Hallucinate they argue that the root cause lies not only in architecture but in the incentives of training and evaluation. Models are rewarded for giving an answer even when uncertain, and penalised if they abstain. This overconfidence, reinforced during pretraining and benchmark testing, leads to fluent but false outputs. As OpenAI puts it, the system is trained to “guess rather than admit ignorance.”
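
The incentive argument can be checked with simple arithmetic: under standard accuracy grading, an uncertain model maximises its expected score by guessing, and only a scheme that penalises confident errors or rewards abstention makes “I don’t know” rational. The numbers below are illustrative, not taken from the paper.

```python
# Expected score of guessing vs abstaining under two grading schemes.
# Numbers are illustrative only; the point is the incentive, not the values.

def expected_score(p_correct, reward_right, penalty_wrong, reward_abstain):
    guess = p_correct * reward_right + (1 - p_correct) * penalty_wrong
    return {"guess": guess, "abstain": reward_abstain}

# Standard benchmark grading: 1 point if right, 0 if wrong, 0 for "I don't know".
print(expected_score(p_correct=0.3, reward_right=1, penalty_wrong=0, reward_abstain=0))
# -> guessing scores 0.3 vs 0.0 for abstaining, so the model is trained to guess.

# Grading that penalises confident errors and rewards calibrated abstention.
print(expected_score(p_correct=0.3, reward_right=1, penalty_wrong=-1, reward_abstain=0.2))
# -> guessing now scores -0.4 vs 0.2, so abstaining becomes the rational choice.
```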

The insight is striking because it complements Anthropic’s circuit-level view: where Anthropic maps the “where” and “how” of internal reasoning, OpenAI explains the “why”, the reason hallucinations persist under current training regimes. Together, these efforts bring the field closer to reliable detection and mitigation of unsafe behaviour. The shared goal is clear: by 2027, interpretability should allow developers not only to observe but to control the internal dynamics of frontier models, reducing hallucinations and aligning reasoning more directly with truth.

Beyond Transformers: New Architectures and Efficiency

For the past seven years, almost every major AI breakthrough has been built on the same underlying architecture: the transformer, a neural network design introduced by Google in 2017. Transformers made it possible for models to analyse text in parallel rather than sequentially, learn long-range patterns, and scale to billions or trillions of parameters. GPT-5, Claude Opus, Gemini 2.5 Pro, Llama 4, and most other frontier systems are all transformer-based.

But as interpretability research exposes the growing complexity of these models, and as their computational demands continue to climb, researchers are exploring new architectures that promise greater efficiency, stability, and transparency. Two lines are moving fast: State Space Models (SSMs) and latent-space modelling.

SSMs reduce the quadratic cost of attention and keep a long memory with linear time. Mamba introduced selective SSMs and showed strong results across text, audio, and genomics with linear scaling in sequence length. RWKV keeps recurrent inference while training in parallel like a transformer, reaching competitive quality at lower cost. RetNet from Microsoft adds a retention mechanism with parallel, recurrent, and chunkwise modes, targeting O(1) inference per step and efficient long context. Hybrids are practical now: Jamba interleaves Mamba and transformer layers (plus MoE) to get long-context throughput with transformer-level quality; weights and paper are public. These families are not replacing transformers everywhere yet, but they already make sense for long sequences, streaming, and edge scenarios.
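
The appeal of state space models comes down to a simple recurrence: instead of attending over every previous token (quadratic in sequence length), the model carries a fixed-size hidden state forward one step at a time. A bare-bones, untrained version of that recurrence is sketched below; selective SSMs like Mamba make the transition matrices input-dependent and learn them, which this toy omits.

```python
# Bare-bones linear state-space recurrence: O(sequence_length) time and
# constant-size state, versus attention's O(sequence_length^2). Untrained toy.

import numpy as np

d_state, d_in, seq_len = 16, 8, 1000
A = np.eye(d_state) * 0.9            # state transition (here: simple decay)
B = np.random.randn(d_state, d_in) * 0.1
C = np.random.randn(d_in, d_state) * 0.1

x = np.random.randn(seq_len, d_in)   # input sequence
h = np.zeros(d_state)                # fixed-size memory carried across the sequence
outputs = []
for t in range(seq_len):
    h = A @ h + B @ x[t]             # update the hidden state: one step, constant cost
    outputs.append(C @ h)            # read out the current output

y = np.stack(outputs)                # shape (seq_len, d_in)
print(y.shape)
```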

Latent-space approaches attack efficiency from another angle. Instead of predicting tokens one by one, the model reasons in an abstract representation space. JEPA (LeCun/Meta) learns by predicting missing parts of images or video in that representation space rather than in raw pixels; I-JEPA introduces the core idea and V-JEPA extends it to video. Meta’s Large Concept Model (LCM) goes further: it models sequences in a sentence-embedding space and reports multilingual gains over same-size token LLMs; code and paper are available. The promise is simple: higher-level units → less data and energy per capability point, and better transfer across languages and modalities.
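
The latent-space idea can likewise be reduced to its essentials: encode the visible context, encode the hidden target with a separate encoder, and train a predictor to match the target in embedding space rather than reconstructing raw pixels or tokens. The sketch below is a toy, single-step illustration of that objective, not Meta’s actual JEPA training recipe.

```python
# Toy JEPA-style objective: predict the *embedding* of the masked part,
# not its raw pixels/tokens. Single step, random data; illustration only.

import torch
import torch.nn as nn

d_in, d_emb = 128, 64
context_encoder = nn.Linear(d_in, d_emb)
target_encoder = nn.Linear(d_in, d_emb)   # in practice an EMA copy, not trained directly
predictor = nn.Linear(d_emb, d_emb)

context = torch.randn(32, d_in)           # visible part of the input
target = torch.randn(32, d_in)            # masked part whose embedding we must predict

pred = predictor(context_encoder(context))
with torch.no_grad():                     # target encoder provides a fixed regression target
    goal = target_encoder(target)

loss = nn.functional.mse_loss(pred, goal) # regression in representation space
loss.backward()                           # gradients flow to context encoder + predictor
print(float(loss))
```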

Efficiency at inference also advances fast. FlashAttention-3 squeezes H100/H200 GPUs with asynchrony and low precision, giving ~1.5–2× speedups and strong FP8 support. Speculative decoding removes sequential bottlenecks by drafting multiple tokens and verifying them; Medusa does this with extra heads on a single model, reaching >2× speedups in tests. On the serving side, vLLM/PagedAttention fixes KV-cache fragmentation so you batch more requests with the same memory budget, often 2–4× throughput vs. older stacks. Combined, these ideas cut latency, raise throughput, and reduce cost per token, which is useful whether you run transformers, Mamba hybrids, or concept-space models.
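
Speculative decoding, for example, is conceptually simple: a small draft model proposes several tokens cheaply, and the large model verifies them, keeping the longest agreeing prefix. The sketch below uses stub models and greedy verification; real implementations check all draft positions in a single forward pass and use a probabilistic acceptance rule so the target model’s output distribution is preserved exactly.

```python
# Greedy speculative decoding with stub models: the draft proposes k tokens,
# the target verifies them and keeps the longest matching prefix.

def draft_next(tokens, k=4):
    # Stub "small" draft model: cheaply propose the next k tokens.
    return [f"d{len(tokens) + i}" for i in range(k)]

def target_next(tokens):
    # Stub "large" target model: the token it would actually emit for this prefix.
    return f"d{len(tokens)}" if len(tokens) % 5 else f"t{len(tokens)}"

def speculative_decode(prompt_tokens, max_new=12, k=4):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new:
        proposal = draft_next(tokens, k)           # draft k tokens cheaply
        for tok in proposal:                       # target checks them (one pass in real systems)
            correct = target_next(tokens)
            if tok == correct:
                tokens.append(tok)                 # agreement: draft token accepted for free
            else:
                tokens.append(correct)             # mismatch: keep the target's token, discard the rest
                break
    return tokens

print(speculative_decode(["<s>"]))
```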

Practical Implications for Industry

In a field that evolves as quickly as frontier AI, trying to define the “state of the art” has almost become a contradiction. The details shift too fast. But what does remain stable, and what this article has traced, is the underlying pattern of change: where models are getting stronger, where they still fail, and how the surrounding systems, architectures, and global dynamics shape what comes next.

If there is such a thing as a “State of the IART”, it is not a taxonomy but a reminder. A reminder that intelligence alone is not the story; that capability must be matched with reliability; that reasoning strength varies wildly across tasks; and that real value emerges only when these systems can act predictably in the world through tools, workflows, and guardrails. These are the dimensions along which progress becomes meaningful for organisations, not just impressive on paper.

For industry, this means shifting attention away from leaderboard races and toward architectural decisions: which models integrate safely into existing processes, which agent systems behave consistently under load, which deployments meet governance requirements, and which capabilities map to real business outcomes. Frontier AI is no longer something to watch from a distance. It is a set of evolving design choices that determine how organisations work, build, and compete.

What We Are Doing

At Mimacom, we work on safe agentic systems that can call tools, execute code, and integrate directly with enterprise stacks.

Our priorities are clear:

  • Evaluation over hype. We reproduce business KPIs with candidate models, test robustness to prompt changes, and measure cost against quality.
  • Regulated contexts. We rely on models that support process supervision and verifiable tool calls.
  • Deployment flexibility. For on-premise needs, we choose open weights like Llama or Qwen with accelerators. For cloud, we combine Azure/OpenAI, Vertex/Gemini, or Bedrock/Claude, depending on locality and procurement.

And now we go further with Flowable AI Studio, a new solution that extends Flowable’s process automation with integrated AI capabilities. It allows enterprises to design, orchestrate, and monitor AI-powered workflows in a governed environment. This means moving beyond experiments and pilots, so that AI becomes part of reliable, auditable business processes.

If you want to learn more about how Mimacom and Flowable AI Studio can bring these frontier AI capabilities into your organisation, contact us or explore our latest resources and industry solutions.