A new decentralized framework from Stanford replaces the traditional central-controller model for multi-agent AI systems, cutting per-task costs roughly in half on software engineering benchmarks.

A Stanford research framework called DeLM — short for decentralized language model — challenges one of the foundational assumptions of modern multi-agent AI systems: that a central orchestrator is required to coordinate agent activity. [1] According to the researchers, eliminating that central controller can cut per-task costs by roughly 50% while improving accuracy. [1]

In a conventional centralized multi-agent setup, a main agent breaks work into subtasks, dispatches them to sub-agents, waits for results, then merges and rebroadcasts context before issuing the next round of instructions. [1] The Stanford team argues this architecture scales poorly: as the number of subtasks grows, the central controller becomes a communication and integration bottleneck, and may “dilute, omit, or distort” useful information in the process. [1]

DeLM replaces that hub-and-spoke model with three components: parallel agents, a shared context store, and a task queue. [1] The shared context holds compressed information summaries — called “gists” — that include verified findings, partial findings, and documented failures, along with pointers to detailed evidence agents can retrieve on demand. [1] Agents write results directly into this shared store rather than routing them through a central controller. [1]
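To make the shared-store idea concrete, here is a minimal sketch of what such a store could look like. It is not DeLM's code: the names `Status`, `Gist`, `SharedContext`, `publish`, `read`, and `fetch_evidence` are hypothetical, and the evidence "pointer" is assumed to be a simple dictionary key.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    VERIFIED = "verified"   # verified finding
    PARTIAL = "partial"     # partial finding
    FAILED = "failed"       # documented failure


@dataclass(frozen=True)
class Gist:
    text: str           # short compressed summary
    status: Status      # which kind of finding this is
    evidence_key: str   # pointer to detailed evidence, fetched only on demand


@dataclass
class SharedContext:
    gists: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def publish(self, text, status, raw_evidence):
        """Write a result directly into the store, with no controller in between."""
        with self._lock:
            key = f"ev-{len(self.evidence)}"
            self.evidence[key] = raw_evidence
            gist = Gist(text, status, key)
            self.gists.append(gist)
            return gist

    def read(self):
        """Return a snapshot of all gists (summaries only, so cheap to read)."""
        with self._lock:
            return list(self.gists)

    def fetch_evidence(self, gist):
        """Follow a gist's pointer to the detailed evidence behind it."""
        with self._lock:
            return self.evidence[gist.evidence_key]


ctx = SharedContext()
g = ctx.publish("parser rejects empty input", Status.VERIFIED, "full traceback ...")
ctx.publish("patching tokenizer did not help", Status.FAILED, "diff and test log ...")
print([x.status.value for x in ctx.read()])
print(ctx.fetch_evidence(g))
```

The lock is there because several agents are assumed to write concurrently; the article does not say how DeLM handles concurrent writes.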

The pipeline works in five stages: inputs are broken into work units and queued; agents pull tasks and read shared context in parallel; results are compressed into gists and verified before being shared; a final agent checks whether additional work is needed; and that agent returns the answer once no further steps are required. [1]
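The five stages can be sketched as a toy program. This is an illustration under stated assumptions, not the researchers' implementation: the "agents" are plain threads that count a keyword in documents, `verify` is a stand-in check, and `run_pipeline`, `max_rounds`, and the gist layout are hypothetical names.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(documents, keyword, workers=3, max_rounds=5):
    tasks = queue.Queue()
    shared = {}              # shared context: doc_id -> verified gist
    lock = threading.Lock()

    # Stage 1: break the input into work units and queue them.
    for doc_id in documents:
        tasks.put(doc_id)

    def verify(gist):
        # Stand-in check; a real verifier would test the claim itself.
        return gist["doc"] in documents and gist["count"] >= 0

    def agent():
        while True:
            # Stage 2: pull a task and read the shared context.
            try:
                doc_id = tasks.get_nowait()
            except queue.Empty:
                return
            with lock:
                if doc_id in shared:   # another agent already covered this unit
                    continue
            # Stage 3: compress the result into a gist, verify it, then share it.
            gist = {"doc": doc_id, "count": documents[doc_id].count(keyword)}
            if verify(gist):
                with lock:
                    shared[doc_id] = gist

    for _ in range(max_rounds):
        # Leaving the "with" block waits for every agent to finish.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for _ in range(workers):
                pool.submit(agent)
        # Stage 4: a final check decides whether more work is needed.
        missing = [d for d in documents if d not in shared]
        if not missing:
            break
        for d in missing:
            tasks.put(d)

    # Stage 5: return the answer once nothing is left to do.
    return sum(g["count"] for g in shared.values())


docs = {"a": "bug bug fix", "b": "no issue", "c": "bug"}
total = run_pipeline(docs, "bug")
print(total)
```

Note that no thread hands work to another: each one pulls from the queue and writes to the shared dictionary, which is the property the decentralized design relies on.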

A key design feature is what the researchers call “unfolding” — agents see short gists by default but can expand them into fuller summaries or raw evidence when needed. [1] Co-developer Yuzhen Mao explained the tradeoff: “If agents shared full traces, each worker would need to read long command histories, file dumps, failed edits, and intermediate reasoning, turning coordination itself into another long-context bottleneck.” [1] Conversely, sharing only compact summaries risks losing critical details. [1] The opt-in unfolding mechanism is designed to balance cost and accuracy. [1]
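A rough sketch of opt-in unfolding, assuming three fixed levels (gist, summary, raw evidence) and using word count as a crude proxy for token cost. The class `FoldedGist` and the function `context_cost` are hypothetical names chosen for illustration; the article does not describe DeLM's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldedGist:
    gist: str       # level 0: short gist, shown by default
    summary: str    # level 1: fuller summary
    evidence: str   # level 2: raw evidence (logs, diffs, traces)

    def unfold(self, level=0):
        """Return the view at the requested depth; deeper views cost more to read."""
        views = (self.gist, self.summary, self.evidence)
        if not 0 <= level < len(views):
            raise ValueError(f"level must be 0..{len(views) - 1}")
        return views[level]


def context_cost(gists, levels):
    """Words an agent reads, given a map of gist index -> unfold level (default 0)."""
    return sum(len(g.unfold(levels.get(i, 0)).split()) for i, g in enumerate(gists))


g1 = FoldedGist(
    "null check missing in parse",
    "parse crashes on empty input because tokens is None",
    "Traceback: parse.py line 40 in parse tokens[0] TypeError NoneType is not subscriptable",
)
g2 = FoldedGist(
    "cache patch failed",
    "disabling the cache did not change test results",
    "pytest log: 3 failed 0 passed after cache disabled",
)

folded = context_cost([g1, g2], {})       # every gist left folded
partial = context_cost([g1, g2], {0: 2})  # only the first gist fully unfolded
print(folded, partial)
```

The point of the sketch is that an agent pays for detail only on the items it chooses to expand, while everything else stays at gist size.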

The framework also addresses a specific inefficiency in parallel agent runs: when one agent pursues a dead-end reasoning path, that failure is normally invisible to other agents, which may then repeat the same mistake. [1] With DeLM, failed hypotheses are written into shared context so that, as Mao noted, “later agents can read them as constraints, avoid repeated exploration, and redirect their search toward more promising fixes.” [1]
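The effect of sharing dead ends can be shown with a small sketch. It assumes hypotheses are plain strings and that the shared context exposes failures as a set; `explore`, `is_fix`, and the example hypotheses are hypothetical, and the agents run one after another here for clarity rather than in parallel.

```python
def explore(hypotheses, is_fix, failed):
    """Try hypotheses in order, skipping any already recorded as failed.

    `failed` is the shared set of dead ends and is updated in place.
    Returns (winning hypothesis or None, number of attempts actually made).
    """
    attempts = 0
    for h in hypotheses:
        if h in failed:
            continue          # treat a recorded failure as a constraint
        attempts += 1
        if is_fix(h):
            return h, attempts
        failed.add(h)         # document the dead end for later agents
    return None, attempts


candidates = ["bump timeout", "clear cache", "guard null input"]


def is_fix(hypothesis):
    # Stand-in for running the test suite against a candidate patch.
    return hypothesis == "guard null input"


shared_failed = set()
first = explore(candidates[:2], is_fix, shared_failed)   # finds two dead ends
second = explore(candidates, is_fix, shared_failed)      # skips both of them
isolated = explore(candidates, is_fix, set())            # no shared failures
print(first, second, isolated)
```

With the shared set, the second agent reaches the fix in one attempt; an agent with no view of earlier failures needs three.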

On SWE-bench Verified, a benchmark that evaluates AI systems on real-world software engineering problems, DeLM performed 10.5% better than the strongest baseline and reduced cost per task by roughly 50%. [1] On LongBench-v2 Multi-Doc question answering, which tests long-context reasoning across multiple documents, DeLM achieved the highest accuracy across four model families: GPT-5.4, Claude Sonnet, Gemini Flash, and DeepSeek-V4-Pro. [1]

The framework is described as particularly suited to software engineering test-time scaling (where models are given additional compute at inference time to improve results) as well as long-context tasks where multiple agents can examine separate evidence clusters while maintaining a shared global view of accumulated findings. [1]

For developers and enterprise teams building multi-agent workflows, the practical implication is that the central-orchestrator pattern, while intuitive, carries measurable costs in both inference spend and coordination latency that a shared-state, decentralized design may avoid. [1]


Sources

  1. VentureBeat — Stanford's DeLM cuts multi-agent task costs 50% — without a central orchestrator

This article was drafted with AI from the cited sources and checked against them before publication.