A new framework from Alibaba researchers decomposes complex tasks iteratively so AI agents fetch only the tools they need, slashing context costs from ~884,000 tokens to ~1,160 per query.

Alibaba researchers have published SkillWeaver, a framework that routes AI agents to the right tools through iterative task decomposition rather than loading an entire tool library into context, reducing token consumption by over 99% compared to brute-force approaches ^[1].

For developers building enterprise agents that orchestrate dozens or hundreds of tools, the practical stakes are direct: fewer tokens mean lower API costs and faster responses, and the accuracy gains address a failure mode that currently makes large tool libraries nearly unusable ^[1].

The problem with loading every tool

Modern large language model (LLM) agents are increasingly connected to massive tool ecosystems, including systems built on the Model Context Protocol (MCP), a standard for connecting agents to external services ^[1]. When an agent has to handle a multi-step request like “Download the dataset, transform it, and create visual reports,” exposing the full tool library to the LLM so it can pick the right tools is, as the researchers describe it, “highly inefficient” and “quickly overwhelms context limits” ^[1].

Most existing frameworks treat tool selection as a one-shot, single-skill problem, which breaks down when real business queries require sequencing multiple tools in a compatible chain ^[1].

How SkillWeaver works

SkillWeaver structures tool selection across three stages: Decompose, Retrieve, and Compose ^[1]. First, an LLM breaks a complex user query into atomic sub-tasks, each requiring exactly one tool ^[1]. An embedding model then searches the tool library to produce a shortlist of candidates for each sub-task ^[1]. Finally, a planner checks inter-tool compatibility — whether the output of one tool feeds cleanly into the next — and assembles the result as a Directed Acyclic Graph (DAG), allowing independent steps to run in parallel ^[1].
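
The Compose stage is the easiest of the three to picture in code. The sketch below is not the authors' implementation: the tool names, their input and output types, and the rule that two tools are compatible when one's output type equals the other's input type are all assumptions made for illustration. It builds a DAG from those compatibility checks and groups tools into batches that could run in parallel, using Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical tools with declared input/output types (illustrative names only).
TOOLS = {
    "fetch_dataset": {"input": "url", "output": "csv"},
    "clean_table": {"input": "csv", "output": "table"},
    "plot_charts": {"input": "table", "output": "png"},
    "summarize_table": {"input": "table", "output": "text"},
}


def compatible(upstream: str, downstream: str) -> bool:
    """Assumed rule: a tool can follow another if it accepts what that tool emits."""
    return TOOLS[upstream]["output"] == TOOLS[downstream]["input"]


def build_dag(tools: list[str]) -> dict[str, set[str]]:
    """Map each tool to the set of tools whose output it can consume."""
    return {
        tool: {other for other in tools if other != tool and compatible(other, tool)}
        for tool in tools
    }


def parallel_batches(dag: dict[str, set[str]]) -> list[list[str]]:
    """Group tools into batches; tools in one batch do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


dag = build_dag(list(TOOLS))
batches = parallel_batches(dag)
print(batches)
```

In this toy library the plotting and summarizing tools both depend only on the cleaned table, so they land in the same final batch and could be dispatched concurrently.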

The central innovation is a feedback mechanism the team calls Skill-Aware Decomposition (SAD) ^[1]. Rather than decomposing a task once and searching, SAD has the LLM draft an initial plan, runs a preliminary search to surface loosely matching tools, and feeds those results back to the LLM so it can rewrite its decomposition using the actual vocabulary of the available tools ^[1]. The authors shared the prompt templates in their paper, and the technique can be reproduced with standard libraries such as LangChain, LlamaIndex, or plain Python ^[1].
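
The draft, search, rewrite loop can be sketched in plain Python. Everything named here is hypothetical: `skill_aware_decompose`, the prompt wording, and the `fake_llm` and `fake_search` stand-ins are illustrative assumptions, not the paper's templates or code. A real setup would replace the two stand-ins with a model call and an embedding search.

```python
from typing import Callable


def skill_aware_decompose(
    query: str,
    llm: Callable[[str], list[str]],
    search: Callable[[str, int], list[str]],
    top_k: int = 3,
) -> list[str]:
    """Draft a plan, look up loosely matching tools, then redraft with them in view."""
    # Step 1: unguided draft. The prompt text is a placeholder, not the paper's template.
    draft = llm(f"Split into single-tool sub-tasks:\n{query}")

    # Step 2: preliminary search, collecting candidate tool names for each sub-task.
    hints: list[str] = []
    for sub_task in draft:
        for tool in search(sub_task, top_k):
            if tool not in hints:
                hints.append(tool)

    # Step 3: feed the candidates back so the rewrite can use the library's vocabulary.
    prompt = (
        f"Split into single-tool sub-tasks:\n{query}\n"
        f"Available tools that may be relevant: {', '.join(hints)}"
    )
    return llm(prompt)


def fake_llm(prompt: str) -> list[str]:
    """Stand-in for a model call: gives a vaguer plan when no tools are listed."""
    if "Available tools" in prompt:
        return ["fetch_dataset", "clean_table", "plot_charts"]
    return ["get the data", "make reports"]


def fake_search(sub_task: str, top_k: int) -> list[str]:
    """Stand-in for embedding retrieval: returns the first top_k library entries."""
    library = ["fetch_dataset", "clean_table", "plot_charts"]
    return library[:top_k]


plan = skill_aware_decompose(
    "Download the dataset, transform it, and create visual reports",
    fake_llm,
    fake_search,
)
print(plan)
```

The stand-in model under-decomposes on its first pass (two steps) and produces three tool-aligned steps once it sees candidate names, which mirrors the step-count correction the article describes.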

Benchmark results

The team evaluated SkillWeaver on CompSkillBench, a custom benchmark of 300 multi-step queries run against a library of 2,209 real-world skills drawn from the public MCP ecosystem across 24 categories including cloud infrastructure, finance, and databases ^[1].

Without SAD, the 7-billion-parameter Qwen2.5-7B-Instruct model predicted the correct number of required steps only 51.0% of the time; enabling SAD raised that to 67.7%, and switching to the larger Qwen-Max model raised it to 92% ^[1]. On hard tasks requiring four to five tools, SAD improved accuracy by 50% ^[1].

One counterintuitive finding: a 14-billion-parameter model performed worse than the 7B model in the unguided setup because it tended to over-decompose tasks into unnecessary micro-steps ^[1]. SAD corrected this by anchoring the larger model to the actual tool vocabulary, suggesting that tool alignment can matter more than raw model size ^[1].

The token savings were stark. The brute-force baseline, which feeds all tool names into a large model’s prompt, consumed an estimated 884,000 tokens per query and still retrieved the correct tool category only 21.1% of the time ^[1]. SkillWeaver’s targeted approach cut consumption to roughly 1,160 tokens per query, a 99.9% reduction, while also achieving substantially higher accuracy ^[1]. A ReAct-style agent baseline achieved 0% decomposition accuracy, collapsing multi-step plans into isolated actions ^[1].
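
The reduction figure follows directly from the two reported token counts. A quick check, using only the numbers quoted above:

```python
baseline_tokens = 884_000    # brute-force baseline, per query
skillweaver_tokens = 1_160   # SkillWeaver, per query

reduction = 1 - skillweaver_tokens / baseline_tokens
ratio = baseline_tokens / skillweaver_tokens

print(f"{reduction:.1%} fewer tokens, about {ratio:.0f}x smaller")
# prints "99.9% fewer tokens, about 762x smaller"
```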

On retrieval infrastructure, the researchers used the open-source all-MiniLM-L6-v2 embedding model with a FAISS index ^[1]. Indexing all 2,209 skills took 15 seconds, and retrieval adds under 15 milliseconds of latency per query ^[1]. Swapping in a stronger encoder (BGE-base-en-v1.5) improved accuracy without any fine-tuning ^[1].
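
The retrieval step amounts to a nearest-neighbor lookup over embedded skill descriptions. The sketch below shows only that idea. It deliberately swaps in a toy bag-of-words vector and an exhaustive cosine-similarity scan for the all-MiniLM-L6-v2 encoder and FAISS index the researchers used, so it runs with the standard library alone. The skill names and descriptions are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical skill descriptions; the benchmark library holds 2,209 entries.
SKILLS = {
    "fetch_dataset": "download a dataset file from a url",
    "clean_table": "transform and clean tabular data",
    "plot_charts": "create visual reports and charts from data",
}

# "Indexing" here just means embedding every description once, up front.
INDEX = {name: embed(description) for name, description in SKILLS.items()}


def shortlist(sub_task: str, top_k: int = 1) -> list[str]:
    """Return the top_k skills whose descriptions best match the sub-task."""
    query_vec = embed(sub_task)
    ranked = sorted(INDEX, key=lambda name: cosine(query_vec, INDEX[name]), reverse=True)
    return ranked[:top_k]


print(shortlist("download the dataset"))
print(shortlist("create visual reports"))
```

Word overlap only matches exact tokens, which is the gap a learned encoder closes. A vector index serves the same purpose as the scan here but is built to stay fast as the library grows.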

Limitations and what’s not included

The framework currently has no error-recovery mechanism ^[1]. If a tool call fails mid-chain, the entire execution breaks, and the authors acknowledge that practitioners will need to build their own retry and fallback logic on top of the compose stage ^[1].
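
One way such retry and fallback logic might look is sketched below. It is not part of SkillWeaver: `call_with_fallback` and both plotter functions are hypothetical, and the sketch assumes the retrieval shortlist for a sub-task can double as an ordered list of fallback tools.

```python
from typing import Callable


def call_with_fallback(
    candidates: list[Callable[[str], str]],
    payload: str,
    retries: int = 2,
) -> str:
    """Try each candidate tool in order, retrying a failing tool before moving on."""
    last_error = None
    for tool in candidates:
        for _ in range(retries):
            try:
                return tool(payload)
            except Exception as error:  # a real system would catch narrower errors
                last_error = error
    raise RuntimeError("all candidate tools failed") from last_error


def flaky_plotter(data: str) -> str:
    """Hypothetical tool that always times out."""
    raise TimeoutError("service unavailable")


def backup_plotter(data: str) -> str:
    """Hypothetical second-ranked tool for the same sub-task."""
    return f"chart({data})"


result = call_with_fallback([flaky_plotter, backup_plotter], "sales.csv")
print(result)
```

The first tool fails on both attempts, so the call falls through to the backup and the chain can continue instead of breaking at that step.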

The source code has not yet been released, though the researchers note the system was built entirely from off-the-shelf components and the SAD prompt templates are available in the paper ^[1].