The MIT-licensed DSpark system uses speculative decoding to accelerate large language model inference and has been tested on DeepSeek's own frontier models as well as Qwen and Gemma.

DeepSeek has released DSpark, an open-source framework designed to accelerate large language model (LLM) inference — the process of generating text responses — without altering the underlying model’s outputs. [1] The release, published over the weekend of June 28–29, 2026, includes a technical paper, model checkpoints, and a codebase called DeepSpec for training and evaluating speculative decoding systems. [1] All materials are available on DeepSeek’s public GitHub and Hugging Face pages under the MIT license, making them freely usable for research and commercial purposes. [1]

The framework addresses one of the most costly problems in AI deployment: serving large models quickly enough for real users while keeping hardware costs manageable. [1] That challenge affects consumer chatbots, coding assistants, agentic workflows, and enterprise AI systems where users expect fast, streaming responses. [1]

How DSpark Works

DSpark is built on speculative decoding, an established inference technique in which a smaller, faster draft component proposes several likely next tokens, and the larger target model then verifies that batch in parallel rather than generating each token one at a time. [1] If the draft’s guesses are correct, the system advances multiple tokens at once; if a guess is wrong, the system discards it and any tokens after it, then corrects course. [1] The technique is designed to increase speed without changing the target model’s intended output distribution. [1]
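
The draft-then-verify loop described above can be illustrated with a minimal sketch. It is not DSpark's implementation: it assumes greedy decoding, and the names `toy_draft`, `toy_target`, `speculative_step`, and `generate` are hypothetical stand-ins for real models and serving code. Sampling-based systems use a probabilistic acceptance rule, which is omitted here.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round under greedy decoding; returns new tokens."""
    # 1. The cheap draft proposes k tokens.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real system scores all k
    #    positions in one batched forward pass; this loop only mimics that.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: drop this guess and everything after it,
            # and emit the target's own token instead.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "large" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "small" model: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def generate(prefix: List[Token], n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens after prefix using repeated speculative rounds."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, toy_draft, toy_target, k))
    return out[: len(prefix) + n_tokens]
```

In this toy setup, `generate([1], 8)` yields the same sequence the target would produce alone, but in two verification rounds rather than eight single-token steps.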

DSpark introduces two specific improvements over prior speculative decoding approaches. [1] First, it uses what DeepSeek calls semi-autoregressive generation, combining a parallel drafting backbone — which proposes multiple tokens simultaneously — with a lightweight sequential head that accounts for relationships between nearby tokens, reducing incoherent guesses. [1] Second, it adds confidence-scheduled verification, in which a hardware-aware scheduler dynamically adjusts how many draft tokens are sent for verification based on model confidence and current serving load, rather than always checking a fixed number. [1] Under heavier traffic, the system trims low-confidence trailing guesses before they consume batch capacity needed for other users. [1]
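
One way to picture confidence-scheduled verification is the sketch below. The source does not describe the scheduler's actual rule, so everything here is an assumption made for illustration: the function name `schedule_verify_length`, the linear threshold, and the idea of multiplying per-token confidences to estimate how likely a prefix is to survive verification.

```python
from typing import List


def schedule_verify_length(confidences: List[float], load: float,
                           base_threshold: float = 0.2,
                           max_threshold: float = 0.8) -> int:
    """Return how many leading draft tokens to send for verification.

    confidences: draft confidence per token in [0, 1], in draft order.
    load: current serving load in [0, 1] (0 = idle, 1 = saturated).
    The thresholds are hypothetical tuning knobs, not DSpark parameters.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be in [0, 1]")
    if not confidences:
        return 0

    # Heavier load -> stricter threshold -> shorter verified prefix.
    threshold = base_threshold + (max_threshold - base_threshold) * load

    keep = 0
    survival = 1.0  # rough estimate that the whole prefix is accepted
    for c in confidences:
        survival *= c
        if survival < threshold:
            break  # trailing tokens are unlikely to survive; trim them
        keep += 1

    # Always verify at least one token so each round makes progress.
    return max(keep, 1)
```

With draft confidences `[0.95, 0.9, 0.8, 0.6, 0.4]`, this sketch sends four tokens for verification on an idle system, three at half load, and two at full load, mirroring the trimming behavior described above.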

Reported Performance Gains

DeepSeek applied DSpark to two variants of its DeepSeek-V4 model: DeepSeek-V4-Flash, a 284-billion-parameter mixture-of-experts (MoE) model with 13 billion active parameters, and DeepSeek-V4-Pro, a 1.6-trillion-parameter MoE model with 49 billion active parameters; both support context windows up to one million tokens. [1]

In live production tests, DSpark improved aggregate throughput by 51% for V4-Flash at an 80-token-per-second-per-user service target, and by 52% for V4-Pro at a 35-token-per-second-per-user target. [1] At matched system capacity, DeepSeek reports per-user generation speedups of 60% to 85% for V4-Flash and 57% to 78% for V4-Pro compared with its prior MTP-1 production baseline. [1]

DeepSeek also reports much larger figures, 661% and 406% throughput increases for V4-Flash and V4-Pro respectively, but these measure aggregate system output under very strict per-user speed targets of 120 and 50 tokens per second, conditions under which the older MTP-1 baseline approaches an operational bottleneck. [1] The per-user speedups (60%–85% for V4-Flash and 57%–78% for V4-Pro) are the more directly comparable measure of how much faster individual users receive tokens under equivalent serving conditions. [1]

Offline Tests on Qwen and Gemma

DeepSeek also tested DSpark offline against Qwen3-4B, Qwen3-8B, Qwen3-14B, and Gemma4-12B target models across math, coding, and chat benchmarks, comparing it with two existing speculative decoding approaches: DFlash, a parallel drafter, and Eagle3, an autoregressive drafter. [1] Across the three Qwen3 model sizes, DSpark improved macro-average accepted length — the number of draft tokens that survive verification per decoding round — over Eagle3 by 30.9%, 26.7%, and 30.0% respectively, and over DFlash by 16.3%, 18.4%, and 18.3%. [1] The paper also reports that gains generalized to Gemma4-12B. [1]
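
The accepted-length metric can be made concrete with a short sketch. It assumes that "macro-average" means an unweighted mean of per-benchmark means, which the source does not spell out. The per-round counts are made-up placeholders rather than results from the paper, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def macro_average_accepted_length(
    rounds_by_task: Dict[str, List[int]]
) -> Tuple[float, Dict[str, float]]:
    """Average accepted tokens per round within each task, then across tasks.

    Each task counts equally, however many decoding rounds it contributed.
    """
    per_task = {task: mean(lengths) for task, lengths in rounds_by_task.items()}
    return mean(per_task.values()), per_task


def relative_gain_percent(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


# Placeholder data: accepted draft tokens in each decoding round (not real results).
example_rounds = {
    "math": [4, 5, 6],
    "code": [5, 5],
    "chat": [2, 3, 2, 3],
}
macro, per_task = macro_average_accepted_length(example_rounds)
```

In the placeholder data, chat has the most rounds but still contributes only one third of the macro average, which is the point of averaging per task first.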

The offline results also show that structured tasks such as math and code tend to yield higher accepted lengths than open-ended chat, because those outputs follow more predictable patterns. [1] That suggests DSpark-style methods may be especially attractive for coding assistants, data analysis agents, and structured workflow automation. [1]

Early Community Testing

Developer Rafael Caricio published a GitHub pull request documenting single-stream V4-Flash DSpark testing, reporting approximately 26 tokens per second without speculative decoding, roughly 40 tokens per second with MTP-1, and approximately 60 tokens per second with DSpark, or about 1.5x over MTP-1 and 2.3x over non-speculative decoding. [1] A later commit in the same pull request recorded a five-run mean of 60.31 tokens per second, with a 1.51x gain over MTP-1 and 2.29x over non-speculative decoding. [1]
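
The speedup ratios follow directly from the reported throughput figures, as the arithmetic below shows. The helper names are hypothetical. The implied baselines in the second half are back-calculated from the reported ratios, not separately reported measurements.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of new to old throughput, in tokens per second."""
    return new_tps / old_tps


def implied_baseline(new_tps: float, ratio: float) -> float:
    """Baseline throughput implied by a measured speed and a reported ratio."""
    return new_tps / ratio


# Approximate figures from the first report.
no_spec_tps, mtp1_tps, dspark_tps = 26.0, 40.0, 60.0
print(f"DSpark vs MTP-1:          {speedup(dspark_tps, mtp1_tps):.2f}x")
print(f"DSpark vs non-speculative: {speedup(dspark_tps, no_spec_tps):.2f}x")

# Later five-run mean: back out the baselines the reported ratios imply.
mean_tps = 60.31
print(f"Implied MTP-1 baseline:    {implied_baseline(mean_tps, 1.51):.2f} tok/s")
print(f"Implied non-spec baseline: {implied_baseline(mean_tps, 2.29):.2f} tok/s")
```

The implied baselines land close to the roughly 40 and 26 tokens per second in the first report, so the two sets of figures are consistent with each other.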

The same testing also identified a practical limitation: in realistic multi-turn coding sessions, performance can degrade as draft acceptance falls with growing context. [1] DSpark’s speed gains still depend on how predictable the next tokens are and how well the drafter stays aligned with the target model. [1]

Enterprise and Developer Implications

In principle, DSpark is not limited to DeepSeek’s own models, but it is not a drop-in plug-in for every model. [1] Enterprise teams running open-weight models such as Qwen, Gemma, Llama, or Mistral on their own infrastructure could train or fine-tune a DSpark-style draft module against their target model using the DeepSpec workflow, which covers data preparation, target-model answer regeneration, draft model training, and acceptance evaluation. [1] A drafter trained for DeepSeek-V4 will not automatically transfer to a different model, particularly one fine-tuned on proprietary data. [1]

For teams using models only through hosted APIs, DSpark cannot be applied from the outside, as it requires access to the model weights, token verification loop, logits, and serving scheduler. [1] The API provider could implement a similar optimization internally, but external customers cannot access those components. [1]

DeepSpec’s default data preparation setup for Qwen3-4B can require roughly 38 TB of target cache storage, and the default scripts assume a single node with eight GPUs, making the release more immediately relevant to AI labs, cloud teams, and sophisticated enterprise infrastructure groups than to typical application developers. [1]


Sources

  1. VentureBeat — DeepSeek open sources DSpark, a new framework to speed up LLM inference by up to 85%

This article was drafted with AI from the cited sources and checked against them before publication.