
Written by AI agents, curated and verified by me.

DeepSeek DSpark: speed from the inference layer, not from a new model

  • DeepSeek
  • Agentic Engineering
  • Coding Agents

On 27 June, DeepSeek released two things: DSpark, a speculative decoding method that speeds up generation on DeepSeek-V4-Flash in production by 60 to 85 percent, and DeepSpec, an MIT-licensed training stack for building the required draft models, including for other open models. The notable part: the gain is lossless. There is no new model you would have to re-evaluate. The speedup lives in how the model is served, not in the model itself.

What is DSpark?

DSpark is a variant of speculative decoding. The basic idea: a small draft model proposes a block of candidate tokens, the large target model checks the whole block in a single forward pass and keeps the longest prefix consistent with its own distribution. DSpark adds two mechanisms. First, semi-autoregressive generation: a parallel backbone drafts the block in one pass, and a lightweight sequential head adds back the dependencies between the tokens. Second, confidence-scheduled verification: a confidence head estimates the acceptance probability per position, and a scheduler that knows the hardware load trims verification to the tokens where it pays off. In the paper’s offline benchmarks, the average accepted length improves over Eagle3 by roughly 27 to 31 percent and over DFlash by 16 to 18 percent, measured on Qwen3 models from 4 to 14 billion parameters.
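The draft-and-verify loop and the idea of trimming a block by confidence can be sketched in a few lines of Python. This is a minimal illustration, not DSpark's implementation: the function names, the greedy acceptance rule, and the `min_reach_prob` threshold are all assumptions made for the example. A real system checks the whole block in one forward pass of the target model, whereas this toy version calls the target once per position for readability.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]  # hypothetical "model": context -> next token


def verify_block(context: Sequence[Token], draft: Sequence[Token],
                 target_next: NextToken) -> List[Token]:
    """Greedy verification (a simplification): keep the longest draft prefix
    the target agrees with, then add one token from the target itself."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends the round
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target's pass also yields one extra token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def trim_by_confidence(confidences: Sequence[float], min_reach_prob: float) -> int:
    """Return how many draft positions are worth verifying (assumed rule).
    Position i only counts if every earlier position is accepted, so the
    chance of reaching it is the product of the confidences up to i."""
    reach = 1.0
    keep = 0
    for c in confidences:
        reach *= c
        if reach < min_reach_prob:
            break
        keep += 1
    return keep


def toy_target(ctx: Sequence[Token]) -> Token:
    """Toy target model: always counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


# The draft gets two tokens right, then goes wrong at the third position.
partial = verify_block([3], [4, 5, 9, 0], toy_target)
full = verify_block([3], [4, 5], toy_target)
kept = trim_by_confidence([0.9, 0.8, 0.5, 0.9], min_reach_prob=0.5)
```

Even when the draft goes wrong, each round still produces at least one token from the target, so a bad draft costs time but never changes the output. A scheduler like the one described above would presumably also raise or lower the threshold with hardware load; that part is left out here.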

Why is the speed gain lossless?

Because the check is exact. Via rejection sampling, the target model only accepts draft tokens in a way that preserves its own distribution. The output is therefore statistically the same as what the target model would have produced alone, just faster. In the DeepSeek-V4 serving system under live user traffic, DeepSeek reports 60 to 85 percent faster per-user generation on V4-Flash and 57 to 78 percent on V4-Pro compared with the prior production baseline MTP-1, at matched aggregate throughput. Strict interactivity tiers such as 120 tokens per second for Flash, where the baseline loses capacity severely according to the paper, only become operable this way. These are vendor figures from DeepSeek's own deployment. But it is the rare kind of improvement that does not require a new round of quality checks: what changes is the latency, not the distribution.
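Why rejection sampling leaves the distribution untouched is easier to see in a small experiment. The sketch below uses the standard speculative sampling rule for a single token: accept the draft's token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p minus q. The two distributions `p` (target) and `q` (draft) are made-up toy values, and nothing here is taken from DeepSeek's code.

```python
import random
from typing import List, Sequence


def sample(dist: Sequence[float], rng: random.Random) -> int:
    """Draw one index with probability proportional to its weight."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_token(p: Sequence[float], q: Sequence[float],
                      rng: random.Random) -> int:
    """One token via draft-then-verify: p is the target distribution,
    q is the draft distribution over the same vocabulary."""
    x = sample(q, rng)  # the draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # the target accepts the proposal
    # Rejected: resample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return sample(residual, rng)


def empirical(p: Sequence[float], q: Sequence[float],
              n: int, seed: int = 0) -> List[float]:
    """Observed token frequencies over n draws."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        counts[speculative_token(p, q, rng)] += 1
    return [c / n for c in counts]


# Toy distributions (assumed values): the draft disagrees with the target.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
freqs = empirical(p, q, n=100_000)
print([round(f, 3) for f in freqs])
```

The printed frequencies should land close to `p`, not `q`, even though every proposal came from the draft. A worse draft lowers the acceptance rate and therefore the speedup, but it does not shift the output distribution.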

What is inside DeepSpec?

DeepSpec is the full stack behind these results: data preparation, training, and evaluation of draft models, under the MIT license, with three implemented algorithms, namely DSpark, DFlash, and Eagle3. DeepSeek ships ready-made checkpoints for Qwen3-4B, 8B, and 14B as well as Gemma-4-12B, plus the trained DSpark checkpoints for DeepSeek-V4-Flash and V4-Pro, each in preview. The costs are stated honestly in the README: the target cache for the default Qwen3-4B configuration takes roughly 38 terabytes of storage, and the scripts assume a single node with eight GPUs. The included checkpoints were also trained in non-thinking mode. For your own domain, especially if the target model runs in thinking mode, DeepSeek recommends fine-tuning the draft model again.

What does this mean for long agent runs?

Agent runs are chains of generation rounds. Every tool call, every intermediate step, every correction is another round, and latency adds up across the length of the run. Generation that is 60 percent faster shortens each of those rounds without changing the result. If you use DeepSeek-V4 through the API, you get this in the serving layer. If you self-host an open model, DeepSpec lets you train a draft model for it, provided the storage and GPUs are there. The distinction that matters stays the same: what gets faster is the generation, not the correctness. Whether the result of a long run holds up is still checked by someone who answers for it. It is the same line as in agentic engineering: reliability comes from the architecture around the model. DSpark shows that the cost and the latency live there too. This time the architecture works in your favour.
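A back-of-the-envelope calculation shows what a per-round speedup does to a whole run. All numbers below are hypothetical (40 rounds, 600 generated tokens per round, a baseline of 75 tokens per second, 2 seconds of tool time per round), and "60 percent faster" is read here as 1.6 times the tokens per second, which is an assumption about how the figure is defined.

```python
def run_seconds(rounds: int, tokens_per_round: int, tokens_per_second: float,
                tool_seconds_per_round: float = 0.0) -> float:
    """Total wall-clock time of an agent run: generation plus tool time, per round."""
    generation = tokens_per_round / tokens_per_second
    return rounds * (generation + tool_seconds_per_round)


# Hypothetical workload, not measured data.
BASELINE_TPS = 75.0
SPEEDUP = 1.6  # "60 percent faster", read as 1.6x tokens per second

baseline = run_seconds(40, 600, BASELINE_TPS, tool_seconds_per_round=2.0)
faster = run_seconds(40, 600, BASELINE_TPS * SPEEDUP, tool_seconds_per_round=2.0)
saved_fraction = 1.0 - faster / baseline

print(f"baseline {baseline:.0f} s, with speedup {faster:.0f} s, "
      f"saved {saved_fraction:.0%}")
```

Under these assumptions, generation time per round drops by 37.5 percent, but the run as a whole gets only about 30 percent shorter, because tool time is untouched. The more of a run is spent waiting on tools rather than generating, the smaller the share of it that a faster serving layer can recover.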

Sources