Many users have pointed out that while DeepSeek-V3.2’s long-thinking enhanced version, Speciale, has indeed put pressure on top closed-source models as an open-source option, it comes with a clear issue: when tackling complex tasks, it consumes an unusually high number of tokens, sometimes producing answers that are long but still incorrect. For example, when solving the same problem:

  • Gemini used just 20,000 tokens
  • Speciale used up to 77,000 tokens

So, what’s going on?

Unresolved “Length Bias”

Some researchers have noted that this is actually an old “bug” that has persisted in the DeepSeek series since DeepSeek-R1-Zero. In short, the issue lies in the GRPO algorithm (Group Relative Policy Optimization). Researchers from institutions such as Sea AI Lab and the National University of Singapore have pointed out that GRPO contains two hidden biases:

1. Length Bias: The Longer the Wrong Answer, the Lighter the Penalty

When computing the training loss, GRPO averages each response’s advantage over that response’s own token count. As a result, shorter wrong answers are penalized more harshly per token than longer ones. This leads to a counterintuitive behavior: the model tends to generate longer, incorrect answers that may look like deep, step-by-step reasoning, but are actually padded to dilute the penalty.
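A minimal numeric sketch of this effect (the advantage value and token counts below are invented for illustration; this is not DeepSeek’s actual training code): because a response’s advantage is spread over its own token count, the per-token penalty of a wrong answer shrinks as the answer grows.

```python
# Hypothetical illustration of GRPO's length normalization.
# Each sampled response gets one scalar advantage; the per-token loss
# averages that advantage over the response's own length (1 / |o_i|).

def per_token_penalty(advantage: float, num_tokens: int) -> float:
    """Per-token contribution of one response under length normalization."""
    return advantage / num_tokens

wrong_advantage = -1.0  # a wrong answer receives a negative advantage

short_wrong = per_token_penalty(wrong_advantage, num_tokens=500)
long_wrong = per_token_penalty(wrong_advantage, num_tokens=20_000)

print(f"short wrong answer, per-token penalty: {short_wrong:.6f}")  # -0.002000
print(f"long  wrong answer, per-token penalty: {long_wrong:.6f}")   # -0.000050
# The longer a wrong answer gets, the lighter the pressure on each of its
# tokens, so padding an incorrect response effectively softens the penalty.
```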

2. Difficulty Bias: Overemphasis on Extremely Easy or Hard Questions

GRPO normalizes each answer’s advantage by the standard deviation of rewards within its group of sampled answers, which effectively re-weights questions according to how spread out their scores are.

  • If nearly everyone gets a question right (low standard deviation), or nearly everyone gets it wrong (also low standard deviation), dividing by that small standard deviation inflates the advantages, so the question ends up dominating the policy update.
  • Meanwhile, medium-difficulty questions, where some samples are right and some are wrong (high standard deviation), are relatively down-weighted.

But in reality, medium-difficulty questions are the most valuable for improving a model’s capabilities.
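A small self-contained sketch of the standard-deviation effect (the group size and reward patterns are made up for illustration, not taken from DeepSeek’s setup): dividing each sample’s advantage by the group’s reward standard deviation inflates the update for near-unanimous questions relative to evenly split ones.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style normalized advantages: (r - mean) / (std + eps)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 8 sampled answers per question (1 = correct, 0 = wrong).
easy_question   = [1, 1, 1, 1, 1, 1, 1, 0]  # almost everyone right -> low std
medium_question = [1, 1, 1, 1, 0, 0, 0, 0]  # split outcomes        -> high std

print("easy  :", [round(a, 2) for a in grpo_advantages(easy_question)])
print("medium:", [round(a, 2) for a in grpo_advantages(medium_question)])
# The lone wrong answer on the easy question gets an advantage of about -2.65,
# while each wrong answer on the medium question gets only -1.0, so the
# low-variance (very easy or very hard) question dominates the update.
```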

Progress Made, But the Bias Remains

Zichen Liu, the lead author of the study, pointed out that DeepSeek-V3.2 has already fixed the “difficulty bias” by introducing a new advantage calculation (as highlighted in the red box in the diagram below). However, the biased length-normalization term still remains (blue box in the diagram). In other words, the length bias is still there.
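The diagram itself is not reproduced here, but for reference, the two terms at issue in the standard GRPO formulation (as commonly written, e.g. in the DeepSeekMath paper; KL regularization omitted for brevity) are the standard-deviation division inside the advantage and the per-response length averaging in the objective:

```latex
% Group-relative advantage (the std division is the source of the difficulty bias):
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_1,\dots,R_G\})}
                 {\operatorname{std}(\{R_1,\dots,R_G\})}

% Per-response objective (the 1/|o_i| averaging is the length normalization;
% KL regularization term omitted for brevity):
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}
    \sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;
    \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right]
```

Dropping the standard-deviation term addresses the difficulty bias; the remaining 1/|o_i| factor is the length normalization that, per Liu, is still in place.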

Official Acknowledgement from DeepSeek

Interestingly, this issue is also acknowledged in DeepSeek’s own technical report. The researchers admit that token efficiency remains a challenge for DeepSeek-V3.2: in general, the two newly released models need to generate longer response trajectories to match the output quality of Gemini-3.0-Pro. Speciale, in particular, was designed with relaxed RL length limits, allowing the model to produce extremely long reasoning chains. This approach enables deep self-correction and exploration, but at the cost of burning a lot of tokens. In essence, DeepSeek is taking a path of “continuously extending reinforcement learning under ultra-long contexts.” That said, considering the cost per million tokens, DeepSeek-V3.2 is priced at just 1/24th of GPT-5, which may make the trade-off a reasonable one.

Also Worth Noting: 128K Context Limit

Additionally, some users have pointed out that DeepSeek’s 128K context window hasn’t been updated in a long time, which may also be related to limited GPU resources.