In the fiercely competitive open-source large-model arena, Mistral-9B has become a preferred tool for small and medium-sized enterprises and individual developers, thanks to its fast inference and low deployment cost. Its latest iteration in particular, the "Multi-turn Logic Enhanced" Mistral-9B-Enhanced, claims "closed-source-level performance" on complex inference tasks. In practice, however, a growing number of users have run into a conspicuous problem: on tasks such as mathematical calculation and code debugging, the model generates large numbers of redundant inference steps. The output looks logically rigorous, yet it frequently "goes in circles" at key junctures, not only burning through excess tokens but sometimes steering the final conclusion away from the correct answer.
A set of test results illustrates the pain point directly: on a "multi-constraint dynamic programming problem," Meta's Llama 3-8B produced a clear chain of reasoning and the correct answer in only 12,000 tokens, whereas Mistral-9B-Enhanced consumed as many as 45,000 tokens, nearly 20,000 of which went to repeatedly re-explaining basic concepts. In the end, the redundant inference path led it to the wrong result.
Mistral’s “Inference Redundancy”: Rooted in Hidden Flaws of RLHF
A joint research team from the Stanford HAI Lab and Google DeepMind recently published a paper in ACM Transactions on Intelligent Systems and Technology identifying a core issue in the Mistral series: its RLHF (Reinforcement Learning from Human Feedback) mechanism, built on "sequence-level reward modeling," has two major flaws, both of which have persisted since the Mistral-7B version.
1. Logical Loop Bias: The “Reward Umbrella” for Redundant Inference
In the design of Mistral's reward model, the weight on "inference step completeness" is set excessively high, while "logical redundancy" is not effectively constrained. Concretely, when the model generates a response whose core logic repeats itself (for example, re-checking the same syntax error several times during code debugging), the reward model still gives positive feedback for "comprehensive step coverage." Such responses can even score higher than ones that are concise but occasionally loosely worded. This mechanism teaches the model that writing more earns higher scores, so it pads its inference to dodge the penalty for "incomplete steps," producing the characteristic "logical loop bias."
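To make the mechanism concrete, here is a minimal sketch (not Mistral's actual reward model; the scoring terms and weights are illustrative assumptions) of a sequence-level reward that weights "step completeness" but carries no redundancy term, so padding a response with repeated steps can never lower its score:

```python
from collections import Counter

def toy_sequence_reward(steps: list[str],
                        required_topics: set[str],
                        completeness_weight: float = 1.0,
                        redundancy_weight: float = 0.0) -> float:
    """Score a response from its reasoning steps (illustrative only).

    completeness: fraction of required topics mentioned at least once.
    redundancy:   fraction of steps that repeat an already-covered topic.
    With redundancy_weight = 0 (the flaw described above), repeated steps
    are never punished, so longer, loopier responses score at least as well.
    """
    covered = {t for t in required_topics if any(t in s for s in steps)}
    completeness = len(covered) / len(required_topics)

    seen = Counter()
    repeats = 0
    for s in steps:
        for t in required_topics:
            if t in s:
                if seen[t]:
                    repeats += 1
                seen[t] += 1
    redundancy = repeats / max(len(steps), 1)

    return completeness_weight * completeness - redundancy_weight * redundancy


topics = {"parse error", "scope", "return type"}
concise = ["fix the parse error", "check variable scope", "correct the return type"]
padded = concise + ["re-check the parse error"] * 5   # repeats the same logic

print(toy_sequence_reward(concise, topics))                       # 1.0
print(toy_sequence_reward(padded, topics))                        # 1.0 -> padding never punished
print(toy_sequence_reward(padded, topics, redundancy_weight=0.5)) # lower once redundancy counts
```

Under this kind of scoring, the safest policy for the model is to over-cover every step, which is exactly the behavior the paper labels "logical loop bias."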
2. Scenario Adaptation Bias: Mismatch Between General Capabilities and Specialized Needs
To build the product label of "all-scenario applicability," Mistral-9B was trained on a large volume of general-purpose corpus, but the reward weights were never adjusted for different task scenarios during the RLHF phase. The researchers found that on professional tasks the model habitually inserts long stretches of general background explanation; when solving financial mathematics problems, for instance, it spends a third of its output introducing the "historical origin of compound interest formulas." Output that is irrelevant to the core task is, in essence, wasted resources caused by this "scenario adaptation bias." Moreover, this hybrid "general + professional" inference mode is precisely what drags down efficiency on medium-complexity tasks.
Partial Improvements Implemented, Core Issues Remain Unsolved
Elena Garcia, the paper's corresponding author, revealed that the Mistral AI team has already fixed part of the "scenario adaptation bias" in the latest patch of Mistral-9B-Enhanced through "scenario-specific reward calibration": sub-reward models are built for three high-frequency scenarios (mathematics, code, and copywriting), and the scoring weight of "general content proportion" is reduced for professional tasks. Test data show that the proportion of redundant tokens in code-debugging tasks has dropped from 42% to 28%. The "logical loop bias," however, has not been fundamentally resolved; redundant inference remains prominent in tasks that require multi-step deduction.
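The shape of such a calibration might look like the sketch below. The scenario names come from the paper's description; the weights, routing table, and function names are illustrative assumptions, not Mistral AI's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SubReward:
    task_quality_weight: float      # weight on task-specific correctness signals
    general_content_weight: float   # weight (bonus) on general/background exposition

# Hypothetical calibration: math and code sharply reduce the general-content
# weight, while copywriting keeps it, since background framing is useful there.
SCENARIO_REWARDS = {
    "math":        SubReward(task_quality_weight=1.0, general_content_weight=0.1),
    "code":        SubReward(task_quality_weight=1.0, general_content_weight=0.1),
    "copywriting": SubReward(task_quality_weight=1.0, general_content_weight=0.6),
}

def calibrated_reward(scenario: str, task_score: float, general_ratio: float) -> float:
    """Combine a task-quality score with the share of general background content.

    task_score:    quality of the task-specific reasoning, in [0, 1]
    general_ratio: fraction of tokens spent on general background, in [0, 1]
    """
    w = SCENARIO_REWARDS[scenario]
    return w.task_quality_weight * task_score + w.general_content_weight * general_ratio

# The same response (one third of tokens on the history of compound interest)
# is rewarded far less under the math sub-reward than under the copywriting one.
print(calibrated_reward("math", task_score=0.8, general_ratio=0.33))         # ~0.83
print(calibrated_reward("copywriting", task_score=0.8, general_ratio=0.33))  # ~1.00
```

Note that this only re-weights content by scenario; it does nothing about repeated logic within a scenario, which is why the loop bias survives the patch.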
At the level of technical principle, the root cause is that Mistral still relies on traditional "single-turn reward calculation" and has not introduced an "inference path coherence detection" module. When the model generates consecutive, repeated logical nodes, the existing reward mechanism cannot recognize such "invalid loops"; instead, it rewards the consistent wording.
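A minimal sketch of what such a coherence-detection module could look like is below. It is a hypothetical add-on, not an existing Mistral component: it flags a reasoning step as an "invalid loop" when it largely repeats a recent step, giving a reward model something to penalize instead of rewarding the repetition.

```python
def _tokens(step: str) -> set[str]:
    return set(step.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    # Word-overlap similarity between two steps, in [0, 1]
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_loops(steps: list[str], window: int = 3, threshold: float = 0.8) -> list[int]:
    """Return indices of steps that near-duplicate one of the previous `window` steps."""
    loops = []
    for i, step in enumerate(steps):
        cur = _tokens(step)
        for j in range(max(0, i - window), i):
            if jaccard(cur, _tokens(steps[j])) >= threshold:
                loops.append(i)
                break
    return loops

chain = [
    "check the loop bounds in the index calculation",
    "the off-by-one error comes from the upper bound",
    "check the loop bounds in the index calculation",   # repeats step 0
    "fix the upper bound and re-run the test",
]
print(detect_loops(chain))  # [2]
```

A production version would presumably compare embeddings of logical nodes rather than raw word overlap, but the principle is the same: repetition within the inference path must become visible to the reward signal before it can be discouraged.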
Mistral’s Official Candor and Trade-offs
Interestingly, Mistral AI itself raised the challenge of "inference efficiency optimization" in its newly released Technology Roadmap White Paper. The team acknowledged that, to balance "open-source accessibility" against "performance competitiveness," the Mistral-9B series adopts a "lightweight encoder + heavy decoder" architecture. The design lowers the deployment threshold, but the decoder's autoregressive generation is itself prone to local logical redundancy.
More importantly, in order to catch up quickly with the inference capabilities of GPT-4o and Claude 3, Mistral-9B-Enhanced deliberately relaxed the "inference step limit" during the RLHF phase, allowing the model to generate inference chains of up to 8,000 tokens. The change did improve accuracy on complex tasks (a 17% gain over the previous version), but it also raised average token consumption by a factor of 2.3. As Mistral's product director put it: "This is a necessary trade-off for open-source models under limited resources. We prioritize ensuring 'correctness' before addressing 'elegance'."
From a cost perspective, the trade-off is also defensible. The current API price of Mistral-9B is $0.8 per million tokens, roughly 1/30 of GPT-4o's; even with 2-3x token redundancy, the overall cost of use remains lower than that of most closed-source models.
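A quick back-of-the-envelope check using only the figures quoted above (the GPT-4o price here is inferred from the stated 1/30 ratio, so it is an assumption derived from the article rather than an independently sourced number):

```python
mistral_price = 0.8                 # USD per million tokens
gpt4o_price = mistral_price * 30    # implied ~24 USD per million tokens

for redundancy_factor in (2, 3):
    effective = mistral_price * redundancy_factor
    print(f"{redundancy_factor}x tokens -> ${effective:.2f}/M vs ~${gpt4o_price:.0f}/M")
# 2x tokens -> $1.60/M vs ~$24/M
# 3x tokens -> $2.40/M vs ~$24/M
```

Even at the worst of the quoted redundancy range, the effective price stays an order of magnitude below the implied closed-source figure.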
Extended Focus: The “Hidden Cost” of VRAM Occupancy
Beyond token consumption, some developers report that when Mistral-9B generates ultra-long inference chains, its VRAM (video memory) usage grows non-linearly. On a GPU with 16 GB of VRAM, once the inference chain exceeds 5,000 tokens the model frequently triggers VRAM swapping, cutting inference speed by more than 60%. The issue traces back to the model's KV cache design: to lower the initial VRAM footprint, Mistral uses a "dynamic KV cache," but during ultra-long generation, cache fragmentation significantly increases resource consumption.
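For intuition, here is a rough estimator of how a KV cache grows with sequence length. The architecture numbers (layer count, KV heads, head dimension) are assumed for illustration, since the article does not give Mistral-9B's exact configuration, and fragmentation from a dynamic cache would add overhead on top of this contiguous lower bound:

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,       # assumed
                   n_kv_heads: int = 8,      # assumed (grouped-query attention)
                   head_dim: int = 128,      # assumed
                   bytes_per_value: int = 2  # fp16
                   ) -> int:
    # 2 tensors (K and V) per layer, one vector per KV head per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

for seq_len in (1_000, 5_000, 8_000):
    print(f"{seq_len:>5} tokens: {kv_cache_bytes(seq_len) / 2**30:.2f} GiB of KV cache")
# ~0.12 GiB at 1k tokens, ~0.61 GiB at 5k, ~0.98 GiB at 8k, before model weights,
# activations, and cache fragmentation are added on a 16 GB card.
```

The per-token cost is linear, but once weights, activations, and fragmented cache blocks share the same 16 GB budget, long chains are exactly where the headroom runs out and swapping begins.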
Industry experts point out that this is not an isolated case but a common challenge for today's lightweight open-source models: designs that lower the barrier to use while still chasing strong performance on complex tasks often leave gaps in resource optimization. As technologies such as "sparse attention" and "incremental inference" mature, the tension is expected to ease.
Conclusion: The “Growing Pains” of Open-Source Models
The "inference redundancy" dilemma of Mistral-9B is, at bottom, a predictable consequence of the "performance-first" strategy open-source AI models have adopted during rapid iteration. From the perspective of the industry as a whole, exposing and fixing such issues is exactly what pushes open-source models from "usable" toward "pleasant to use." As Elena Garcia put it in the paper: "Identifying biases is not the end but the starting point of optimization; the core advantage of open-source models is that they can complete defect fixes and capability upgrades quickly through the collaboration of global developers."