On one hand, training an “all-in-one” large model to simultaneously handle reasoning, planning, and tool use offers the advantage of tight integration, but it often suffers from unstable training and limited scalability on long-horizon reasoning tasks. On the other hand, prompt-based agent systems, while flexible, lack the ability to learn and self-optimize, preventing them from continuously evolving through interaction. How can this bottleneck be broken?

A research team from Stanford University, in collaboration with Texas A&M University, the University of California, San Diego (UCSD), and Lambda, has proposed a novel solution: letting the agent system perform online reinforcement learning within a “stream” of reasoning, so that it continuously improves itself and evolves its capabilities. They introduced the AgentFlow framework, which adopts a modular architecture in which four specialized agents work collaboratively, paired with a purpose-built Flow-GRPO algorithm. This setup allows the system to keep optimizing its decision-making strategy through interaction with real environments.

Experimental results show that even a 7B-parameter AgentFlow outperforms both GPT-4o (~200B parameters) and Llama-3.1-405B across multiple tasks, including search, mathematics, and science. The team lead shared the work on Twitter, where it received significant attention, and the project climbed to second place on the Hugging Face Daily Papers leaderboard, as well as ranking among the most popular Hugging Face projects of the week.

The Challenge of Credit Assignment in Long-Horizon Reasoning

The core challenge in training agent systems lies in multi-turn credit assignment: in environments with long horizons and sparse rewards, how can we accurately determine how much each decision step contributed to the final outcome?

Traditional single-model approaches integrate all functions into one large language model (LLM), which outputs thoughts, tool calls, and responses in a single generation using special tags (e.g., <tool_call>). While effective for short-horizon tasks, this approach tends to run into problems in complex scenarios, such as the following (a purely illustrative example of such a monolithic output appears after the list):

• Training instability due to excessively long reasoning chains,

• Difficulty tracing errors when incorrect tools are selected,

• Inability to dynamically adjust strategies based on environmental feedback.
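To make the contrast concrete, here is a purely illustrative sketch of what such a monolithic, tag-based output might look like. The tag names (<think>, <tool_call>, <tool_result>, <answer>) and the trace itself are assumptions for illustration, not taken from the AgentFlow paper or any specific model.

```python
# Hypothetical monolithic trace: one LLM interleaves reasoning, tool calls, and the
# final answer in a single generation (tag names are assumed for illustration only).
monolithic_trace = """
<think>I need the 2023 population of France before I can compare it with Germany.</think>
<tool_call>{"name": "web_search", "arguments": {"query": "France population 2023"}}</tool_call>
<tool_result>about 68 million</tool_result>
<think>Now repeat for Germany ... (many more interleaved steps follow)</think>
<answer>...</answer>
"""
# When such a sequence stretches over dozens of steps and the final answer is wrong,
# it is hard to tell which individual thought or tool call caused the failure --
# exactly the multi-turn credit assignment problem described above.
```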

Existing agent systems (such as LangGraph, OWL, Pydantic, AutoGen) have achieved modularity but mostly rely on fixed prompt engineering and lack mechanisms for learning from experience.

AgentFlow: Real-Time Multi-Agent Interaction and Learning in a “Stream”

The design philosophy of AgentFlow is to decompose complex reasoning tasks among specialized agent modules, while enabling the core decision-making module to continuously learn during interactions.

Four-Module Collaborative Architecture

The system consists of four memory-equipped, specialized agents:

1. Planner: Analyzes task requirements, formulates execution strategies, and selects the most appropriate tools. → This is the core decision-making module and the only trainable component of the system.

2. Tool: Responsible for actually invoking tool APIs and integrating the returned results.

3. Evaluator: Assesses intermediate results against accumulated historical memory, determining whether they align with task goals and constraints.

4. Solver: Integrates all information and validation feedback to generate the final answer or propose the next action.

The key innovation is that the Planner is not static: it is continuously optimized in real time through online (on-policy) reinforcement learning within the reasoning stream. After each interaction round, the system updates the Planner’s decision strategy based on the success or failure of the final outcome and incorporates these updates into the system’s memory, forming a closed-loop adaptive learning process. A minimal sketch of one such reasoning turn is given below.
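The interfaces in this sketch (plan, call, check, update, answer) are hypothetical names chosen for illustration; they are not the authors’ actual API, only a reading of the four-module loop described above.

```python
# Minimal sketch of an AgentFlow-style reasoning loop (hypothetical interfaces).
# Only the Planner is a trainable policy; Tool, Evaluator, and Solver stay fixed.

def run_flow(task, planner, tool_agent, evaluator, solver, memory, max_steps=10):
    trajectory = []                                      # planner actions, reused by Flow-GRPO
    for _ in range(max_steps):
        action = planner.plan(task, memory)              # pick a subgoal and a tool (trainable)
        result = tool_agent.call(action)                 # execute the chosen tool API
        verdict = evaluator.check(task, result, memory)  # does the result serve the goal?
        memory.update(action, result, verdict)           # grow the shared, evolving memory
        trajectory.append(action)
        if verdict.done:                                 # Evaluator signals the task is solved
            break
    answer = solver.answer(task, memory)                 # Solver produces the final response
    return answer, trajectory
```

After each rollout, the outcome of the final answer is used to update only the Planner’s policy; the collected trajectory is what Flow-GRPO (next section) operates on.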

Flow-GRPO Algorithm: Solving the Credit Assignment Problem

The team proposed the Flow-GRPO (Flow-based Group Relative Policy Optimization) algorithm, designed specifically for multi-turn reasoning. Its core idea is to broadcast the final trajectory reward (success/failure) back to each individual action, transforming the complex problem of multi-step reinforcement learning into a series of single-step policy updates. The approach works as follows (a simplified sketch follows the list):

1. Collect the complete reasoning trajectory (from the initial task to the final result).

2. Compute the outcome reward based on the final result.

3. Distribute this reward across each planning action in the trajectory.

4. Use a relative advantage function to compute the advantage of each action and perform policy-gradient updates.
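The sketch below illustrates these four steps with PyTorch and a plain REINFORCE-style update. The helper names are made up for illustration, and the actual Flow-GRPO objective likely adds further stabilization (e.g., clipping or KL regularization) that is omitted here.

```python
import torch

def flow_grpo_advantages(group_rewards):
    """Group-relative advantages (GRPO-style): normalize each trajectory's outcome
    reward against the other rollouts sampled for the same task."""
    r = torch.tensor(group_rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def flow_grpo_update(optimizer, group):
    # `group` holds several rollouts of the same task; each item is
    # (list of differentiable planner log-probs, final outcome reward in {0, 1}).
    rewards = [reward for _, reward in group]
    advantages = flow_grpo_advantages(rewards)
    loss = 0.0
    for (logprobs, _), adv in zip(group, advantages):
        # Broadcast the single trajectory-level advantage to every planner action,
        # turning the multi-turn problem into per-step policy-gradient updates.
        for logprob in logprobs:
            loss = loss - adv * logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point is that one outcome reward per trajectory becomes one group-normalized advantage shared by every planner action in that trajectory, so each action can be updated as if it were a single-step decision.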

This method effectively alleviates the issue of sparse rewards while maintaining training stability. Online learning enables the system to:

• Quickly correct erroneous tool calls,

• Explore better ways of decomposing subtasks,

• Dynamically adjust reasoning depth based on environmental feedback.

Experimental Results: The Underdog Model Wins

The research team conducted systematic evaluations across 10 cross-domain benchmarks, covering four major categories: knowledge retrieval, agent tasks, mathematical reasoning, and scientific reasoning.

Performance Comparison

Using Qwen-2.5-7B-Instruct as the base model, AgentFlow significantly outperformed baselines in all categories:

• Knowledge Retrieval: +14.9% improvement over baseline

• Agent Reasoning: +14.0%

• Mathematical Reasoning: +14.5%

• Scientific Reasoning: +4.1%

More surprisingly, cross-scale comparisons revealed:

• A 7B AgentFlow outperformed GPT-4o (≈200B) by 8.2% on search tasks

• It also surpassed Llama-3.1-405B by 15.8% on agent tasks

• Even a 3B AgentFlow beat the 405B baseline on multiple tasks

Key Findings from Ablation Studies

1. Online Learning vs. Offline Learning: Comparative experiments showed that training the Planner with traditional supervised fine-tuning (SFT) led to an average performance drop of 19%, showing that online learning in real interactive environments is essential for efficient reasoning.

2. Autonomous Exploration of New Strategies: The trained system learns to select appropriate tool combinations based on task characteristics. It also spontaneously discovers new tool-usage patterns, such as combining Wikipedia Search with an enhanced Web Search to achieve deeper information mining; these patterns were rarely observed in untrained reasoning flows.

3. Dynamic Reasoning Depth: On reasoning-dense tasks such as multi-hop search, the trained AgentFlow exhibits a kind of “intelligent laziness”: it uses fewer reasoning steps for simple tasks and increases depth only for complex ones. As the maximum step limit increases, performance improves steadily while the average number of steps does not grow proportionally.

4. Value of Modular Collaboration: While the reasoning flow alone brings performance gains, untrained systems tend to fall into loops or get stuck. After reinforcement learning, the system shows clear improvements in tool-call accuracy, subtask planning granularity, and overall performance. The authors give an illustrative example: before Flow-GRPO training, the system would repeatedly output the same subgoals and tool calls after hitting an error (e.g., a Python variable-definition mistake), wasting time and compute. After training, the Planner adjusts its strategy based on past errors and uses more precise subgoals to guide subsequent steps, often succeeding in a single step. This vividly demonstrates the potential of reinforcement learning for real-world agent reasoning.

Technical Significance & Future Outlook

The value of the AgentFlow work lies in:

1. A New Training Paradigm: It demonstrates that agent systems can acquire learning abilities akin to those of large models through online reinforcement learning, and can even be more efficient on specific tasks.

2. Validation of “Small but Mighty”: It shows that, with proper system design, small models leveraging modular collaboration and continuous learning can outperform large general-purpose models on complex reasoning tasks.

3. Ideas for Scalable AI: The modular architecture allows new tools to be added and module functions to be adjusted flexibly.

AgentFlow demonstrates at least one thing clearly: the future of Agentic AI doesn’t have to rely solely on scaling up model size. Innovation in system architecture + efficient training methods may be a more promising direction to explore.