
AutoGen Cost Control: How to Budget Multi-Agent Conversations That Run Forever


AutoGen Agents Talk Until You Run Out of Money

Microsoft's AutoGen is the gold standard for multi-agent conversations. Define agents, give them roles, and let them talk to each other until they solve the problem. It's elegant. It's powerful.

It's also a blank check to your LLM provider.

Here's the fundamental issue: AutoGen agents converse. Agent A says something to Agent B. Agent B responds. Agent A responds to the response. This continues until one of them says "TERMINATE" — or until your API bill says it for them.

In a typical AutoGen two-agent conversation solving a coding task:

  • Turn 1: ~1,500 tokens (system prompt + first message)
  • Turn 2: ~3,200 tokens (context + response + code)
  • Turn 3: ~5,800 tokens (growing context window)
  • Turn 4: ~9,100 tokens (code execution results + debug)
  • Turn 5: ~13,000 tokens (full conversation history)
  • Turn 6-10: 15,000-25,000 tokens each

A 10-turn conversation with GPT-4o: ~$0.85. Seems fine? Now imagine that conversation goes 30 turns because the agents can't agree on the solution. Or one agent hits an error and enters a debug loop. $4-8 per conversation. 50 conversations a day = $200-400/day.
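The math above can be sanity-checked with a quick sketch. The per-million-token rates and the 80/20 prompt/completion split are assumptions, so treat the result as ballpark:

```python
# Rough per-turn cost model for a two-agent AutoGen conversation.
# Assumed illustrative rates: $2.50 / 1M input tokens, $10 / 1M output tokens.
INPUT_RATE = 2.50 / 1_000_000
OUTPUT_RATE = 10.00 / 1_000_000

def conversation_cost(turn_tokens, output_fraction=0.2):
    """Estimate total cost given total token counts per turn."""
    total = 0.0
    for tokens in turn_tokens:
        prompt = tokens * (1 - output_fraction)
        completion = tokens * output_fraction
        total += prompt * INPUT_RATE + completion * OUTPUT_RATE
    return total

# Token counts from the turns listed above, plus five ~20k-token late turns
turns = [1_500, 3_200, 5_800, 9_100, 13_000] + [20_000] * 5
print(f"10-turn conversation: ~${conversation_cost(turns):.2f}")
```

The exact figure moves with your prompt/completion mix, but the shape is what matters: the late turns dominate, because each one carries the whole history.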

The Four Cost Traps in AutoGen

Trap 1: Unbounded Conversation Loops

AutoGen's max_consecutive_auto_reply defaults to a generous limit. If agents disagree or encounter edge cases, they'll keep talking. Each turn costs more than the last because the context window grows with every message.

A conversation between an AssistantAgent and UserProxyAgent solving a complex task can easily reach 40+ turns. By turn 30, each message includes the full 30-message history — that's 50,000+ tokens per API call.

Trap 2: Code Execution Feedback Loops

AutoGen's killer feature is code execution — the UserProxyAgent runs code and feeds results back to the AssistantAgent. When code fails (and it will), the agent tries to fix it. Each fix attempt is a full round trip: generate code → execute → read error → generate new code. Runtime errors can trigger 5-10 fix cycles, each more expensive than the last.

Trap 3: GroupChat Token Explosion

AutoGen's GroupChat feature lets 3+ agents collaborate. The GroupChatManager broadcasts every message to every agent. With 4 agents and 20 total messages, each agent processes all 20 messages for every turn. That's 4x the token consumption of a two-agent chat — and it compounds with each turn.
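To make the compounding concrete, here is a simplified token model. The assumptions: every message averages the same length, the speaker's API call includes the full history, and auto speaker selection adds one extra full-history call by the GroupChatManager per message:

```python
def groupchat_prompt_tokens(n_messages, avg_msg_tokens, selection_calls=True):
    """Prompt tokens for a GroupChat: on message i, the speaker's call
    includes the i-1 prior messages; with auto speaker selection the
    GroupChatManager makes an extra full-history call to pick the speaker."""
    total = 0
    for i in range(1, n_messages + 1):
        history = (i - 1) * avg_msg_tokens
        calls = 2 if selection_calls else 1  # speaker + manager selection
        total += calls * history
    return total

auto = groupchat_prompt_tokens(20, 400)                          # 152,000 tokens
round_robin = groupchat_prompt_tokens(20, 400, selection_calls=False)  # 76,000
```

Twenty 400-token messages already means six figures of prompt tokens under auto selection, before a single output token is counted.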

Trap 4: The Hidden System Prompt Tax

Every agent in AutoGen has a system prompt that's included in every API call. A detailed system prompt of 500 tokens × 30 turns = 15,000 tokens just for instructions. Multiply by the number of agents in a GroupChat.

Adding Budget Caps to AutoGen with TokenFence

TokenFence wraps your LLM client with per-workflow budget limits. Here's how to integrate it with AutoGen:

Step 1: Install

pip install tokenfence pyautogen

Step 2: Create a Budget-Controlled Configuration

from tokenfence import guard
import openai

# Create a guarded client with a $3.00 budget for this conversation
client = guard(openai.OpenAI(), budget=3.00)

# Standard AutoGen OAI config — budget metering happens via the guarded client above
config_list = [
    {
        "model": "gpt-4o",
        "api_key": "your-api-key",  # Still needed for AutoGen config
    }
]

# The budget applies across all agents sharing this client
llm_config = {
    "config_list": config_list,
    "timeout": 120,
}

Step 3: Wire Into Your Agents

from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="coder",
    system_message="You are a helpful coding assistant. Write clean, tested Python code.",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,  # Hard limit on turns
    code_execution_config={"work_dir": "coding_output"},
)

# TokenFence will kill the conversation if it exceeds $3.00
user_proxy.initiate_chat(
    assistant,
    message="Write a Python function that finds the longest palindromic substring."
)

Step 4: Add Automatic Model Downgrade

from tokenfence import guard
import openai

# Start with GPT-4o, automatically downgrade to GPT-4o-mini at 70% budget
client = guard(
    openai.OpenAI(),
    budget=3.00,
    downgrade_at=0.7,       # Switch models at 70% of budget
    downgrade_model="gpt-4o-mini"
)

# The first 70% of the conversation uses GPT-4o for quality
# The last 30% uses GPT-4o-mini to save money
# Most conversations reach consensus in the first few turns anyway

Advanced: Per-Agent Budgets in GroupChat

In a GroupChat, you want different budgets for different roles. The research agent needs more budget for complex queries. The summarizer needs less.

from tokenfence import guard
from autogen import AssistantAgent
import openai

# Different budgets for different roles
researcher_client = guard(openai.OpenAI(), budget=2.00)
analyst_client = guard(openai.OpenAI(), budget=1.50)
writer_client = guard(openai.OpenAI(), budget=1.00)

# Each agent gets its own cost ceiling — route each agent's requests
# through its own guarded client
researcher = AssistantAgent(
    name="researcher",
    system_message="Find relevant data and papers.",
    llm_config={"config_list": config_list},  # backed by researcher_client
)

analyst = AssistantAgent(
    name="analyst",
    system_message="Analyze the data and find insights.",
    llm_config={"config_list": config_list},  # backed by analyst_client
)

writer = AssistantAgent(
    name="writer",
    system_message="Write a clear summary of the findings.",
    llm_config={"config_list": config_list},  # backed by writer_client
)

# Total budget across all agents: $4.50
# If any agent exceeds its budget, only that agent stops;
# the others can continue the conversation
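If you want to see the "only that agent stops" semantics in isolation, here is a minimal tracker sketch. It's a hypothetical stand-in for what a per-agent guard does, not TokenFence's actual implementation:

```python
class BudgetExceededError(Exception):
    pass

class AgentBudget:
    """Tracks spend for one agent and refuses calls past its ceiling."""
    def __init__(self, name: str, ceiling: float):
        self.name = name
        self.ceiling = ceiling
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        if self.spent + cost > self.ceiling:
            raise BudgetExceededError(f"{self.name} would exceed ${self.ceiling:.2f}")
        self.spent += cost

budgets = {
    "researcher": AgentBudget("researcher", 2.00),
    "analyst": AgentBudget("analyst", 1.50),
    "writer": AgentBudget("writer", 1.00),
}

budgets["writer"].charge(0.90)
try:
    budgets["writer"].charge(0.20)   # would exceed $1.00 — only the writer stops
except BudgetExceededError:
    pass
# researcher and analyst are untouched and can keep talking
```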

The Kill Switch: Emergency Stop for Runaway Conversations

Sometimes you need to kill a conversation immediately. Maybe you're monitoring costs in real-time and see a spike. Maybe an agent entered an infinite debug loop.

from tokenfence import guard
import openai

client = guard(
    openai.OpenAI(),
    budget=5.00,
    on_budget_hit="kill"  # Immediately raise an exception when budget is exceeded
)

# With on_budget_hit="kill", TokenFence raises a BudgetExceeded exception
# on the next call once the budget is spent. The exception propagates out of
# initiate_chat, ending the conversation: no more API calls, no more tokens burned.
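You'll usually want to handle that exception rather than let it crash your service. A tiny wrapper covers it; the exception class is passed in explicitly, so use whatever your installed TokenFence version actually raises (the `BudgetExceeded` name above should be verified against the package):

```python
# Kill-switch handling pattern (sketch): pass in the exception class your
# guard raises so a budget kill ends the conversation cleanly.
def run_capped_chat(user_proxy, assistant, message, budget_exc):
    try:
        return user_proxy.initiate_chat(assistant, message=message)
    except budget_exc:
        # Budget hit mid-conversation: log it, return nothing, bill nothing more
        return None
```

Call it with your existing agents: `run_capped_chat(user_proxy, assistant, "Refactor this module.", BudgetExceeded)`.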

Real-World Cost Comparison: With and Without TokenFence

| Scenario | Without TokenFence | With TokenFence | Savings |
|---|---|---|---|
| 2-agent coding task (10 turns) | $0.85 | $0.85 | 0% (within budget) |
| 2-agent task with debug loop (35 turns) | $6.20 | $3.00 (capped) | 52% |
| 4-agent GroupChat (25 turns) | $12.40 | $5.50 (capped) | 56% |
| 4-agent GroupChat with disagreement (50+ turns) | $28.00+ | $5.50 (capped) | 80%+ |
| Daily workload (50 conversations) | $180-420 | $75-150 | 58-64% |
| Monthly projection | $5,400-12,600 | $2,250-4,500 | 58-64% |

Seven-Point AutoGen Cost Control Checklist

  1. Set max_consecutive_auto_reply on every UserProxyAgent. Never leave it at the default for production.
  2. Wrap your LLM client with TokenFence. Set a per-conversation budget that matches your unit economics.
  3. Enable automatic model downgrade. Start with your best model, switch to a cheaper one as the conversation progresses.
  4. Use per-agent budgets in GroupChat. Not all agents need the same spend ceiling.
  5. Keep system prompts short. Every token in your system prompt is repeated on every turn. 500 tokens × 30 turns = 15,000 wasted tokens.
  6. Log costs per conversation. You can't optimize what you can't measure. Track which conversation types cost the most.
  7. Set up alerts at 50% and 80% budget. Don't wait until you hit the ceiling to notice something's wrong.
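For points 6 and 7, the token counts you need are already in every OpenAI response (`response.usage.prompt_tokens` / `completion_tokens`). A minimal logger sketch — the price table is illustrative and should be replaced with current rates:

```python
# Per-conversation cost log built from the usage object each response carries.
PRICES = {  # $ per 1M tokens: (input, output) — illustrative, verify current rates
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

class ConversationCostLog:
    def __init__(self):
        self.calls = []

    def record(self, model, prompt_tokens, completion_tokens):
        inp, out = PRICES[model]
        cost = (prompt_tokens * inp + completion_tokens * out) / 1_000_000
        self.calls.append((model, cost))
        return cost

    @property
    def total(self):
        return sum(cost for _, cost in self.calls)

log = ConversationCostLog()
log.record("gpt-4o", 13_000, 800)        # a mid-conversation turn
log.record("gpt-4o-mini", 20_000, 500)   # after downgrade
print(f"conversation so far: ${log.total:.4f}")
```

Hook `record` into wherever your responses come back, and fire your 50%/80% alerts off `log.total`.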

Common AutoGen Cost Mistakes

Mistake 1: Trusting "TERMINATE" to End Conversations

AutoGen conversations end when an agent says "TERMINATE". But agents don't always cooperate. A debugging agent might keep trying to fix code instead of giving up. Always pair the TERMINATE mechanism with a hard turn limit and a budget cap.
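AutoGen's ConversableAgent accepts an `is_termination_msg` callable, so you can make the check stricter and pair it with the hard turn limit. A sketch of the predicate (the commented wiring mirrors the UserProxyAgent setup from earlier):

```python
# A stricter termination check: only fires on a message that actually ends
# with TERMINATE, not on the word buried inside generated code or prose.
def is_termination_msg(message: dict) -> bool:
    content = (message.get("content") or "").strip()
    return content.endswith("TERMINATE")

# user_proxy = UserProxyAgent(
#     name="executor",
#     human_input_mode="NEVER",
#     max_consecutive_auto_reply=10,      # hard stop even if TERMINATE never comes
#     is_termination_msg=is_termination_msg,
# )
```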

Mistake 2: Using GPT-4 for Every Agent in a GroupChat

Your summarizer doesn't need GPT-4o. Your formatter definitely doesn't. Use the cheapest model that works for each role. The researcher and planner get the expensive model; everyone else gets GPT-4o-mini or Gemini Flash.
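One way to wire this up is a small helper that builds a per-role llm_config in AutoGen's format (the role names here are illustrative):

```python
# Tiered model assignment for a GroupChat: leads get the expensive model,
# support roles get the cheap one.
def make_llm_config(model, api_key="your-api-key"):
    return {"config_list": [{"model": model, "api_key": api_key}], "timeout": 120}

llm_configs = {
    "researcher": make_llm_config("gpt-4o"),       # lead: needs reasoning depth
    "planner": make_llm_config("gpt-4o"),
    "summarizer": make_llm_config("gpt-4o-mini"),  # support: cheap model is fine
    "formatter": make_llm_config("gpt-4o-mini"),
}
```

Pass `llm_configs["summarizer"]` (and so on) as each agent's `llm_config` when you construct it.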

Mistake 3: Not Monitoring Code Execution Costs

AutoGen's code execution is "free" in terms of LLM tokens — the code runs locally. But the feedback loop is expensive. Each failed execution generates a new LLM call with the error message appended to an already-long context. Limit code execution retries to 3-5 attempts max.
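A simple way to enforce that cap, independent of AutoGen's own settings, is a bounded fix loop. Here `generate_fix` and `execute` are hypothetical stand-ins for your LLM call and code runner:

```python
# A capped fix loop (sketch): stop regenerating code after max_attempts
# failed executions instead of letting the agents debug forever.
def run_with_retry_cap(generate_fix, execute, max_attempts=3):
    """generate_fix(last_error) -> code; execute(code) -> (ok, error)."""
    error = None
    for attempt in range(max_attempts):
        code = generate_fix(error)
        ok, error = execute(code)
        if ok:
            return attempt + 1  # number of attempts used
    return None  # give up instead of burning more tokens
```

Returning `None` instead of looping again is the whole point: a failed task that cost three calls is recoverable; a debug loop that cost thirty is not.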

What Good AutoGen Cost Control Looks Like

A production AutoGen deployment with proper cost controls:

  • Every conversation has a budget ceiling (TokenFence guard)
  • Every agent has a turn limit (max_consecutive_auto_reply)
  • GroupChats use tiered models (expensive for leads, cheap for support roles)
  • Automatic model downgrade kicks in at 70% budget
  • Cost per conversation is logged and monitored
  • Alerts fire at 50% daily budget
  • Monthly spend is predictable to within ±15%

The difference between a prototype AutoGen app and a production one isn't the agent logic — it's the cost controls. Without them, you're giving your AI agents an unlimited credit card.

TokenFence adds per-workflow budget caps, automatic model downgrade, and kill switches to any LLM client — including AutoGen. Three lines of Python. Open source core. pip install tokenfence

Ready to protect your AI budget?

Three lines of code. Per-workflow budgets. Automatic model downgrade. Hard kill switch.