AutoGen Cost Control: How to Budget Multi-Agent Conversations That Run Forever
AutoGen Agents Talk Until You Run Out of Money
Microsoft's AutoGen is the gold standard for multi-agent conversations. Define agents, give them roles, and let them talk to each other until they solve the problem. It's elegant. It's powerful.
It's also a blank check to your LLM provider.
Here's the fundamental issue: AutoGen agents converse. Agent A says something to Agent B. Agent B responds. Agent A responds to the response. This continues until one of them says "TERMINATE" — or until your API bill says it for them.
In a typical AutoGen two-agent conversation solving a coding task:
- Turn 1: ~1,500 tokens (system prompt + first message)
- Turn 2: ~3,200 tokens (context + response + code)
- Turn 3: ~5,800 tokens (growing context window)
- Turn 4: ~9,100 tokens (code execution results + debug)
- Turn 5: ~13,000 tokens (full conversation history)
- Turn 6-10: 15,000-25,000 tokens each
A 10-turn conversation with GPT-4o: ~$0.85. Seems fine? Now imagine that conversation goes 30 turns because the agents can't agree on the solution. Or one agent hits an error and enters a debug loop. $4-8 per conversation. 50 conversations a day = $200-400/day.
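If the per-turn numbers above look abstract, here's a back-of-envelope model of why cost grows quadratically with turn count. The ~1,800-new-tokens-per-turn figure and the $2.50 per million input tokens are illustrative assumptions (output tokens are ignored), not measured values:

```python
# Back-of-envelope model: each turn appends new tokens and resends the
# entire history, so cumulative input cost grows quadratically with turns.
def conversation_cost(turns, new_tokens_per_turn=1_800, price_per_mtok=2.50):
    total_input = 0
    history = 0
    for _ in range(turns):
        history += new_tokens_per_turn  # context grows every turn
        total_input += history          # full history is resent on each call
    return total_input * price_per_mtok / 1_000_000

print(f"10 turns: ${conversation_cost(10):.2f}")  # ~$0.25 in input tokens
print(f"30 turns: ${conversation_cost(30):.2f}")  # ~$2.09: 3x the turns, ~8x the cost
```

Tripling the turn count doesn't triple the bill; it multiplies it by roughly eight, before output tokens are even counted.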
The Four Cost Traps in AutoGen
Trap 1: Unbounded Conversation Loops
AutoGen's max_consecutive_auto_reply defaults to a generous limit. If agents disagree or encounter edge cases, they'll keep talking. Each turn costs more than the last because the context window grows with every message.
A conversation between an AssistantAgent and UserProxyAgent solving a complex task can easily reach 40+ turns. By turn 30, each message includes the full 30-message history — that's 50,000+ tokens per API call.
Trap 2: Code Execution Feedback Loops
AutoGen's killer feature is code execution — the UserProxyAgent runs code and feeds results back to the AssistantAgent. When code fails (and it will), the agent tries to fix it. Each fix attempt is a full round trip: generate code → execute → read error → generate new code. Runtime errors can trigger 5-10 fix cycles, each more expensive than the last.
Trap 3: GroupChat Token Explosion
AutoGen's GroupChat feature lets 3+ agents collaborate. The GroupChatManager broadcasts every message to every agent. With 4 agents and 20 total messages, each agent processes all 20 messages for every turn. That's 4x the token consumption of a two-agent chat — and it compounds with each turn.
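A rough model makes the multiplier concrete. Assume 400 tokens per message (an arbitrary average for the sake of arithmetic) and that in a two-agent chat only the responder reads each turn, while a GroupChat broadcasts the history to all agents:

```python
# Illustrative model of GroupChat token consumption: every turn, each
# listening agent receives a prompt containing the full message history.
def chat_input_tokens(readers_per_turn, n_messages, tokens_per_message=400):
    return sum(readers_per_turn * m * tokens_per_message
               for m in range(1, n_messages + 1))

pair  = chat_input_tokens(1, 20)  # two-agent chat: one reader per turn
group = chat_input_tokens(4, 20)  # 4-agent GroupChat: broadcast to all four
print(pair, group, group / pair)  # 84000 336000 4.0
```

Same 20 messages, four times the input tokens, and the gap widens as the conversation runs longer.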
Trap 4: The Hidden System Prompt Tax
Every agent in AutoGen has a system prompt that's included in every API call. A detailed system prompt of 500 tokens × 30 turns = 15,000 tokens just for instructions. Multiply by the number of agents in a GroupChat.
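The arithmetic is worth spelling out, because it compounds with agent count:

```python
# The system prompt is resent on every API call, so its token cost scales
# linearly with both turn count and agent count. Plain arithmetic, no library.
def system_prompt_tax(prompt_tokens, turns, n_agents=1):
    return prompt_tokens * turns * n_agents

print(system_prompt_tax(500, 30))     # 15000 tokens for a single agent
print(system_prompt_tax(500, 30, 4))  # 60000 tokens across a 4-agent GroupChat
```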
Adding Budget Caps to AutoGen with TokenFence
TokenFence wraps your LLM client with per-workflow budget limits. Here's how to integrate it with AutoGen:
Step 1: Install
pip install tokenfence pyautogen
Step 2: Create a Budget-Controlled Configuration
from tokenfence import guard
import openai

# Create a guarded client with a $3.00 budget for this conversation
client = guard(openai.OpenAI(), budget=3.00)

# Standard AutoGen OAI config. Note: the budget only applies to calls that
# are actually routed through the guarded client above.
config_list = [
    {
        "model": "gpt-4o",
        "api_key": "your-api-key",  # Still needed for AutoGen config
    }
]

# The budget applies across all agents sharing the guarded client
llm_config = {
    "config_list": config_list,
    "timeout": 120,
}
Step 3: Wire Into Your Agents
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="coder",
    system_message="You are a helpful coding assistant. Write clean, tested Python code.",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,  # Hard limit on turns
    code_execution_config={"work_dir": "coding_output"},
)

# TokenFence will kill the conversation if it exceeds $3.00
user_proxy.initiate_chat(
    assistant,
    message="Write a Python function that finds the longest palindromic substring.",
)
Step 4: Add Automatic Model Downgrade
from tokenfence import guard

# Start with GPT-4o, automatically downgrade to GPT-4o-mini at 70% budget
client = guard(
    openai.OpenAI(),
    budget=3.00,
    downgrade_at=0.7,  # Switch models at 70% of budget
    downgrade_model="gpt-4o-mini",
)

# The first 70% of the conversation uses GPT-4o for quality
# The last 30% uses GPT-4o-mini to save money
# Most conversations reach consensus in the first few turns anyway
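To see the mechanics without relying on TokenFence internals, here's a hand-rolled sketch of the downgrade pattern: track cumulative spend and switch model names once a threshold is crossed. The prices, the 2,000-tokens-per-turn growth, and the `DowngradingRouter` class are all illustrative assumptions, not TokenFence code:

```python
# Sketch of the budget-downgrade pattern. Prices are illustrative
# ($ per 1M input tokens); output tokens are ignored for simplicity.
PRICES = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

class DowngradingRouter:
    def __init__(self, budget, downgrade_at=0.7,
                 primary="gpt-4o", fallback="gpt-4o-mini"):
        self.budget = budget
        self.downgrade_at = downgrade_at
        self.primary = primary
        self.fallback = fallback
        self.spent = 0.0

    def pick_model(self):
        # Downgrade once cumulative spend crosses the threshold
        if self.spent >= self.budget * self.downgrade_at:
            return self.fallback
        return self.primary

    def record(self, model, input_tokens):
        self.spent += input_tokens * PRICES[model] / 1_000_000

router = DowngradingRouter(budget=3.00)
for turn in range(1, 40):           # simulate prompts that grow each turn
    model = router.pick_model()
    router.record(model, input_tokens=turn * 2_000)

print(router.pick_model())  # gpt-4o-mini
print(router.spent < 3.00)  # True
```

In this simulation the router rides GPT-4o for the first ~29 turns, then finishes the conversation on the cheap model while staying under the $3.00 ceiling.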
Advanced: Per-Agent Budgets in GroupChat
In a GroupChat, you want different budgets for different roles. The research agent needs more budget for complex queries. The summarizer needs less.
from tokenfence import guard
import openai

# Different budgets for different roles
researcher_client = guard(openai.OpenAI(), budget=2.00)
analyst_client = guard(openai.OpenAI(), budget=1.50)
writer_client = guard(openai.OpenAI(), budget=1.00)

# Each agent gets its own cost ceiling; route each agent's calls through
# its own guarded client so the per-agent budget actually applies
researcher = AssistantAgent(
    name="researcher",
    system_message="Find relevant data and papers.",
    llm_config={"config_list": config_list},  # backed by researcher_client
)

analyst = AssistantAgent(
    name="analyst",
    system_message="Analyze the data and find insights.",
    llm_config={"config_list": config_list},  # backed by analyst_client
)

writer = AssistantAgent(
    name="writer",
    system_message="Write a clear summary of the findings.",
    llm_config={"config_list": config_list},  # backed by writer_client
)

# Total budget across all agents: $4.50
# If any agent exceeds its budget, only that agent stops
# The others can continue the conversation
The Kill Switch: Emergency Stop for Runaway Conversations
Sometimes you need to kill a conversation immediately. Maybe you're monitoring costs in real-time and see a spike. Maybe an agent entered an infinite debug loop.
from tokenfence import guard

client = guard(
    openai.OpenAI(),
    budget=5.00,
    on_budget_hit="kill",  # Immediately raise an exception when budget is exceeded
)

# With on_budget_hit="kill", TokenFence raises a BudgetExceeded exception.
# The exception propagates out of the conversation, so your calling code
# can catch it and shut down cleanly.
# No more API calls. No more tokens burned.
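The calling pattern looks like this. Since we can't reproduce TokenFence's actual exception here, the snippet uses a stand-in `BudgetExceeded` class and a simulated conversation loop; the shape of the try/except is what matters:

```python
# Sketch of handling a hard budget stop. BudgetExceeded is a stand-in for
# the exception TokenFence is described as raising; the loop simulates a
# conversation that spends a fixed amount per turn.
class BudgetExceeded(Exception):
    """Stand-in for TokenFence's budget exception."""

def run_conversation(spend_per_turn, budget, max_turns=50):
    spent = 0.0
    for turn in range(1, max_turns + 1):
        spent += spend_per_turn
        if spent > budget:
            raise BudgetExceeded(f"spent ${spent:.2f} of ${budget:.2f}")
    return spent

try:
    run_conversation(spend_per_turn=0.40, budget=5.00)
    outcome = "completed"
except BudgetExceeded as exc:
    # log, alert, and return a partial result instead of burning more tokens
    outcome = f"killed: {exc}"

print(outcome)  # killed: spent $5.20 of $5.00
```

The important property: once the exception fires, no further API calls are possible on that conversation, no matter how determined the agents are.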
Real-World Cost Comparison: With and Without TokenFence
| Scenario | Without TokenFence | With TokenFence | Savings |
|---|---|---|---|
| 2-agent coding task (10 turns) | $0.85 | $0.85 | 0% (within budget) |
| 2-agent task with debug loop (35 turns) | $6.20 | $3.00 (capped) | 52% |
| 4-agent GroupChat (25 turns) | $12.40 | $5.50 (capped) | 56% |
| 4-agent GroupChat with disagreement (50+ turns) | $28.00+ | $5.50 (capped) | 80%+ |
| Daily workload (50 conversations) | $180-420 | $75-150 | 58-64% |
| Monthly projection | $5,400-12,600 | $2,250-4,500 | 58-64% |
Seven-Point AutoGen Cost Control Checklist
- Set max_consecutive_auto_reply on every UserProxyAgent. Never leave it at the default for production.
- Wrap your LLM client with TokenFence. Set a per-conversation budget that matches your unit economics.
- Enable automatic model downgrade. Start with your best model, switch to a cheaper one as the conversation progresses.
- Use per-agent budgets in GroupChat. Not all agents need the same spend ceiling.
- Keep system prompts short. Every token in your system prompt is repeated on every turn. 500 tokens × 30 turns = 15,000 wasted tokens.
- Log costs per conversation. You can't optimize what you can't measure. Track which conversation types cost the most.
- Set up alerts at 50% and 80% budget. Don't wait until you hit the ceiling to notice something's wrong.
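For the logging point, a minimal per-conversation ledger is enough to get started. The `CostLedger` class below is a sketch of the idea, not part of TokenFence:

```python
# Minimal per-conversation cost ledger: tag each API call with a
# conversation id so you can see which workflow types are expensive.
from collections import defaultdict

class CostLedger:
    def __init__(self):
        self.by_conversation = defaultdict(float)

    def record(self, conversation_id, cost):
        self.by_conversation[conversation_id] += cost

    def most_expensive(self, n=3):
        return sorted(self.by_conversation.items(),
                      key=lambda kv: kv[1], reverse=True)[:n]

ledger = CostLedger()
ledger.record("coding-task-17", 0.85)
ledger.record("groupchat-research-4", 5.50)
ledger.record("coding-task-17", 2.15)
print(ledger.most_expensive(1))  # [('groupchat-research-4', 5.5)]
```

Even this crude breakdown usually reveals that a handful of conversation types dominate the bill.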
Common AutoGen Cost Mistakes
Mistake 1: Trusting "TERMINATE" to End Conversations
AutoGen conversations end when an agent says "TERMINATE". But agents don't always cooperate. A debugging agent might keep trying to fix code instead of giving up. Always pair the TERMINATE mechanism with a hard turn limit and a budget cap.
Mistake 2: Using GPT-4o for Every Agent in a GroupChat
Your summarizer doesn't need GPT-4o. Your formatter definitely doesn't. Use the cheapest model that works for each role. The researcher and planner get the expensive model; everyone else gets GPT-4o-mini or Gemini Flash.
Mistake 3: Not Monitoring Code Execution Costs
AutoGen's code execution is "free" in terms of LLM tokens — the code runs locally. But the feedback loop is expensive. Each failed execution generates a new LLM call with the error message appended to an already-long context. Limit code execution retries to 3-5 attempts max.
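One way to enforce that limit is to bound the fix loop yourself. `bounded_debug_loop` and the fake generate/execute functions below are hypothetical stand-ins for your LLM call and code runner, not AutoGen APIs:

```python
# Sketch of a bounded fix loop: cap the generate -> execute -> read-error
# cycles so a stubborn bug cannot burn through the whole budget.
def bounded_debug_loop(generate_fix, execute, max_attempts=3):
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate_fix(error)   # stand-in for the LLM call
        ok, error = execute(code)    # stand-in for local code execution
        if ok:
            return attempt           # fixed within the retry budget
    return None                      # give up instead of looping forever

# Simulated run: the generated "code" only succeeds on the 3rd attempt
attempts = {"n": 0}
def fake_generate(error):
    return "print('hi')"
def fake_execute(code):
    attempts["n"] += 1
    success = attempts["n"] >= 3
    return success, None if success else "SyntaxError"

result = bounded_debug_loop(fake_generate, fake_execute)
print(result)  # 3
```

Returning `None` on exhaustion forces the caller to decide what to do next, rather than letting the agents decide to keep spending.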
What Good AutoGen Cost Control Looks Like
A production AutoGen deployment with proper cost controls:
- Every conversation has a budget ceiling (TokenFence guard)
- Every agent has a turn limit (max_consecutive_auto_reply)
- GroupChats use tiered models (expensive for leads, cheap for support roles)
- Automatic model downgrade kicks in at 70% budget
- Cost per conversation is logged and monitored
- Alerts fire at 50% daily budget
- Monthly spend is predictable ±15%
The difference between a prototype AutoGen app and a production one isn't the agent logic — it's the cost controls. Without them, you're giving your AI agents an unlimited credit card.
TokenFence adds per-workflow budget caps, automatic model downgrade, and kill switches to any LLM client — including AutoGen. Two lines of Python. Open source core. pip install tokenfence
Ready to protect your AI budget?
Two lines of code. Per-workflow budgets. Automatic model downgrade. Hard kill switch.