How to Control Costs in Async AI Agent Pipelines
If you're building AI agents in 2026, you're probably running async. But async makes cost overruns worse, not better.
The Async Cost Problem
When you fire 10 concurrent OpenAI calls via asyncio.gather(), all ten requests are dispatched before any response returns. By the time the first response comes back and you discover you're over budget, the other nine are already burning tokens. Without per-workflow budget enforcement at the SDK level, async is a cost amplifier.
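To see the amplification concretely, here is a minimal simulation (no real API calls; the per-call cost is a made-up figure). Each coroutine commits its cost before awaiting the "network", so by the time the first one finishes, all ten have already spent:

```python
import asyncio

# Hypothetical illustration of the async cost problem: ten concurrent
# "LLM calls" that each incur cost at dispatch, before any result can
# be inspected. cost_per_call is an invented number.
async def fake_llm_call(i, ledger, cost_per_call=0.50):
    ledger.append(cost_per_call)   # cost is committed before the await
    await asyncio.sleep(0.01)      # simulate network latency
    return f"response {i}"

async def main():
    ledger = []
    await asyncio.gather(*(fake_llm_call(i, ledger) for i in range(10)))
    return sum(ledger)

total = asyncio.run(main())
print(f"${total:.2f}")  # all 10 calls have spent before any check could run
```

An external monitor that polls spend after responses arrive is always too late here; the check has to happen before each call is dispatched.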
The Fix: SDK-Level Budget Enforcement
You need cost tracking that sits inside the async call chain — not as an external monitor.
```python
import asyncio

import openai
from tokenfence import async_guard

async def safe_research():
    client = async_guard(
        openai.AsyncOpenAI(),
        budget="$5.00",
        fallback="gpt-4o-mini",
        on_limit="stop",
    )
    tasks = [
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(10)
    ]
    responses = await asyncio.gather(*tasks)
    print(f"Total cost: ${client.tokenfence.spent:.4f}")
```
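The core mechanism behind a guard like this can be sketched in a few lines: a counter protected by an asyncio.Lock that every call must charge before it runs. This is a hypothetical hand-rolled version to show the idea, not tokenfence's actual implementation:

```python
import asyncio

class BudgetExceededError(Exception):
    pass

class BudgetGuard:
    """Toy async budget guard: check and record spend atomically."""
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0
        self._lock = asyncio.Lock()

    async def charge(self, cost: float):
        # The lock makes check-then-spend atomic across concurrent tasks,
        # so two tasks can't both squeeze under the cap at once.
        async with self._lock:
            if self.spent + cost > self.budget:
                raise BudgetExceededError(f"${self.spent:.2f} already spent")
            self.spent += cost

async def guarded_call(guard: BudgetGuard, cost: float) -> bool:
    try:
        await guard.charge(cost)
        return True   # call would proceed
    except BudgetExceededError:
        return False  # call refused before spending anything

async def main():
    guard = BudgetGuard(budget=1.00)
    results = await asyncio.gather(*(guarded_call(guard, 0.30) for _ in range(10)))
    return results.count(True), results.count(False)

allowed, refused = asyncio.run(main())
print(allowed, refused)  # 3 calls fit in $1.00; the other 7 are refused
```

Because the check runs inside the call chain, the refused calls never reach the API at all, which is exactly the property an external monitor can't give you.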
Pattern: Per-Request Budgets in FastAPI
Each API request gets its own budget — no blast radius across users:
```python
import openai
from fastapi import FastAPI, HTTPException
from tokenfence import async_guard, BudgetExceeded

app = FastAPI()

@app.post("/analyze")
async def analyze(text: str):
    client = async_guard(
        openai.AsyncOpenAI(),
        budget="$0.25",
        fallback="gpt-4o-mini",
        on_limit="raise",
    )
    try:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Analyze: {text}"}],
        )
        return {"analysis": response.choices[0].message.content}
    except BudgetExceeded:
        raise HTTPException(429, "Budget exceeded")
```
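The isolation comes from constructing the guard inside the handler: each request gets a fresh spend counter. A module-level guard would share one budget across every user. Sketched abstractly (RequestBudget is a hypothetical stand-in for the guard):

```python
import asyncio

class RequestBudget:
    """Toy per-request spend counter, a stand-in for an SDK-level guard."""
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

async def handle(cost: float) -> str:
    budget = RequestBudget(0.25)   # fresh counter per request
    budget.spent += cost
    return "ok" if budget.spent <= budget.budget else "blocked"

async def main():
    # Two concurrent requests, each spending $0.20 against its own $0.25 cap.
    return await asyncio.gather(handle(0.20), handle(0.20))

results = asyncio.run(main())
print(results)  # both succeed: the budgets are independent
```

With a single shared counter, the second request would be blocked at $0.40 total, meaning one user's heavy traffic would starve everyone else.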
Pattern: Multi-Agent Tiered Budgets
Cheap, high-volume roles get tight caps; expensive research work gets headroom:

```python
# handle_support, run_analysis, and deep_research are your own agent coroutines.
support = async_guard(openai.AsyncOpenAI(), budget="$0.10", on_limit="stop")
analyst = async_guard(openai.AsyncOpenAI(), budget="$2.00", fallback="gpt-4o-mini")
researcher = async_guard(openai.AsyncOpenAI(), budget="$10.00", fallback="gpt-4o-mini")

results = await asyncio.gather(
    handle_support(support),
    run_analysis(analyst),
    deep_research(researcher),
)
```
Rate Limits vs Budget Caps
Rate limits control how fast you call the API. Budget caps control how much you spend. They're complementary, not interchangeable.
- Rate limits: cap how many requests go out per window, but a single long-context call can still slip through at full cost
- Budget caps: cap total spend, with per-workflow isolation and automatic model downgrade
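The gap is easy to demonstrate. A request counter happily admits an expensive call because it only counts requests, never dollars (the limiter below is a deliberately simplified toy, and the costs are invented):

```python
# Toy rate limiter: caps requests per window, knows nothing about cost.
class RateLimiter:
    def __init__(self, max_per_window: int):
        self.max_per_window = max_per_window
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.max_per_window:
            self.count += 1
            return True
        return False

limiter = RateLimiter(max_per_window=5)
costs = [0.01, 0.01, 4.80, 0.01, 0.01, 0.01]  # one huge long-context call
spent = sum(c for c in costs if limiter.allow())
print(f"${spent:.2f}")  # well under the rate limit, yet dollars have leaked
```

Only the sixth request is rejected, and the $4.80 call sails through, which is why you want both controls: the rate limiter for burst protection, the budget cap for spend protection.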
Getting Started
```shell
pip install "tokenfence[openai]"
```

```python
from tokenfence import async_guard  # same API as guard(), fully async
```
Read the full async guide or browse examples on GitHub.
Ready to protect your AI budget?
Two lines of code. Per-workflow budgets. Automatic model downgrade. Hard kill switch.