Async · Python · Cost Control · FastAPI

How to Control Costs in Async AI Agent Pipelines

7 min read

If you're building AI agents in 2026, you're probably running async. But async makes cost overruns worse, not better.

The Async Cost Problem

When you fire ten concurrent OpenAI calls via asyncio.gather(), every request is dispatched before any single one completes. By the time your first response comes back over budget, the other nine are already burning tokens. Without per-workflow budget enforcement at the SDK level, async is a cost amplifier.
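To see the amplification concretely, here is a toy simulation of that failure mode — no real API calls, and `fake_llm_call` plus the prices are made up for illustration. Ten concurrent "calls" each commit their spend before any after-the-fact budget check can intervene:

```python
import asyncio

# Made-up numbers: ten concurrent calls at $0.60 each against a $5.00
# budget that is only checked after responses come back.
BUDGET = 5.00

async def fake_llm_call(cost: float) -> float:
    await asyncio.sleep(0.01)  # all ten calls are in flight at the same time
    return cost                # the spend is committed before any check runs

async def naive_pipeline() -> float:
    # Post-hoc budget check: by the time we can compare spend to BUDGET,
    # every task has already finished and billed us.
    costs = await asyncio.gather(*(fake_llm_call(0.60) for _ in range(10)))
    return sum(costs)

total = asyncio.run(naive_pipeline())
print(f"Budget ${BUDGET:.2f}, actually spent ${total:.2f}")
```

The check-after-gather shape is exactly what an external cost monitor gives you, which is why the enforcement has to live inside the call chain instead.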

The Fix: SDK-Level Budget Enforcement

You need cost tracking that sits inside the async call chain — not as an external monitor.

import asyncio
import openai
from tokenfence import async_guard

async def safe_research():
    # Wrap the async client once; every call made through it draws down
    # the same $5.00 budget. As the budget runs low, calls downgrade to
    # the fallback model; at the limit, on_limit="stop" halts them.
    client = async_guard(
        openai.AsyncOpenAI(),
        budget="$5.00",
        fallback="gpt-4o-mini",
        on_limit="stop",
    )

    tasks = [
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Question {i}"}],
        )
        for i in range(10)
    ]
    responses = await asyncio.gather(*tasks)
    print(f"Total cost: ${client.tokenfence.spent:.4f}")
    return responses

asyncio.run(safe_research())

Pattern: Per-Request Budgets in FastAPI

Each API request gets its own budget — no blast radius across users:

import openai
from fastapi import FastAPI, HTTPException
from tokenfence import async_guard, BudgetExceeded

app = FastAPI()

@app.post("/analyze")
async def analyze(text: str):
    client = async_guard(
        openai.AsyncOpenAI(),
        budget="$0.25",
        fallback="gpt-4o-mini",
        on_limit="raise",
    )
    try:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Analyze: {text}"}],
        )
        return {"analysis": response.choices[0].message.content}
    except BudgetExceeded:
        raise HTTPException(429, "Budget exceeded")

Pattern: Multi-Agent Tiered Budgets

# handle_support, run_analysis, deep_research are your own agent coroutines;
# each receives a client capped at a budget matched to its role.
support = async_guard(openai.AsyncOpenAI(), budget="$0.10", on_limit="stop")
analyst = async_guard(openai.AsyncOpenAI(), budget="$2.00", fallback="gpt-4o-mini")
researcher = async_guard(openai.AsyncOpenAI(), budget="$10.00", fallback="gpt-4o-mini")

results = await asyncio.gather(
    handle_support(support),
    run_analysis(analyst),
    deep_research(researcher),
)

Rate Limits vs Budget Caps

Rate limits control how fast you call the API. Budget caps control how much you spend. They're complementary, not interchangeable.

  • Rate limits: Cap how many requests you make per second, but a slow stream of expensive calls still blows the bill
  • Budget caps: Cap how many dollars you spend, with per-workflow isolation and automatic model downgrade
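The difference is easy to demonstrate with a toy sketch — the names, prices, and caps below are ours, not tokenfence's. A semaphore throttles how many calls run at once but lets every expensive call through eventually, while a pre-spend budget check refuses calls outright once the cap is hit:

```python
import asyncio

async def with_rate_limit(sem: asyncio.Semaphore, cost: float, ledger: list):
    async with sem:                  # rate limit: at most 2 calls in flight
        await asyncio.sleep(0)       # stand-in for the actual API call
        ledger.append(cost)          # the spend happens regardless of price

async def with_budget_cap(budget: float, cost: float, ledger: list) -> bool:
    if sum(ledger) + cost > budget:  # budget cap: refuse *before* spending
        return False
    await asyncio.sleep(0)
    ledger.append(cost)
    return True

async def main():
    rate_ledger, budget_ledger = [], []
    sem = asyncio.Semaphore(2)
    # Rate limiting alone: all ten $0.50 calls eventually run -> $5.00 spent.
    await asyncio.gather(
        *(with_rate_limit(sem, 0.50, rate_ledger) for _ in range(10))
    )
    # A $2.00 cap admits four calls and refuses the rest, however slowly
    # they trickle in.
    for _ in range(10):
        await with_budget_cap(2.00, 0.50, budget_ledger)
    return sum(rate_ledger), sum(budget_ledger)

rate_total, budget_total = asyncio.run(main())
print(f"rate-limited spend: ${rate_total:.2f}, budget-capped spend: ${budget_total:.2f}")
```

In practice you want both: rate limits to stay under the provider's throughput quotas, budget caps to bound the dollars.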

Getting Started

pip install tokenfence[openai]
from tokenfence import async_guard
# Same API as guard(), fully async.

Read the full async guide or browse examples on GitHub.

Ready to protect your AI budget?

Two lines of code. Per-workflow budgets. Automatic model downgrade. Hard kill switch.