AI Agents · Cost Control · Checklist · Production · LLM · DevOps · TokenFence · Best Practices

AI Agent Cost Control Checklist: 15 Things to Ship Before Your Agents Go Live

11 min read

The Pre-Production Checklist Nobody Gives You

You’ve built your AI agent. It works in dev. The demo impressed stakeholders. Now someone says: "Ship it."

Here’s what happens next at most companies: the agent goes live, costs spike 20-50x within the first week, and someone scrambles to add guardrails after the damage is done.

This checklist exists so that doesn’t happen to you. Fifteen items, ordered by priority. Each one is a specific, actionable thing your team can implement before (or immediately after) going live. No theory — just the checklist.

The Checklist

1. Set a Per-Request Dollar Cap

Every single API call to an LLM should have a maximum cost. Not a monthly budget. Not an org-wide limit. A per-request cap that prevents any single call from costing more than you expect.

from tokenfence import guard

# Every request capped at $0.50
safe_client = guard(client, max_cost=0.50)
response = safe_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

Why it matters: A single GPT-4o call with a full 128K context window costs ~$5. An agent that makes 10 such calls in a loop costs $50 in seconds. Per-request caps are your first line of defense.

2. Set a Per-Workflow Budget

Agents don’t make one API call — they make dozens. A research agent might call GPT-4o 15-30 times to complete a single task. Your budget needs to cover the entire workflow, not just individual calls.

# Entire workflow capped at $2.00
safe_client = guard(client, max_cost=2.00, max_requests=50)

# Agent can make up to 50 calls totaling $2.00 max
for step in agent_workflow:
    response = safe_client.chat.completions.create(
        model="gpt-4o",
        messages=step.messages
    )

The rule: Run your agent on 100 sample inputs. Calculate P95 cost. Set your workflow budget at 2x P95. This gives headroom for edge cases without allowing runaways.

3. Add Request Count Limits

Even cheap models add up. A gpt-4o-mini call costs ~$0.001, but an agent in a loop making 10,000 calls costs $10. Request limits catch infinite loops that dollar caps might miss (because each individual call is cheap).

# Budget + request limit = double protection
safe_client = guard(client, max_cost=1.00, max_requests=100)

Why both? Dollar caps catch expensive calls. Request caps catch cheap-but-infinite loops. You need both.

4. Configure Automatic Model Downgrade

When budget runs low, your agent should degrade gracefully — not crash. Auto-downgrade switches to a cheaper model when you’ve used 70-80% of your budget.

safe_client = guard(
    client,
    max_cost=2.00,
    model_downgrade={
        "gpt-4o": "gpt-4o-mini",
        "claude-3.5-sonnet": "claude-3.5-haiku"
    }
)

The tradeoff: Slightly lower quality responses vs. a crashed agent or a blown budget. For 90% of use cases, the cheaper model’s output is good enough for the remaining steps.

5. Implement a Kill Switch

When things go wrong, you need to be able to stop everything. Not "wait for the next billing cycle." Not "redeploy with new config." A kill switch that halts all agent API calls immediately.

safe_client = guard(client, max_cost=5.00)

# In your ops dashboard or alerting system:
if emergency_detected:
    safe_client.kill()  # All subsequent calls raise BudgetExceeded

Test your kill switch. Seriously. Run a test where you trigger it mid-workflow and verify the agent stops cleanly. A kill switch that doesn’t work is worse than no kill switch — it gives false confidence.

6. Enforce Least-Privilege Tool Access

Your agent has access to tools: database queries, email sending, file operations, API calls. It should only be able to use the tools it actually needs, and nothing more.

from tokenfence import Policy

policy = Policy()
policy.allow("database:read:*")      # Can read any table
policy.deny("database:write:users")  # Cannot modify users table
policy.deny("database:drop:*")       # Cannot drop anything
policy.require_approval("email:send:external")  # Human approval for external emails

# At runtime:
result = policy.check("database:drop:production")
# result.decision == Decision.DENY

The Meta lesson: In March 2026, an AI agent at Meta triggered a SEV1 incident by executing database operations it should never have had access to. Prompts are not permissions. Runtime policy enforcement is.

7. Set Up Per-User or Per-Tenant Budgets

If your agent serves multiple users, each user needs their own budget. One user’s heavy usage shouldn’t drain the budget for everyone else.

def handle_user_request(user_id: str, prompt: str):
    user_budget = get_user_budget(user_id)  # e.g., $0.10/request for free tier
    safe_client = guard(client, max_cost=user_budget)
    
    response = safe_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    
    log_user_cost(user_id, safe_client.total_cost)

Tier your budgets. Free users get $0.10/request. Pro users get $1.00. Enterprise gets $5.00. This isn’t just cost control — it’s your business model.
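One way to back that tiering with code, as a sketch; TIER_BUDGETS and get_user_budget are illustrative names, not TokenFence APIs:

```python
# Per-request budgets by plan tier (amounts from the tiers above).
TIER_BUDGETS = {
    "free": 0.10,
    "pro": 1.00,
    "enterprise": 5.00,
}

def get_user_budget(tier: str) -> float:
    # Unknown tiers fall back to the tightest budget rather than failing open.
    return TIER_BUDGETS.get(tier, TIER_BUDGETS["free"])
```

The lookup then feeds straight into guard(client, max_cost=get_user_budget(tier)).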

8. Log Every API Call With Cost Data

You can’t control what you can’t measure. Every LLM API call should log: model used, input/output tokens, cost, latency, and which workflow triggered it.

import logging

safe_client = guard(client, max_cost=2.00)
response = safe_client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

logging.info(f"LLM call: model={response.model}, "
             f"tokens={response.usage.total_tokens}, "
             f"cost=${safe_client.total_cost:.4f}, "
             f"workflow={workflow_id}")

Feed this into your observability stack. Datadog, Grafana, or even a simple CSV. The data tells you which workflows are expensive, which users cost the most, and where to optimize.

9. Test With Production-Realistic Inputs

Your agent costs $0.02 per call in dev because your test inputs are short and simple. Production inputs will be longer, more complex, and more varied. Test with real data.

The process:

  1. Collect 100 representative production inputs (or as close as you can get)
  2. Run your agent on all 100
  3. Calculate: P50, P75, P95, P99 costs per request
  4. Set your budget at 2x P95
  5. Set your kill switch at 3x P99

If you skip this step, your first day in production will be the test — with real money.
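The five steps above can be scripted with the standard library. A sketch, where budget_from_samples is a hypothetical helper (not a TokenFence API) that takes your 100 per-request costs:

```python
import statistics

def budget_from_samples(costs: list[float]) -> dict:
    """Derive budget and kill-switch thresholds from sampled per-request costs."""
    qs = statistics.quantiles(costs, n=100)  # qs[k-1] approximates the k-th percentile
    p95, p99 = qs[94], qs[98]
    return {
        "p50": statistics.median(costs),
        "p95": p95,
        "p99": p99,
        "budget": round(2 * p95, 4),       # step 4: budget = 2x P95
        "kill_switch": round(3 * p99, 4),  # step 5: kill switch = 3x P99
    }
```

Feed the result into your guard configuration instead of guessing round numbers.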

10. Handle Budget Exceeded Gracefully

When an agent hits its budget, what happens? If the answer is "it crashes with an unhandled exception," you have a problem.

from tokenfence import guard, BudgetExceeded

safe_client = guard(client, max_cost=1.00)

try:
    response = safe_client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
except BudgetExceeded as e:
    # Return a graceful fallback
    return {
        "status": "budget_exceeded",
        "message": "This request exceeded the cost limit. "
                   "Try a simpler query or contact support.",
        "cost_so_far": e.total_cost
    }

Users should never see a 500 error because your agent ran out of budget. Design the failure mode.

11. Separate Dev/Staging/Production Budgets

Your dev environment should have tight budgets ($0.10/request) so developers feel the cost pressure early. Staging should mirror production budgets. Production budgets should be set based on real data (see #9).

import os

BUDGET_MAP = {
    "development": 0.10,
    "staging": 1.00,
    "production": 2.00,
}

env = os.getenv("APP_ENV", "development")
safe_client = guard(client, max_cost=BUDGET_MAP[env])

Why dev budgets matter: If developers never see cost pressure, they’ll build agents that make 50 API calls when 5 would suffice. Tight dev budgets force efficient agent design.

12. Implement Rate Limiting (Separate From Cost Limits)

Cost limits and rate limits solve different problems. Cost limits prevent expensive requests. Rate limits prevent high-frequency requests — even if each one is cheap.

You need both:

  • Cost limit: "This workflow can’t spend more than $2.00"
  • Rate limit: "This user can’t make more than 10 requests per minute"

Rate limiting prevents abuse. Cost limiting prevents accidents. Together, they cover the full risk surface.
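TokenFence's guard covers the cost side; the rate side can sit in front of it as a plain per-user sliding window. A generic sketch (RateLimiter is not a TokenFence API):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter: at most max_requests per window seconds, per key."""

    def __init__(self, max_requests: int, window: float = 60.0):
        self.max_requests = max_requests
        self.window = window
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Check limiter.allow(user_id) before every request; only then hand the call to the cost-guarded client.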

13. Set Up Alerts Before You Need Them

Don’t wait for a cost spike to set up monitoring. Before go-live, configure alerts for:

  • Single request over $X: "A single API call just cost $3.50" — investigate immediately
  • Hourly spend over $Y: "We’ve spent $50 in the last hour" — possible runaway agent
  • Daily spend over $Z: "Daily spend is 3x yesterday" — traffic spike or regression
  • Error rate over N%: "40% of requests are hitting budget limits" — budgets too tight or agent is misbehaving

Alert fatigue is real. Start with 3-4 high-signal alerts. Add more as you learn your normal patterns.
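A minimal threshold check for those four alerts, assuming you already aggregate rolling metrics from your cost logs; check_alerts and its field names are illustrative, not a TokenFence API:

```python
def check_alerts(metrics: dict, thresholds: dict) -> list[str]:
    """Compare rolling spend metrics against thresholds; return alert messages."""
    alerts = []
    if metrics["max_request_cost"] > thresholds["per_request"]:
        alerts.append(f"Single request cost ${metrics['max_request_cost']:.2f}")
    if metrics["hourly_spend"] > thresholds["hourly"]:
        alerts.append(f"Hourly spend ${metrics['hourly_spend']:.2f}")
    if metrics["daily_spend"] > thresholds["daily"]:
        alerts.append(f"Daily spend ${metrics['daily_spend']:.2f}")
    if metrics["budget_error_rate"] > thresholds["error_rate"]:
        alerts.append(f"{metrics['budget_error_rate']:.0%} of requests hitting budget limits")
    return alerts
```

Wire the returned messages into whatever pager or Slack webhook you already use.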

14. Document Your Cost Model

Write down: what each agent workflow costs on average, what the P95 cost is, what the budget is set to, and who owns the budget. Put it in your runbook.

| Workflow | Avg Cost | P95 Cost | Budget | Owner |
| --- | --- | --- | --- | --- |
| Customer support | $0.12 | $0.45 | $1.00 | Support team |
| Code review | $0.80 | $2.10 | $5.00 | Platform team |
| Content generation | $0.25 | $0.90 | $2.00 | Marketing team |
| Data analysis | $1.50 | $4.20 | $10.00 | Data team |

Review this quarterly. Model prices change, agent behaviors evolve, and usage patterns shift. Your cost model should be a living document.

15. Run a Pre-Production Cost Drill

Before go-live, simulate a worst case:

  1. Runaway agent: Set up an agent that loops forever. Verify the kill switch stops it within your SLA.
  2. Budget spike: Send 100 concurrent requests. Verify per-request budgets hold under load.
  3. Model unavailability: Mock a model being down. Verify auto-downgrade kicks in correctly.
  4. Policy violation: Attempt a tool call that should be denied. Verify the policy engine blocks it.

If any of these fail in a drill, they’ll fail in production. Better to find out now.
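Drill #1 can be automated against a stub that mimics a guarded client. FakeGuardedClient and its interface are assumptions for this sketch, not the real TokenFence client:

```python
class FakeGuardedClient:
    """Stand-in for a guarded client: raises after the request cap, supports kill()."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.calls = 0
        self._killed = False

    def call(self):
        if self._killed or self.calls >= self.max_requests:
            raise RuntimeError("budget exceeded")
        self.calls += 1

    def kill(self):
        self._killed = True

def drill_runaway(client, loop_iterations: int = 10_000) -> int:
    """Simulate an agent stuck in a loop; return how many calls got through."""
    for _ in range(loop_iterations):
        try:
            client.call()
        except RuntimeError:
            break  # the guard stopped the loop, as it should
    return client.calls
```

Swap the stub for your real guarded client in staging and assert the same invariants: the loop stops at the cap, and a killed client lets zero calls through.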

The Quick Reference Card

| # | Item | Priority | Implementation |
| --- | --- | --- | --- |
| 1 | Per-request dollar cap | 🔴 Critical | guard(client, max_cost=X) |
| 2 | Per-workflow budget | 🔴 Critical | guard(client, max_cost=X, max_requests=N) |
| 3 | Request count limits | 🔴 Critical | max_requests=N |
| 4 | Auto model downgrade | 🟡 High | model_downgrade={...} |
| 5 | Kill switch | 🔴 Critical | safe_client.kill() |
| 6 | Least-privilege tools | 🟡 High | Policy.allow/deny/require_approval |
| 7 | Per-user budgets | 🟡 High | Budget by user tier |
| 8 | Cost logging | 🔴 Critical | Log model, tokens, cost per call |
| 9 | Production-realistic testing | 🟡 High | 100 sample inputs, calculate P95 |
| 10 | Graceful budget exceeded | 🟡 High | Catch BudgetExceeded, return fallback |
| 11 | Env-specific budgets | 🟠 Medium | Dev=tight, Staging=mirror, Prod=data-based |
| 12 | Rate limiting | 🟠 Medium | Separate from cost limits |
| 13 | Cost alerts | 🟡 High | Per-request, hourly, daily thresholds |
| 14 | Cost model documentation | 🟠 Medium | Runbook with avg/P95/budget/owner |
| 15 | Pre-production cost drill | 🟠 Medium | Simulate runaway, spike, downgrade, policy |

How Many Can You Check Off Today?

Most teams shipping AI agents to production have 2-3 of these 15 items covered. The teams that avoid cost incidents have 10+.

The good news: items 1-5 take about 15 minutes total with TokenFence. That’s the critical tier — the items that prevent the worst outcomes.

# Get started in 30 seconds
pip install tokenfence
# or
npm install tokenfence

Start with the 🔴 Critical items. Add the 🟡 High items before your first week in production. Add the 🟠 Medium items as you scale.

Your agents are only as safe as the guardrails you put around them. This checklist is your guardrail.

TokenFence is open source (MIT). Community edition is free with zero limits. Pro adds dashboard, alerts, and budget pooling. tokenfence.dev/pricing

Ready to protect your AI budget?

Two lines of code. Per-workflow budgets. Automatic model downgrade. Hard kill switch.