AI Agent Cost Control Checklist: 15 Things to Ship Before Your Agents Go Live
The Pre-Production Checklist Nobody Gives You
You’ve built your AI agent. It works in dev. The demo impressed stakeholders. Now someone says: "Ship it."
Here’s what happens next at most companies: the agent goes live, costs spike 20-50x within the first week, and someone scrambles to add guardrails after the damage is done.
This checklist exists so that doesn’t happen to you. Fifteen items, ordered by priority. Each one is a specific, actionable thing your team can implement before (or immediately after) going live. No theory — just the checklist.
The Checklist
1. Set a Per-Request Dollar Cap
Every single API call to an LLM should have a maximum cost. Not a monthly budget. Not an org-wide limit. A per-request cap that prevents any single call from costing more than you expect.
```python
from tokenfence import guard

# Every request capped at $0.50
safe_client = guard(client, max_cost=0.50)

response = safe_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
```
Why it matters: A single GPT-4o call with a full 128K context window costs ~$5. An agent that makes 10 such calls in a loop costs $50 in seconds. Per-request caps are your first line of defense.
2. Set a Per-Workflow Budget
Agents don’t make one API call — they make dozens. A research agent might call GPT-4o 15-30 times to complete a single task. Your budget needs to cover the entire workflow, not just individual calls.
```python
# Entire workflow capped at $2.00
safe_client = guard(client, max_cost=2.00, max_requests=50)

# Agent can make up to 50 calls totaling $2.00 max
for step in agent_workflow:
    response = safe_client.chat.completions.create(
        model="gpt-4o",
        messages=step.messages
    )
```
The rule: Run your agent on 100 sample inputs. Calculate P95 cost. Set your workflow budget at 2x P95. This gives headroom for edge cases without allowing runaways.
3. Add Request Count Limits
Even cheap models add up. A gpt-4o-mini call costs ~$0.001, but an agent in a loop making 10,000 calls costs $10. Request limits catch infinite loops that dollar caps might miss (because each individual call is cheap).
```python
# Budget + request limit = double protection
safe_client = guard(client, max_cost=1.00, max_requests=100)
```
Why both? Dollar caps catch expensive calls. Request caps catch cheap-but-infinite loops. You need both.
4. Configure Automatic Model Downgrade
When budget runs low, your agent should degrade gracefully — not crash. Auto-downgrade switches to a cheaper model when you’ve used 70-80% of your budget.
```python
safe_client = guard(
    client,
    max_cost=2.00,
    model_downgrade={
        "gpt-4o": "gpt-4o-mini",
        "claude-3.5-sonnet": "claude-3.5-haiku"
    }
)
```
The tradeoff: Slightly lower quality responses vs. a crashed agent or a blown budget. For 90% of use cases, the cheaper model’s output is good enough for the remaining steps.
5. Implement a Kill Switch
When things go wrong, you need to be able to stop everything. Not "wait for the next billing cycle." Not "redeploy with new config." A kill switch that halts all agent API calls immediately.
```python
safe_client = guard(client, max_cost=5.00)

# In your ops dashboard or alerting system:
if emergency_detected:
    safe_client.kill()  # All subsequent calls raise BudgetExceeded
```
Test your kill switch. Seriously. Run a test where you trigger it mid-workflow and verify the agent stops cleanly. A kill switch that doesn’t work is worse than no kill switch — it gives false confidence.
6. Enforce Least-Privilege Tool Access
Your agent has access to tools: database queries, email sending, file operations, API calls. It should only be able to use the tools it actually needs, and nothing more.
```python
from tokenfence import Policy

policy = Policy()
policy.allow("database:read:*")                 # Can read any table
policy.deny("database:write:users")             # Cannot modify users table
policy.deny("database:drop:*")                  # Cannot drop anything
policy.require_approval("email:send:external")  # Human approval for external emails

# At runtime:
result = policy.check("database:drop:production")
# result.decision == Decision.DENY
```
The Meta lesson: In March 2026, an AI agent at Meta triggered a SEV1 incident by executing database operations it should never have had access to. Prompts are not permissions. Runtime policy enforcement is.
7. Set Up Per-User or Per-Tenant Budgets
If your agent serves multiple users, each user needs their own budget. One user’s heavy usage shouldn’t drain the budget for everyone else.
```python
def handle_user_request(user_id: str, prompt: str):
    user_budget = get_user_budget(user_id)  # e.g., $0.10/request for free tier
    safe_client = guard(client, max_cost=user_budget)
    response = safe_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    log_user_cost(user_id, safe_client.total_cost)
```
Tier your budgets. Free users get $0.10/request. Pro users get $1.00. Enterprise gets $5.00. This isn’t just cost control — it’s your business model.
8. Log Every API Call With Cost Data
You can’t control what you can’t measure. Every LLM API call should log: model used, input/output tokens, cost, latency, and which workflow triggered it.
```python
import logging

safe_client = guard(client, max_cost=2.00)
response = safe_client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

logging.info(f"LLM call: model={response.model}, "
             f"tokens={response.usage.total_tokens}, "
             f"cost=${safe_client.total_cost:.4f}, "
             f"workflow={workflow_id}")
```
Feed this into your observability stack. Datadog, Grafana, or even a simple CSV. The data tells you which workflows are expensive, which users cost the most, and where to optimize.
9. Test With Production-Realistic Inputs
Your agent costs $0.02 per call in dev because your test inputs are short and simple. Production inputs will be longer, more complex, and more varied. Test with real data.
The process:
- Collect 100 representative production inputs (or as close as you can get)
- Run your agent on all 100
- Calculate: P50, P75, P95, P99 costs per request
- Set your budget at 2x P95
- Set your kill switch at 3x P99
If you skip this step, your first day in production will be the test — with real money.
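The steps above reduce to a few lines once you have the sampled costs in a list — a minimal sketch using Python's standard library (the function name is ours, not a TokenFence API):

```python
import statistics

def budget_from_samples(costs: list[float]) -> dict:
    """Derive a workflow budget and kill-switch threshold from sampled costs."""
    # statistics.quantiles with n=100 returns the 1st..99th percentiles.
    q = statistics.quantiles(costs, n=100)
    p95, p99 = q[94], q[98]
    return {
        "p50": statistics.median(costs),
        "p95": p95,
        "p99": p99,
        "budget": round(2 * p95, 2),       # 2x P95 headroom
        "kill_switch": round(3 * p99, 2),  # 3x P99 hard stop
    }
```

Re-run this whenever you change models or prompts; the percentiles shift more than you'd expect.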
10. Handle Budget Exceeded Gracefully
When an agent hits its budget, what happens? If the answer is "it crashes with an unhandled exception," you have a problem.
```python
from tokenfence import guard, BudgetExceeded

safe_client = guard(client, max_cost=1.00)

try:
    response = safe_client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
except BudgetExceeded as e:
    # Return a graceful fallback
    return {
        "status": "budget_exceeded",
        "message": "This request exceeded the cost limit. "
                   "Try a simpler query or contact support.",
        "cost_so_far": e.total_cost
    }
```
Users should never see a 500 error because your agent ran out of budget. Design the failure mode.
11. Separate Dev/Staging/Production Budgets
Your dev environment should have tight budgets ($0.10/request) so developers feel the cost pressure early. Staging should mirror production budgets. Production budgets should be set based on real data (see #9).
```python
import os

BUDGET_MAP = {
    "development": 0.10,
    "staging": 1.00,
    "production": 2.00,
}

env = os.getenv("APP_ENV", "development")
safe_client = guard(client, max_cost=BUDGET_MAP[env])
```
Why dev budgets matter: If developers never see cost pressure, they’ll build agents that make 50 API calls when 5 would suffice. Tight dev budgets force efficient agent design.
12. Implement Rate Limiting (Separate From Cost Limits)
Cost limits and rate limits solve different problems. Cost limits prevent expensive requests. Rate limits prevent high-frequency requests — even if each one is cheap.
You need both:
- Cost limit: "This workflow can’t spend more than $2.00"
- Rate limit: "This user can’t make more than 10 requests per minute"
Rate limiting prevents abuse. Cost limiting prevents accidents. Together, they cover the full risk surface.
13. Set Up Alerts Before You Need Them
Don’t wait for a cost spike to set up monitoring. Before go-live, configure alerts for:
- Single request over $X: "A single API call just cost $3.50" — investigate immediately
- Hourly spend over $Y: "We’ve spent $50 in the last hour" — possible runaway agent
- Daily spend over $Z: "Daily spend is 3x yesterday" — traffic spike or regression
- Error rate over N%: "40% of requests are hitting budget limits" — budgets too tight or agent is misbehaving
Alert fatigue is real. Start with 3-4 high-signal alerts. Add more as you learn your normal patterns.
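Your monitoring stack will have its own alert syntax, but the four thresholds above reduce to a simple comparison. A sketch with illustrative metric names and limits:

```python
def evaluate_alerts(metrics: dict, thresholds: dict) -> list[str]:
    """Compare current metrics to thresholds; return names of triggered alerts."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

# Illustrative thresholds — tune to your own traffic.
thresholds = {
    "max_single_request_cost": 3.00,  # one call cost more than $3
    "hourly_spend": 50.00,            # possible runaway agent
    "daily_spend_ratio": 3.0,         # today's spend vs. yesterday's
    "budget_error_rate": 0.25,        # share of requests hitting limits
}
```

Wire the output into PagerDuty, Slack, or whatever your team actually reads.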
14. Document Your Cost Model
Write down: what each agent workflow costs on average, what the P95 cost is, what the budget is set to, and who owns the budget. Put it in your runbook.
| Workflow | Avg Cost | P95 Cost | Budget Set | Owner |
|---|---|---|---|---|
| Customer support | $0.12 | $0.45 | $1.00 | Support team |
| Code review | $0.80 | $2.10 | $5.00 | Platform team |
| Content generation | $0.25 | $0.90 | $2.00 | Marketing team |
| Data analysis | $1.50 | $4.20 | $10.00 | Data team |
Review this quarterly. Model prices change, agent behaviors evolve, and usage patterns shift. Your cost model should be a living document.
15. Run a Pre-Production Cost Drill
Before go-live, simulate a worst case:
- Runaway agent: Set up an agent that loops forever. Verify the kill switch stops it within your SLA.
- Budget spike: Send 100 concurrent requests. Verify per-request budgets hold under load.
- Model unavailability: Mock a model being down. Verify auto-downgrade kicks in correctly.
- Policy violation: Attempt a tool call that should be denied. Verify the policy engine blocks it.
If any of these fail in a drill, they’ll fail in production. Better to find out now.
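The runaway-agent case can be rehearsed without spending a cent. One way to structure the drill, using a fake client that enforces a request cap (everything here is a stand-in, not the TokenFence API):

```python
class BudgetExceededError(Exception):
    pass

class FakeGuardedClient:
    """Stand-in for a guarded LLM client: raises once the request cap is hit."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.calls = 0

    def complete(self, prompt: str) -> str:
        if self.calls >= self.max_requests:
            raise BudgetExceededError(f"request cap {self.max_requests} reached")
        self.calls += 1
        return "ok"

def run_drill(max_requests: int = 100) -> int:
    """Simulate a runaway loop; return how many calls got through."""
    client = FakeGuardedClient(max_requests)
    try:
        while True:  # the "runaway agent"
            client.complete("loop forever")
    except BudgetExceededError:
        pass  # the guard stopped the loop, as it should
    return client.calls
```

Swap the fake client for your real guarded client in staging and the same harness verifies the production guardrail.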
The Quick Reference Card
| # | Item | Priority | Implementation |
|---|---|---|---|
| 1 | Per-request dollar cap | 🔴 Critical | guard(client, max_cost=X) |
| 2 | Per-workflow budget | 🔴 Critical | guard(client, max_cost=X, max_requests=N) |
| 3 | Request count limits | 🔴 Critical | max_requests=N |
| 4 | Auto model downgrade | 🟡 High | model_downgrade={...} |
| 5 | Kill switch | 🔴 Critical | safe_client.kill() |
| 6 | Least-privilege tools | 🟡 High | Policy.allow/deny/require_approval |
| 7 | Per-user budgets | 🟡 High | Budget by user tier |
| 8 | Cost logging | 🔴 Critical | Log model, tokens, cost per call |
| 9 | Production-realistic testing | 🟡 High | 100 sample inputs, calculate P95 |
| 10 | Graceful budget exceeded | 🟡 High | Catch BudgetExceeded, return fallback |
| 11 | Env-specific budgets | 🟠 Medium | Dev=tight, Staging=mirror, Prod=data-based |
| 12 | Rate limiting | 🟠 Medium | Separate from cost limits |
| 13 | Cost alerts | 🟡 High | Per-request, hourly, daily thresholds |
| 14 | Cost model documentation | 🟠 Medium | Runbook with avg/P95/budget/owner |
| 15 | Pre-production cost drill | 🟠 Medium | Simulate runaway, spike, downgrade, policy |
How Many Can You Check Off Today?
Most teams shipping AI agents to production have 2-3 of these 15 items covered. The teams that avoid cost incidents have 10+.
The good news: items 1-5 take about 15 minutes total with TokenFence. That’s the critical tier — the items that prevent the worst outcomes.
```shell
# Get started in 30 seconds
pip install tokenfence
# or
npm install tokenfence
```
Start with the 🔴 Critical items. Add the 🟡 High items before your first week in production. Add the 🟠 Medium items as you scale.
Your agents are only as safe as the guardrails you put around them. This checklist is your guardrail.
TokenFence is open source (MIT). Community edition is free with zero limits. Pro adds dashboard, alerts, and budget pooling. tokenfence.dev/pricing
Ready to protect your AI budget?
Two lines of code. Per-workflow budgets. Automatic model downgrade. Hard kill switch.