# TL;DR
- AI agents can burn 10-100x more tokens than simple chat completions
- Three cost killers: tool loops, context accumulation, retry storms
- Fix #1: Set hard limits on iterations (max 5-10 tool calls per task)
- Fix #2: Compress conversation history, don't just append
- Fix #3: Use cheaper models for tool selection, expensive models for final output
- Monitor tokens per task, not just requests per day
# Who This Is For
Developers building agentic AI applications (AutoGPT-style, LangChain agents, custom tool-use workflows). You've noticed your API bill spike unexpectedly and want to understand why.
# Assumptions & Inputs
- Agent framework: LangChain, AutoGen, CrewAI, or custom
- Tools: web search, code execution, database queries, file operations
- Model: GPT-4o or Claude 3.5 Sonnet
- Use case: research, coding, data analysis, or multi-step tasks
# The $50 Wake-Up Call
I built a "research agent" that could browse the web, take notes, and synthesize reports. It worked beautifully in demos.
Then I let it research a moderately complex topic overnight.
Morning surprise: $47.23 in API charges. For one research task.
The agent had:
- Made 127 tool calls
- Accumulated 890K tokens of context
- Retry-looped on 3 failed web requests
- Generated a 4-page report I could have written in 30 minutes
This is the dark side of agentic AI that framework tutorials don't emphasize: agents are token incinerators by design.
# Why Agents Cost So Much
## Problem #1: The Tool Loop Trap
Every time an agent uses a tool, it's a full LLM call:
- Send current context + tool definitions
- LLM decides which tool to call
- Execute tool, get result
- Send context + tool result back to LLM
- Repeat until task complete
A "simple" 5-tool task might look like this:
| Step | Input Tokens | Output Tokens |
|---|---|---|
| Initial prompt | 2,000 | 150 |
| Tool call #1 | 2,500 | 200 |
| Tool call #2 | 4,000 | 180 |
| Tool call #3 | 6,500 | 250 |
| Tool call #4 | 9,000 | 300 |
| Tool call #5 | 12,000 | 400 |
| Final response | 14,000 | 1,500 |
| Total | 50,000 | 2,980 |
At GPT-4o prices ($2.50 per 1M input tokens, $10.00 per 1M output tokens), that's about $0.155 per task, and this is a simple example.
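The table's totals are easy to sanity-check in a few lines. The prices are the GPT-4o rates quoted above, and the per-step token counts come straight from the table:

```python
# Reproduce the cost arithmetic from the table above.
INPUT_PRICE_PER_M = 2.50   # $ per 1M input tokens (GPT-4o)
OUTPUT_PRICE_PER_M = 10.00  # $ per 1M output tokens (GPT-4o)

steps = [            # (input_tokens, output_tokens)
    (2_000, 150),    # initial prompt
    (2_500, 200),    # tool call #1
    (4_000, 180),    # tool call #2
    (6_500, 250),    # tool call #3
    (9_000, 300),    # tool call #4
    (12_000, 400),   # tool call #5
    (14_000, 1_500), # final response
]

total_in = sum(i for i, _ in steps)   # 50,000
total_out = sum(o for _, o in steps)  # 2,980
cost = (total_in / 1e6 * INPUT_PRICE_PER_M
        + total_out / 1e6 * OUTPUT_PRICE_PER_M)
print(f"${cost:.3f}")  # → $0.155
```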
## Problem #2: Context Accumulation
Most agent frameworks append every observation to the conversation:
```
System: You are a research agent...
User: Research topic X
Assistant: I'll search for information...
Tool: [search result - 2,000 tokens]
Assistant: Found some info, let me dig deeper...
Tool: [another search - 3,000 tokens]
Assistant: Now let me check this source...
Tool: [web scrape - 5,000 tokens]
...and so on
```
By step 10, you're sending 50K+ tokens of context with every LLM call. Per-step context grows linearly, but the total tokens billed across the task grow quadratically, because each step re-sends everything the previous steps accumulated.
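To see the quadratic blow-up concretely, here is a toy model that assumes each step appends a fixed-size observation (2,000 tokens is an arbitrary illustrative figure, not a measurement):

```python
OBS_TOKENS = 2_000  # illustrative size of one appended tool observation

def total_input_tokens(steps: int) -> int:
    # Step k re-sends everything accumulated so far: k * OBS_TOKENS.
    # Summing over all steps gives quadratic growth in `steps`.
    return sum(k * OBS_TOKENS for k in range(1, steps + 1))

print(total_input_tokens(10))  # 110,000 billed input tokens for 10 steps
print(total_input_tokens(20))  # 420,000: double the steps, nearly 4x the tokens
```

Doubling the steps roughly quadruples the billed input tokens, which is why long agent runs get expensive so fast.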
## Problem #3: Retry Storms
When tools fail (API timeout, rate limit, parsing error), naive agents retry with the full context:
```
Attempt 1: 20K tokens → timeout
Attempt 2: 20K tokens → timeout
Attempt 3: 20K tokens → success
```
Three attempts = 3x the cost, with no new information gained.
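One way to avoid paying for the full context three times is to retry at the tool layer, below the LLM loop. This is a hedged sketch with hypothetical names (`tool_fn`, agent wiring omitted): it retries only the failing call, with exponential backoff, and the LLM context is untouched while it does.

```python
import time

def call_tool_with_retry(tool_fn, args, max_retries=3, base_delay=1.0):
    """Retry only the failing *tool call*, never the full LLM round-trip.

    We re-enter the agent loop only once the tool has succeeded or we
    give up, so no context tokens are re-billed during retries.
    """
    for attempt in range(max_retries):
        try:
            return tool_fn(**args)
        except TimeoutError:  # stand-in: catch whatever transient errors your tools raise
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```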
# The Fixes
## Fix #1: Hard Iteration Limits
Set a maximum number of tool calls per task. Period.
```python
MAX_ITERATIONS = 8

for i in range(MAX_ITERATIONS):
    response = agent.step()
    if response.is_final:
        break
else:
    # Loop exhausted without a final answer: force a conclusion
    response = agent.force_conclude()
```
In my experience, if an agent can't solve a task in 8-10 tool calls, it's probably stuck in a loop or the task needs to be broken down.
## Fix #2: Context Compression
Don't append raw tool outputs. Summarize them.
Before: Append 5,000-token web page to context
After: Extract relevant 200-token summary, discard the rest
```python
def compress_tool_result(result, max_tokens=500):
    # Note: len() counts characters, not tokens; swap in a tokenizer
    # (e.g. tiktoken) if you need an accurate threshold.
    if len(result) > max_tokens:
        # Use a cheap model to summarize
        summary = summarize_with_mini(result, max_tokens)
        return summary
    return result
```
This alone can reduce context growth by 80-90%.
## Fix #3: Model Routing
Use cheap models for tool selection, expensive models for final output.
| Task | Model | Why |
|---|---|---|
| Tool selection | GPT-4o-mini | Just picking from a list |
| Parameter extraction | GPT-4o-mini | Structured output |
| Web content parsing | GPT-4o-mini | Simple extraction |
| Final synthesis | GPT-4o / Claude | Needs quality |
This can cut per-task costs by 60-70%.
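A router for this can be as small as a lookup. The step names and model IDs below mirror the table and are illustrative; adapt them to whatever your provider exposes:

```python
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

# Mechanical steps that don't need the strong model (illustrative names).
CHEAP_STEPS = {"tool_selection", "parameter_extraction", "content_parsing"}

def pick_model(step: str) -> str:
    # Default to the strong model only for steps that need quality.
    return CHEAP_MODEL if step in CHEAP_STEPS else STRONG_MODEL

print(pick_model("tool_selection"))   # → gpt-4o-mini
print(pick_model("final_synthesis"))  # → gpt-4o
```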
## Fix #4: Batch Tool Calls
If your agent needs to search 5 things, do it in one tool call:
Expensive pattern (3 LLM round-trips):

```
→ Search "topic A" → result
→ Search "topic B" → result
→ Search "topic C" → result
```

Cheaper pattern (1 LLM round-trip):

```
→ Search ["topic A", "topic B", "topic C"] → [results]
```
Many APIs support batch operations. Use them.
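A batched tool can wrap whatever single-query search you already have. `search_fn` here is a stand-in, not a real API:

```python
def batch_search(queries, search_fn):
    """Run several searches inside ONE tool call.

    The fan-out to the search backend happens here, outside the LLM
    loop, so the agent spends one round-trip instead of len(queries).
    """
    return {q: search_fn(q) for q in queries}

# Usage with a dummy search function standing in for a real backend:
results = batch_search(["topic A", "topic B", "topic C"],
                       lambda q: f"result for {q}")
```

The important part is the tool schema: expose a list-of-queries parameter to the model so it can request all three in a single call.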
## Fix #5: Caching Tool Results
If multiple agents or runs might need the same information, cache it:
```python
@cache(ttl=3600)
def web_search(query):
    return search_api.search(query)
```
Web searches, database queries, and file reads are prime candidates.
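The `@cache(ttl=3600)` decorator above assumes a caching helper (cachetools-style); Python's standard library doesn't ship one with a TTL, but a minimal hand-rolled equivalent looks like this:

```python
import time
from functools import wraps

def ttl_cache(ttl=3600):
    """Minimal TTL cache for hashable positional args (stdlib only)."""
    def decorator(fn):
        store = {}  # args -> (result, timestamp)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[1] < ttl:
                return hit[0]  # fresh enough: skip the real call
            result = fn(*args)
            store[args] = (result, now)
            return result
        return wrapper
    return decorator
```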
# Cost-Aware Architecture Patterns
## Pattern 1: Planner-Executor Split
Instead of letting an agent freely explore, use a two-phase approach:
- Planner (1 LLM call): Given the task, output a step-by-step plan
- Executor (N tool calls): Execute each step with minimal context
This prevents context bloat and gives you predictable costs.
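A minimal sketch of the split, with hypothetical `llm_call` and `run_step` callables. Plan parsing is elided: the planner is assumed to return a list of step strings.

```python
def run_planner_executor(task, llm_call, run_step):
    # Phase 1: ONE LLM call produces a fixed plan (a list of steps).
    plan = llm_call(f"Break this task into numbered steps: {task}")
    results = []
    for step in plan:
        # Phase 2: each step sees only the task and its own step, not
        # the whole transcript, so context stays flat instead of growing.
        results.append(run_step(task, step))
    return results
```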
## Pattern 2: Hierarchical Agents
For complex tasks, use a hierarchy:
- Manager Agent: Coordinates, uses GPT-4o
- Worker Agents: Execute specific subtasks, use GPT-4o-mini
The manager handles high-level reasoning; workers handle grunt work.
## Pattern 3: Human-in-the-Loop Checkpoints
For expensive operations, pause and ask:
```
Agent: I'm about to perform 15 web searches. Estimated cost: $0.50
Proceed? [Y/n]
```
This prevents runaway costs and gives users control.
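A checkpoint can be a small guard function. The per-search cost and the $0.25 threshold below are illustrative, and `confirm` is injectable so it can be a CLI prompt, a Slack ping, or auto-approve in tests:

```python
COST_PER_SEARCH = 0.033  # illustrative per-search cost estimate, in dollars

def checkpoint(n_operations, confirm, threshold=0.25):
    """Ask for confirmation only when the estimated cost passes a threshold."""
    estimated = n_operations * COST_PER_SEARCH
    if estimated <= threshold:
        return True  # cheap enough: no need to interrupt the user
    return confirm(f"About to run {n_operations} searches, "
                   f"est. ${estimated:.2f}. Proceed?")
```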
# Monitoring: What to Track
Don't just track "requests per day." Track:
- Tokens per task (not per request)
- Tool calls per task
- Context size at each step
- Retry rate
- Cost per successful task completion
Set alerts for anomalies. A sudden spike in tool calls often indicates a loop.
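A per-task tracker for these metrics fits in a small dataclass. All names here are illustrative; `record_step` would be called wherever your agent makes an LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class TaskMetrics:
    tokens: int = 0
    tool_calls: int = 0
    retries: int = 0
    context_sizes: list = field(default_factory=list)  # input tokens per step

    def record_step(self, input_tokens, output_tokens, is_retry=False):
        self.tokens += input_tokens + output_tokens
        self.tool_calls += 1
        self.retries += int(is_retry)
        self.context_sizes.append(input_tokens)

    def looks_like_a_loop(self, max_tool_calls=10):
        # A sudden pile-up of tool calls is the loop signal described above.
        return self.tool_calls > max_tool_calls
```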
# Real-World Cost Comparison
Same research task, three approaches:
| Approach | Tool Calls | Total Tokens | Cost |
|---|---|---|---|
| Naive agent | 45 | 380K | $1.52 |
| With iteration limit (10) | 10 | 85K | $0.34 |
| Optimized (all fixes) | 8 | 28K | $0.11 |
14x cost reduction from the naive approach—and the optimized version often produces better results because it's forced to be focused.
# Framework-Specific Tips
## LangChain
- Use the `max_iterations` parameter
- Implement a custom `OutputParser` to compress results
- Consider `RunnableWithMessageHistory` with summarization
## AutoGen
- Set `max_consecutive_auto_reply`
- Use `GroupChat` with speaker selection for routing
- Implement cost tracking via `get_total_usage()`
## CrewAI
- Set `max_iter` on agents
- Use `memory` features wisely (they can accumulate tokens)
- Route via `manager_llm` vs the agent-level `llm`
# Calculate Your Agent Costs
Before building, estimate. After building, measure.
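A rough pre-build estimator, assuming context grows by a fixed amount per step. All defaults are illustrative inputs; the prices are the GPT-4o rates quoted earlier:

```python
def estimate_task_cost(tool_calls, base_context=2_000, growth_per_step=2_000,
                       output_per_step=250, in_price=2.50, out_price=10.00):
    """Project per-task cost in dollars before you build (back-of-envelope)."""
    # Step k re-sends the base context plus k appended observations.
    input_tokens = sum(base_context + k * growth_per_step
                       for k in range(tool_calls + 1))
    output_tokens = output_per_step * (tool_calls + 1)
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

print(f"${estimate_task_cost(8):.2f}")  # an 8-tool-call task under these assumptions
```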
# Conclusion
AI agents are powerful, but they're also expensive by nature. Every tool call is a full LLM invocation with growing context.
The fixes aren't complicated:
- Limit iterations (hard cap at 8-10)
- Compress context (summarize, don't append)
- Route by task (cheap models for simple decisions)
- Batch operations (fewer round-trips)
- Monitor tokens per task (not just API calls)
For more on optimizing LLM costs, see our Batch vs Live API and Prompt Caching guides.
TokenBurner Team
AI Infrastructure Engineers
Engineers with hands-on experience building production AI systems. We've debugged runaway agent costs and learned how to build efficient agentic workflows.