# TL;DR

  • AI agents can burn 10-100x more tokens than simple chat completions
  • Three cost killers: tool loops, context accumulation, retry storms
  • Fix #1: Set hard limits on iterations (max 5-10 tool calls per task)
  • Fix #2: Compress conversation history, don't just append
  • Fix #3: Use cheaper models for tool selection, expensive models for final output
  • Monitor tokens per task, not just requests per day

# Who This Is For

Developers building agentic AI applications (AutoGPT-style, LangChain agents, custom tool-use workflows). You've noticed your API bill spike unexpectedly and want to understand why.

# Assumptions & Inputs

  • Agent framework: LangChain, AutoGen, CrewAI, or custom
  • Tools: web search, code execution, database queries, file operations
  • Model: GPT-4o or Claude 3.5 Sonnet
  • Use case: research, coding, data analysis, or multi-step tasks

# The $50 Wake-Up Call

I built a "research agent" that could browse the web, take notes, and synthesize reports. It worked beautifully in demos.

Then I let it research a moderately complex topic overnight.

Morning surprise: $47.23 in API charges. For one research task.

The agent had:

  • Made 127 tool calls
  • Accumulated 890K tokens of context
  • Retry-looped on 3 failed web requests
  • Generated a 4-page report I could have written in 30 minutes

This is the dark side of agentic AI that framework tutorials don't emphasize: agents are token incinerators by design.


# Why Agents Cost So Much

# Problem #1: The Tool Loop Trap

Every time an agent uses a tool, it's a full LLM call:

  1. Send current context + tool definitions
  2. LLM decides which tool to call
  3. Execute tool, get result
  4. Send context + tool result back to LLM
  5. Repeat until task complete
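This loop can be sketched in a few lines. Everything here is a stand-in: `call_llm` and `run_tool` are hypothetical helpers standing in for a real chat-completions call and real tool execution:

```python
# Minimal agent-loop sketch. `call_llm` and `run_tool` are hypothetical
# stand-ins; a real implementation would call the chat completions API.
calls = {"n": 0}

def call_llm(messages):
    # Stand-in: a real call sends the FULL message list plus tool
    # definitions on every iteration -- that's where the cost comes from.
    calls["n"] += 1
    if calls["n"] < 3:
        return {"tool": "search", "content": None}    # pick a tool twice
    return {"tool": None, "content": "final answer"}  # then conclude

def run_tool(name, args):
    return f"result of {name}"

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)            # full context sent each step
        if reply["tool"] is None:             # model answered directly
            return reply["content"]
        result = run_tool(reply["tool"], {})  # execute the chosen tool
        messages.append({"role": "tool", "content": result})  # context grows
    return "max steps reached"
```

Note that `messages` only ever grows: each iteration resends everything accumulated so far.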

A "simple" 5-tool task might look like this:

| Step | Input Tokens | Output Tokens |
|---|---:|---:|
| Initial prompt | 2,000 | 150 |
| Tool call #1 | 2,500 | 200 |
| Tool call #2 | 4,000 | 180 |
| Tool call #3 | 6,500 | 250 |
| Tool call #4 | 9,000 | 300 |
| Tool call #5 | 12,000 | 400 |
| Final response | 14,000 | 1,500 |
| **Total** | **50,000** | **2,980** |

At GPT-4o prices ($2.50 input / $10.00 output per 1M tokens), that's $0.155 per task—and this is a simple example.
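The table's totals and that per-task figure can be reproduced in a few lines:

```python
# Reproduce the cost of the step table above at GPT-4o list prices
# ($2.50 per 1M input tokens, $10.00 per 1M output tokens).
steps = [  # (input_tokens, output_tokens) per LLM call
    (2_000, 150), (2_500, 200), (4_000, 180), (6_500, 250),
    (9_000, 300), (12_000, 400), (14_000, 1_500),
]
input_total = sum(i for i, _ in steps)   # 50,000
output_total = sum(o for _, o in steps)  # 2,980
cost = input_total / 1e6 * 2.50 + output_total / 1e6 * 10.00
print(f"${cost:.3f}")  # → $0.155
```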

# Problem #2: Context Accumulation

Most agent frameworks append every observation to the conversation:

```
System: You are a research agent...
User: Research topic X
Assistant: I'll search for information...
Tool: [search result - 2,000 tokens]
Assistant: Found some info, let me dig deeper...
Tool: [another search - 3,000 tokens]
Assistant: Now let me check this source...
Tool: [web scrape - 5,000 tokens]
...and so on
```

By step 10, you're sending 50K+ tokens of context with every LLM call. The context grows linearly with each step, so each call costs more than the last—and the cumulative cost of the whole task grows quadratically.
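You can see the quadratic blow-up with a quick back-of-the-envelope calculation. The 5K-tokens-per-step figure here is an assumption for illustration, not a measurement:

```python
# Cumulative input tokens when each tool result appends ~5K tokens
# of context (illustrative numbers).
def total_input_tokens(steps, base=2_000, per_step=5_000):
    # Step k resends the base prompt plus all k prior tool results.
    return sum(base + k * per_step for k in range(steps))

print(total_input_tokens(5))   # 5 steps: 60,000 input tokens total
print(total_input_tokens(10))  # 2x the steps, ~4x the tokens
```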

# Problem #3: Retry Storms

When tools fail (API timeout, rate limit, parsing error), naive agents retry with the full context:

```
Attempt 1: 20K tokens → timeout
Attempt 2: 20K tokens → timeout
Attempt 3: 20K tokens → success
```

Three attempts = 3x the cost, with no new information gained.
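One mitigation, sketched below, is to retry only the failed tool call itself—with exponential backoff—instead of replaying the full 20K-token conversation through the LLM:

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0):
    # Retry a flaky tool call with exponential backoff. Only the tool
    # call is retried; the LLM conversation is never resent.
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```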


# The Fixes

# Fix #1: Hard Iteration Limits

Set a maximum number of tool calls per task. Period.

```python
MAX_ITERATIONS = 8

for i in range(MAX_ITERATIONS):
    response = agent.step()
    if response.is_final:
        break
else:
    # The loop's `else` runs only if we never hit `break`:
    # force a conclusion after max iterations
    response = agent.force_conclude()
```

In my experience, if an agent can't solve a task in 8-10 tool calls, it's probably stuck in a loop or the task needs to be broken down.

# Fix #2: Context Compression

Don't append raw tool outputs. Summarize them.

Before: Append 5,000-token web page to context

After: Extract relevant 200-token summary, discard the rest

```python
def compress_tool_result(result: str, max_tokens: int = 500) -> str:
    # Rough heuristic: ~4 characters per token for English text.
    estimated_tokens = len(result) // 4
    if estimated_tokens > max_tokens:
        # Use a cheap model to summarize instead of appending raw output
        return summarize_with_mini(result, max_tokens)
    return result
```

This alone can reduce context growth by 80-90%.

# Fix #3: Model Routing

Use cheap models for tool selection, expensive models for final output.

| Task | Model | Why |
|---|---|---|
| Tool selection | GPT-4o-mini | Just picking from a list |
| Parameter extraction | GPT-4o-mini | Structured output |
| Web content parsing | GPT-4o-mini | Simple extraction |
| Final synthesis | GPT-4o / Claude | Needs quality |

This can cut per-task costs by 60-70%.
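A routing layer can be as simple as a dict lookup. The task labels here are hypothetical; use whatever step types your agent distinguishes:

```python
# Route each step type to the cheapest model that can handle it.
# Task labels are illustrative, not from any particular framework.
MODEL_ROUTES = {
    "tool_selection": "gpt-4o-mini",
    "parameter_extraction": "gpt-4o-mini",
    "content_parsing": "gpt-4o-mini",
    "final_synthesis": "gpt-4o",
}

def pick_model(task_type: str) -> str:
    # Default to the cheap model for anything unclassified.
    return MODEL_ROUTES.get(task_type, "gpt-4o-mini")
```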

# Fix #4: Batch Tool Calls

If your agent needs to search 5 things, do it in one tool call:

Expensive pattern:

```
→ Search "topic A" → result
→ Search "topic B" → result
→ Search "topic C" → result
(3 LLM round-trips)
```

Cheaper pattern:

```
→ Search ["topic A", "topic B", "topic C"] → [results]
(1 LLM round-trip)
```

Many APIs support batch operations. Use them.
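Even when the backend has no batch endpoint, you can still accept a list in one tool call and fan out the I/O yourself. A sketch, with `search` standing in for a real search API:

```python
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> str:
    # Stand-in for a real search API call.
    return f"results for {query}"

def batch_search(queries: list[str]) -> list[str]:
    # One tool call from the LLM's perspective; the I/O-bound
    # fan-out happens here, with no extra LLM round-trips.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(search, queries))
```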

# Fix #5: Caching Tool Results

If multiple agents or runs might need the same information, cache it:

```python
@cache(ttl=3600)  # illustrative: use e.g. cachetools' TTLCache, or
def web_search(query):  # functools.lru_cache if staleness is acceptable
    return search_api.search(query)
```

Web searches, database queries, and file reads are prime candidates.


# Cost-Aware Architecture Patterns

# Pattern 1: Planner-Executor Split

Instead of letting an agent freely explore, use a two-phase approach:

  1. Planner (1 LLM call): Given the task, output a step-by-step plan
  2. Executor (N tool calls): Execute each step with minimal context

This prevents context bloat and gives you predictable costs.
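A minimal sketch of the split, with `call_llm` and `run_step` as hypothetical stand-ins for the planner call and step execution:

```python
# Planner-executor sketch. `call_llm` and `run_step` are stand-ins.
def call_llm(prompt: str) -> str:
    return "1. search\n2. summarize"  # stand-in numbered plan

def run_step(step: str, task: str) -> str:
    return f"did: {step}"

def plan_and_execute(task: str) -> list[str]:
    plan = call_llm(f"Write a numbered plan for: {task}")  # 1 LLM call
    steps = [line.split(". ", 1)[1] for line in plan.splitlines()]
    # Each step runs with minimal context -- the task plus one step,
    # never the whole accumulated conversation.
    return [run_step(step, task) for step in steps]
```

Because the plan is fixed up front, the cost ceiling is known before execution starts: one planner call plus one call per step.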

# Pattern 2: Hierarchical Agents

For complex tasks, use a hierarchy:

  • Manager Agent: Coordinates, uses GPT-4o
  • Worker Agents: Execute specific subtasks, use GPT-4o-mini

The manager handles high-level reasoning; workers handle grunt work.

# Pattern 3: Human-in-the-Loop Checkpoints

For expensive operations, pause and ask:

```
Agent: I'm about to perform 15 web searches. Estimated cost: $0.50
       Proceed? [Y/n]
```

This prevents runaway costs and gives users control.
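A checkpoint like that might look like the sketch below; the $0.25 threshold is an arbitrary example value:

```python
def confirm_spend(description: str, estimated_cost: float,
                  threshold: float = 0.25) -> bool:
    # Auto-approve cheap operations; ask a human above the threshold.
    # The 0.25 default is an illustrative number, not a recommendation.
    if estimated_cost < threshold:
        return True
    answer = input(f"{description} (est. ${estimated_cost:.2f}). "
                   "Proceed? [Y/n] ")
    return answer.strip().lower() in ("", "y", "yes")
```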


# Monitoring: What to Track

Don't just track "requests per day." Track:

  1. Tokens per task (not per request)
  2. Tool calls per task
  3. Context size at each step
  4. Retry rate
  5. Cost per successful task completion

Set alerts for anomalies. A sudden spike in tool calls often indicates a loop.
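A per-task accumulator covering most of these metrics can be this small (prices are the GPT-4o figures used earlier in this post):

```python
from dataclasses import dataclass

@dataclass
class TaskMetrics:
    # Track per-TASK totals, not per-request: one task = many LLM calls.
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: int = 0
    retries: int = 0

    def record_call(self, inp: int, out: int, is_retry: bool = False):
        self.input_tokens += inp
        self.output_tokens += out
        self.tool_calls += 1
        self.retries += int(is_retry)

    def cost(self, in_price=2.50, out_price=10.00) -> float:
        # Prices in $ per 1M tokens (GPT-4o list prices in this post).
        return (self.input_tokens * in_price
                + self.output_tokens * out_price) / 1e6
```

Call `record_call` after every LLM round-trip, then alert when `tool_calls` or `cost()` for a single task crosses your budget.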


# Real-World Cost Comparison

Same research task, three approaches:

| Approach | Tool Calls | Total Tokens | Cost |
|---|---:|---:|---:|
| Naive agent | 45 | 380K | $1.52 |
| With iteration limit (10) | 10 | 85K | $0.34 |
| Optimized (all fixes) | 8 | 28K | $0.11 |
14x cost reduction from the naive approach—and the optimized version often produces better results because it's forced to be focused.


# Framework-Specific Tips

# LangChain

  • Use max_iterations parameter
  • Implement custom OutputParser to compress results
  • Consider RunnableWithMessageHistory with summarization

# AutoGen

  • Set max_consecutive_auto_reply
  • Use GroupChat with speaker selection for routing
  • Implement cost tracking via get_total_usage()

# CrewAI

  • Set max_iter on agents
  • Use memory features wisely (can accumulate tokens)
  • Route via manager_llm vs agent llm

# Calculate Your Agent Costs

Before building, estimate. After building, measure.


# Conclusion

AI agents are powerful, but they're also expensive by nature. Every tool call is a full LLM invocation with growing context.

The fixes aren't complicated:

  1. Limit iterations (hard cap at 8-10)
  2. Compress context (summarize, don't append)
  3. Route by task (cheap models for simple decisions)
  4. Batch operations (fewer round-trips)
  5. Monitor tokens per task (not just API calls)

For more on optimizing LLM costs, see Batch vs Live API and Prompt Caching guide.


TokenBurner Team

AI Infrastructure Engineers

Engineers with hands-on experience building production AI systems. We've debugged runaway agent costs and learned how to build efficient agentic workflows.
