# TL;DR
- AI agents can burn 10-100x more tokens than simple chat completions
- Three cost killers: tool loops, context accumulation, retry storms
- Fix #1: Set hard limits on iterations (max 5-10 tool calls per task)
- Fix #2: Compress conversation history, don't just append
- Fix #3: Use cheaper models for tool selection, expensive models for final output
- Monitor tokens per task, not just requests per day
# Who This Is For
Developers building agentic AI applications (AutoGPT-style, LangChain agents, custom tool-use workflows). You've noticed your API bill spike unexpectedly and want to understand why.
# Assumptions & Inputs
- Agent framework: LangChain, AutoGen, CrewAI, or custom
- Tools: web search, code execution, database queries, file operations
- Model: GPT-4o or Claude 3.5 Sonnet
- Use case: research, coding, data analysis, or multi-step tasks
# The $50 Wake-Up Call
I built a "research agent" that could browse the web, take notes, and synthesize reports. It worked beautifully in demos.
Then I let it research a moderately complex topic overnight.
Morning surprise: $47.23 in API charges. For one research task.
The agent had:
- Made 127 tool calls
- Accumulated 890K tokens of context
- Retry-looped on 3 failed web requests
- Generated a 4-page report I could have written in 30 minutes
This is the dark side of agentic AI that framework tutorials don't emphasize: agents are token incinerators by design.
# Why Agents Cost So Much
## Problem #1: The Tool Loop Trap
Every time an agent uses a tool, it's a full LLM call:
- Send current context + tool definitions
- LLM decides which tool to call
- Execute tool, get result
- Send context + tool result back to LLM
- Repeat until task complete
A "simple" 5-tool task might look like this:
| Step | Input Tokens | Output Tokens |
|---|---|---|
| Initial prompt | 2,000 | 150 |
| Tool call #1 | 2,500 | 200 |
| Tool call #2 | 4,000 | 180 |
| Tool call #3 | 6,500 | 250 |
| Tool call #4 | 9,000 | 300 |
| Tool call #5 | 12,000 | 400 |
| Final response | 14,000 | 1,500 |
| Total | 50,000 | 2,980 |
At GPT-4o prices ($2.50 per 1M input tokens, $10.00 per 1M output tokens), that's about $0.155 per task, and this is a simple example.
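The table's totals are easy to sanity-check in a few lines. The prices are the GPT-4o rates quoted above, and the per-step token counts come straight from the table:

```python
# Reproduce the cost arithmetic from the table above.
INPUT_PRICE_PER_M = 2.50   # $ per 1M input tokens (GPT-4o)
OUTPUT_PRICE_PER_M = 10.00  # $ per 1M output tokens (GPT-4o)

steps = [            # (input_tokens, output_tokens)
    (2_000, 150),    # initial prompt
    (2_500, 200),    # tool call #1
    (4_000, 180),    # tool call #2
    (6_500, 250),    # tool call #3
    (9_000, 300),    # tool call #4
    (12_000, 400),   # tool call #5
    (14_000, 1_500), # final response
]

total_in = sum(i for i, _ in steps)   # 50,000
total_out = sum(o for _, o in steps)  # 2,980
cost = (total_in / 1e6 * INPUT_PRICE_PER_M
        + total_out / 1e6 * OUTPUT_PRICE_PER_M)
print(f"${cost:.3f}")  # → $0.155
```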
## Problem #2: Context Accumulation
Most agent frameworks append every observation to the conversation:
```
System: You are a research agent...
User: Research topic X
Assistant: I'll search for information...
Tool: [search result - 2,000 tokens]
Assistant: Found some info, let me dig deeper...
Tool: [another search - 3,000 tokens]
Assistant: Now let me check this source...
Tool: [web scrape - 5,000 tokens]
...and so on
```
By step 10, you're sending 50K+ tokens of context with every LLM call. Per-step context grows linearly, but the total tokens billed across the task grow quadratically, because each step re-sends everything the previous steps accumulated.
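To see the quadratic blow-up concretely, here is a toy model that assumes each step appends a fixed-size observation (2,000 tokens is an arbitrary illustrative figure, not a measurement):

```python
OBS_TOKENS = 2_000  # illustrative size of one appended tool observation

def total_input_tokens(steps: int) -> int:
    # Step k re-sends everything accumulated so far: k * OBS_TOKENS.
    # Summing over all steps gives quadratic growth in `steps`.
    return sum(k * OBS_TOKENS for k in range(1, steps + 1))

print(total_input_tokens(10))  # 110,000 billed input tokens for 10 steps
print(total_input_tokens(20))  # 420,000: double the steps, nearly 4x the tokens
```

Doubling the steps roughly quadruples the billed input tokens, which is why long agent runs get expensive so fast.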
## Problem #3: Retry Storms
When tools fail (API timeout, rate limit, parsing error), naive agents retry with the full context:
```
Attempt 1: 20K tokens → timeout
Attempt 2: 20K tokens → timeout
Attempt 3: 20K tokens → success
```
Three attempts = 3x the cost, with no new information gained.
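One way to avoid paying for the full context three times is to retry at the tool layer, below the LLM loop. This is a hedged sketch with hypothetical names (`tool_fn`, agent wiring omitted): it retries only the failing call, with exponential backoff, and the LLM context is untouched while it does.

```python
import time

def call_tool_with_retry(tool_fn, args, max_retries=3, base_delay=1.0):
    """Retry only the failing *tool call*, never the full LLM round-trip.

    We re-enter the agent loop only once the tool has succeeded or we
    give up, so no context tokens are re-billed during retries.
    """
    for attempt in range(max_retries):
        try:
            return tool_fn(**args)
        except TimeoutError:  # stand-in: catch whatever transient errors your tools raise
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```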
# The Fixes
## Fix #1: Hard Iteration Limits
Set a maximum number of tool calls per task. Period.
```python
MAX_ITERATIONS = 8

for i in range(MAX_ITERATIONS):
    response = agent.step()
    if response.is_final:
        break
else:
    # Loop exhausted without a final answer: force a conclusion
    response = agent.force_conclude()
```
In my experience, if an agent can't solve a task in 8-10 tool calls, it's probably stuck in a loop or the task needs to be broken down.
## Fix #2: Context Compression
Don't append raw tool outputs. Summarize them.
Before: Append 5,000-token web page to context
After: Extract relevant 200-token summary, discard the rest
```python
def compress_tool_result(result, max_tokens=500):
    # Note: len() counts characters, not tokens; swap in a tokenizer
    # (e.g. tiktoken) if you need an accurate threshold.
    if len(result) > max_tokens:
        # Use a cheap model to summarize
        summary = summarize_with_mini(result, max_tokens)
        return summary
    return result
```
This alone can reduce context growth by 80-90%.
## Fix #3: Model Routing
Use cheap models for tool selection, expensive models for final output.
| Task | Model | Why |
|---|---|---|
| Tool selection | GPT-4o-mini | Just picking from a list |
| Parameter extraction | GPT-4o-mini | Structured output |
| Web content parsing | GPT-4o-mini | Simple extraction |
| Final synthesis | GPT-4o / Claude | Needs quality |
This can cut per-task costs by 60-70%.
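A router for this can be as small as a lookup. The step names and model IDs below mirror the table and are illustrative; adapt them to whatever your provider exposes:

```python
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

# Mechanical steps that don't need the strong model (illustrative names).
CHEAP_STEPS = {"tool_selection", "parameter_extraction", "content_parsing"}

def pick_model(step: str) -> str:
    # Default to the strong model only for steps that need quality.
    return CHEAP_MODEL if step in CHEAP_STEPS else STRONG_MODEL

print(pick_model("tool_selection"))   # → gpt-4o-mini
print(pick_model("final_synthesis"))  # → gpt-4o
```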
## Fix #4: Batch Tool Calls
If your agent needs to search 5 things, do it in one tool call:
Expensive pattern (3 LLM round-trips):

```
→ Search "topic A" → result
→ Search "topic B" → result
→ Search "topic C" → result
```

Cheaper pattern (1 LLM round-trip):

```
→ Search ["topic A", "topic B", "topic C"] → [results]
```
Many APIs support batch operations. Use them.
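A batched tool can wrap whatever single-query search you already have. `search_fn` here is a stand-in, not a real API:

```python
def batch_search(queries, search_fn):
    """Run several searches inside ONE tool call.

    The fan-out to the search backend happens here, outside the LLM
    loop, so the agent spends one round-trip instead of len(queries).
    """
    return {q: search_fn(q) for q in queries}

# Usage with a dummy search function standing in for a real backend:
results = batch_search(["topic A", "topic B", "topic C"],
                       lambda q: f"result for {q}")
```

The important part is the tool schema: expose a list-of-queries parameter to the model so it can request all three in a single call.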
## Fix #5: Caching Tool Results
If multiple agents or runs might need the same information, cache it:
```python
@cache(ttl=3600)
def web_search(query):
    return search_api.search(query)
```
Web searches, database queries, and file reads are prime candidates.
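The `@cache(ttl=3600)` decorator above assumes a caching helper (cachetools-style); Python's standard library doesn't ship one with a TTL, but a minimal hand-rolled equivalent looks like this:

```python
import time
from functools import wraps

def ttl_cache(ttl=3600):
    """Minimal TTL cache for hashable positional args (stdlib only)."""
    def decorator(fn):
        store = {}  # args -> (result, timestamp)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[1] < ttl:
                return hit[0]  # fresh enough: skip the real call
            result = fn(*args)
            store[args] = (result, now)
            return result
        return wrapper
    return decorator
```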
# Cost-Aware Architecture Patterns
## Pattern 1: Planner-Executor Split
Instead of letting an agent freely explore, use a two-phase approach:
- Planner (1 LLM call): Given the task, output a step-by-step plan
- Executor (N tool calls): Execute each step with minimal context
This prevents context bloat and gives you predictable costs.
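A minimal sketch of the split, with hypothetical `llm_call` and `run_step` callables. Plan parsing is elided: the planner is assumed to return a list of step strings.

```python
def run_planner_executor(task, llm_call, run_step):
    # Phase 1: ONE LLM call produces a fixed plan (a list of steps).
    plan = llm_call(f"Break this task into numbered steps: {task}")
    results = []
    for step in plan:
        # Phase 2: each step sees only the task and its own step, not
        # the whole transcript, so context stays flat instead of growing.
        results.append(run_step(task, step))
    return results
```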
## Pattern 2: Hierarchical Agents
For complex tasks, use a hierarchy:
- Manager Agent: Coordinates, uses GPT-4o
- Worker Agents: Execute specific subtasks, use GPT-4o-mini
The manager handles high-level reasoning; workers handle grunt work.
## Pattern 3: Human-in-the-Loop Checkpoints
For expensive operations, pause and ask:
```
Agent: I'm about to perform 15 web searches. Estimated cost: $0.50
Proceed? [Y/n]
```
This prevents runaway costs and gives users control.
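A checkpoint can be a small guard function. The per-search cost and the $0.25 threshold below are illustrative, and `confirm` is injectable so it can be a CLI prompt, a Slack ping, or auto-approve in tests:

```python
COST_PER_SEARCH = 0.033  # illustrative per-search cost estimate, in dollars

def checkpoint(n_operations, confirm, threshold=0.25):
    """Ask for confirmation only when the estimated cost passes a threshold."""
    estimated = n_operations * COST_PER_SEARCH
    if estimated <= threshold:
        return True  # cheap enough: no need to interrupt the user
    return confirm(f"About to run {n_operations} searches, "
                   f"est. ${estimated:.2f}. Proceed?")
```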
# Monitoring: What to Track
Don't just track "requests per day." Track:
- Tokens per task (not per request)
- Tool calls per task
- Context size at each step
- Retry rate
- Cost per successful task completion
Set alerts for anomalies. A sudden spike in tool calls often indicates a loop.
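A per-task tracker for these metrics fits in a small dataclass. All names here are illustrative; `record_step` would be called wherever your agent makes an LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class TaskMetrics:
    tokens: int = 0
    tool_calls: int = 0
    retries: int = 0
    context_sizes: list = field(default_factory=list)  # input tokens per step

    def record_step(self, input_tokens, output_tokens, is_retry=False):
        self.tokens += input_tokens + output_tokens
        self.tool_calls += 1
        self.retries += int(is_retry)
        self.context_sizes.append(input_tokens)

    def looks_like_a_loop(self, max_tool_calls=10):
        # A sudden pile-up of tool calls is the loop signal described above.
        return self.tool_calls > max_tool_calls
```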
# Real-World Cost Comparison
Same research task, three approaches:
| Approach | Tool Calls | Total Tokens | Cost |
|---|---|---|---|
| Naive agent | 45 | 380K | $1.52 |
| With iteration limit (10) | 10 | 85K | $0.34 |
| Optimized (all fixes) | 8 | 28K | $0.11 |
14x cost reduction from the naive approach—and the optimized version often produces better results because it's forced to be focused.
# Framework-Specific Tips
## LangChain
- Use the `max_iterations` parameter
- Implement a custom `OutputParser` to compress results
- Consider `RunnableWithMessageHistory` with summarization
## AutoGen
- Set `max_consecutive_auto_reply`
- Use `GroupChat` with speaker selection for routing
- Implement cost tracking via `get_total_usage()`
## CrewAI
- Set `max_iter` on agents
- Use `memory` features wisely (they can accumulate tokens)
- Route via `manager_llm` vs the agent-level `llm`
# Calculate Your Agent Costs
Before building, estimate. After building, measure.
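A rough pre-build estimator, assuming context grows by a fixed amount per step. All defaults are illustrative inputs; the prices are the GPT-4o rates quoted earlier:

```python
def estimate_task_cost(tool_calls, base_context=2_000, growth_per_step=2_000,
                       output_per_step=250, in_price=2.50, out_price=10.00):
    """Project per-task cost in dollars before you build (back-of-envelope)."""
    # Step k re-sends the base context plus k appended observations.
    input_tokens = sum(base_context + k * growth_per_step
                       for k in range(tool_calls + 1))
    output_tokens = output_per_step * (tool_calls + 1)
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

print(f"${estimate_task_cost(8):.2f}")  # an 8-tool-call task under these assumptions
```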
# Conclusion
AI agents are powerful, but they're also expensive by nature. Every tool call is a full LLM invocation with growing context.
The fixes aren't complicated:
- Limit iterations (hard cap at 8-10)
- Compress context (summarize, don't append)
- Route by task (cheap models for simple decisions)
- Batch operations (fewer round-trips)
- Monitor tokens per task (not just API calls)
For more on optimizing LLM costs, see our Batch vs Live API and Prompt Caching guides.
TokenBurner Team
AI Infrastructure Engineers
Engineers with hands-on experience building production AI systems. We've debugged runaway agent costs and learned how to build efficient agentic workflows.