Insights · 2026-01-08 · 4 min read

Escape the API Mines: How Far Can One RTX 4090 Actually Go?

I bought a $1,800 GPU to stop paying OpenAI. Then I learned that VRAM doesn't care about my feelings. Here is the brutal math on running Llama-3 70B on a single card.

local-llm · hardware · rtx-4090 · llama-3 · vram-optimization

I was looking at my OpenAI usage dashboard, thinking:

“I’m burning $200/month renting intelligence. If I buy a GPU, it pays for itself in 9 months. Free tokens forever.”

So I did what any rational engineer with poor impulse control does: I bought an RTX 4090 (24GB).

My plan was simple:

  1. Install Ollama or ExLlamaV2.
  2. Download Llama-3 70B.
  3. Fire OpenAI.

Then I hit run, and my computer froze for 45 seconds before spitting out one token per second.

I didn't escape the API mines. I just bought a very expensive space heater.

Here is the technical reality check nobody gives you before you swipe your card.

# 1. The Misconception: "24GB is Huge"

In gaming, 24GB VRAM is god-tier. In LLM land, 24GB is a studio apartment.

Most people assume:

“70B is just a number. Compression (Quantization) is magic. It’ll fit.”

Nope.

Math doesn't care about your optimism.

Llama-3 70B has 70 billion parameters. At FP16 (standard precision), that’s 70B * 2 bytes = 140GB.

Your 4090 has 24GB.

You are trying to park a Boeing 747 in a residential garage.
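
If you want to sanity-check that arithmetic yourself, here is the whole calculation in a few lines of plain Python (no libraries, just the parameter count and the bytes per weight):

```python
# Back-of-the-envelope: VRAM needed just to hold the weights at FP16,
# before you account for the KV cache or activations.
params = 70e9          # Llama-3 70B parameter count
bytes_per_param = 2    # FP16 stores each weight in 2 bytes

weights_gb = params * bytes_per_param / 1e9
print(f"FP16 weights: {weights_gb:.0f} GB")  # -> 140 GB
print("RTX 4090 VRAM: 24 GB")
```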

# 2. The "Quantization" Gamble

"But what about 4-bit quantization?" you ask.

Let's look at the actual GGUF sizes for a 70B model:

  • Q8_0 (8-bit): ~75 GB (Need 4x 3090s)
  • Q4_K_M (4-bit): ~42 GB (Need 2x 3090s/4090s)
  • Q2_K (2-bit): ~26 GB (Still doesn't fit on one card)

Even if you crush the model down to 4-bit (which is the industry standard for "usable intelligence"), you need 42GB of VRAM.

With a single 4090, you are short by 18GB.

⚠️ Warning: Measure before you buy

I realized this after the card arrived. Don't be like me. Check the VRAM Calculator first. The difference between Q4 and Q2 is massive.
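
Here is a minimal "will it fit" sketch along the lines of what such a calculator does. The bits-per-weight values are approximate effective numbers for each GGUF quant (real files vary a little because some tensors stay at higher precision), so treat the output as a rough lower bound:

```python
# Rough GGUF size from effective bits-per-weight; real files differ slightly
# because embeddings and a few tensors are kept at higher precision.
PARAMS_B = 70   # billions of parameters (Llama-3 70B)
VRAM_GB = 24    # a single RTX 4090

quants = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q2_K": 3.0}  # approx effective bits/weight

for name, bpw in quants.items():
    size_gb = PARAMS_B * bpw / 8
    verdict = "fits" if size_gb <= VRAM_GB else f"short by {size_gb - VRAM_GB:.0f} GB"
    print(f"{name:7s} ~{size_gb:4.0f} GB -> {verdict}")
```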

# 3. The "Offloading" Lie

The internet will tell you:

"Just offload the rest to your System RAM! It’s fine!"

It is not fine.

When you split a model between GPU (VRAM) and CPU (DDR5 RAM), the layers left on the CPU are limited by system RAM bandwidth, and every token shuttles activations back and forth over the PCIe bus. Both are an order of magnitude slower than on-card VRAM.

The Speed Penalty:

  • Full GPU offload: ~40-60 tokens/sec (Instant coding assistance)
  • Mixed CPU/GPU: ~2-4 tokens/sec (Painfully slow reading speed)

If you are building a RAG app or an Agent loop, 3 tokens/second is useless. You will wait 5 minutes for a code refactor that GPT-4o does in 10 seconds.
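
For the curious, this is roughly what a split load looks like with llama-cpp-python. The model path is a placeholder, and the layer count is my guess at what 24GB can hold for a Q4 70B (the model has 80 transformer layers in total):

```python
from llama_cpp import Llama  # pip install llama-cpp-python, built with CUDA enabled

# Llama-3 70B has 80 transformer layers. At Q4_K_M only roughly half of them
# fit in 24 GB of VRAM; the rest run on the CPU, which is where the speed dies.
llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # layers kept in VRAM (-1 would mean "offload everything")
    n_ctx=4096,
)

out = llm("Rewrite this recursive function iteratively:\n", max_tokens=256)
print(out["choices"][0]["text"])
```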

# 4. So... What Can One 4090 Actually Run?

If you stick to a single card, you have to choose: High IQ (Slow) or Medium IQ (Fast).

# The Sweet Spot: 30B - 35B Models

This is where the 4090 actually shines.

  • Yi-34B (Q4): ~20GB. Fits entirely in VRAM.
  • Speed: 50+ tokens/sec.
  • Quality: Better than GPT-3.5, slightly below GPT-4.

# The "Mixture of Experts" (Mixtral 8x7B)

  • Mixtral 8x7B (Q4): ~26GB. That already overflows 24GB before you allocate a single token of context.
  • Hack: Drop to Q3_K_M (~20GB) and it fits, with headroom left over for the KV cache (see the sketch below).
  • Result: This is currently the best coding assistant you can run on a single card.
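
The reason context length matters at all: the KV cache grows linearly with it and competes with the weights for the same VRAM. A rough estimate, assuming Mixtral's usual config (32 layers, 8 KV heads, head dimension 128) and an FP16 cache; check the config of the exact GGUF you download:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2  # Mixtral-8x7B-ish, FP16 cache

def kv_cache_gb(context_tokens: int) -> float:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * context_tokens / 1e9

for ctx in (4096, 8192, 32768):
    print(f"{ctx:6d} tokens -> {kv_cache_gb(ctx):.1f} GB of KV cache")
# ~0.5, ~1.1, ~4.3 GB: stacked on a ~20 GB Q3_K_M file, a 32k context blows past 24 GB.
```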

# The "Lobotomy" Option (70B at IQ2_XXS)

You can run Llama-3 70B on one card if you use IQ2_XXS quantization (approx 2.0 bits per weight).

  • Size: ~22GB.
  • Result: It runs fast, but it's brain-damaged. It forgets instructions, hallucinates libraries, and fails logic tests that the 8B model passes.

Don't run a lobotomized 70B just to say you're running 70B.

# 5. The Hidden Money Pits (Hardware Edition)

API costs are visible. Hardware costs are invisible until you check the meter.

  1. Electricity: My 4090 rig pulls ~500W under load. If I run it 24/7 as a server, that's ~360 kWh a month, about $54 in electricity alone at my ~$0.15/kWh rate (the math is spelled out after this list).
  2. The "Second Card" Trap: Once you realize 24GB isn't enough, you'll want a second card. But 4090s are huge. You'll need a new motherboard, a massive case, and a 1600W PSU. Suddenly your "$1,800 project" is a "$4,000 workstation."

# Conclusion: My Survival Strategy

I didn't sell the card. But I stopped trying to force Llama-3 70B into it.

My Daily Driver Stack:

  1. Coding: DeepSeek-Coder-33B (Q4). Fits perfectly. Fast completion.
  2. General Chat: Llama-3 8B (FP16). Lightning fast (100+ t/s).
  3. Complex Logic: API (Claude 3.5 Sonnet).

I use the GPU for the 90% of work that is "dumb tasks" (autocomplete, simple refactors, summarization) and pay the API for the 10% that is "genius tasks."
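
A minimal sketch of that routing split. Everything here is hypothetical scaffolding: the task labels and the run_local / run_api helpers stand in for whatever clients you actually wire up (an Ollama HTTP call, the Anthropic SDK, and so on):

```python
# Hypothetical router: the local GPU handles the routine 90%, the API gets the hard 10%.
CHEAP_TASKS = {"autocomplete", "simple_refactor", "summarize"}

def run_local(prompt: str) -> str:
    # Placeholder: call your local model here (e.g. DeepSeek-Coder-33B on the 4090).
    raise NotImplementedError

def run_api(prompt: str) -> str:
    # Placeholder: call your hosted model here (e.g. Claude 3.5 Sonnet).
    raise NotImplementedError

def route(task: str, prompt: str) -> str:
    # Route by task type, not prompt length: the cheap tasks dominate the volume.
    return run_local(prompt) if task in CHEAP_TASKS else run_api(prompt)
```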

That cut my API bill from $200 -> $20.

If you are browsing eBay for used 3090s right now, stop. Do the math first. Check if the specific model + quantization + context window you want actually fits in the VRAM you are buying.

# 👉 Can I Run It? Check VRAM Calculator