I’m looking at my OpenAI usage dashboard thinking:
“I’m burning $200/month renting intelligence. If I buy a GPU, it pays for itself in 9 months. Free tokens forever.”
So I did what any rational engineer with poor impulse control does: I bought an RTX 4090 (24GB).
My plan was simple:
- Install ollama or ExLlamaV2.
- Download Llama-3 70B.
- Fire OpenAI.
Then I hit run, and my computer froze for 45 seconds before spitting out one token per second.
I didn't escape the API mines. I just bought a very expensive space heater.
Here is the technical reality check nobody gives you before you swipe your card.
# 1. The Misconception: "24GB is Huge"
In gaming, 24GB VRAM is god-tier. In LLM land, 24GB is a studio apartment.
Most people assume:
“70B is just a number. Compression (Quantization) is magic. It’ll fit.”
Nope.
Math doesn't care about your optimism.
Llama-3 70B has 70 billion parameters.
At FP16 (standard precision), that’s 70B * 2 bytes = 140GB.
Your 4090 has 24GB.
You are trying to park a Boeing 747 in a residential garage.
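The back-of-envelope math, in case you want to run it yourself (pure arithmetic, no libraries):

```python
# Weight memory for Llama-3 70B at FP16: params x bytes-per-param.
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # FP16 = 2 bytes per weight

weights_gb = params * bytes_per_param / 1e9
print(f"FP16 weights: {weights_gb:.0f} GB")                      # ~140 GB
print(f"Short by:     {weights_gb - 24:.0f} GB on a 24GB card")  # and that's before the KV cache
```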
# 2. The "Quantization" Gamble
"But what about 4-bit quantization?" you ask.
Let's look at the actual GGUF sizes for a 70B model:
- Q8_0 (8-bit): ~75 GB (Need 4x 3090s)
- Q4_K_M (4-bit): ~42 GB (Need 2x 3090s/4090s)
- Q2_K (2-bit): ~26 GB (Still doesn't fit on one card)
Even if you crush the model down to 4-bit (which is the industry standard for "usable intelligence"), you need 42GB of VRAM.
With a single 4090, you are short by 18GB.
⚠️ Warning: Measure before you buy
I realized this after the card arrived. Don't be like me. Check the VRAM Calculator first. The difference between Q4 and Q2 is massive.
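If you just want a rough number without a web calculator, the estimate is one line of arithmetic. The bits-per-weight values below are approximations (K-quants store per-block scales and metadata, so "4-bit" is really ~4.8 bpw); they're tuned to land near published GGUF file sizes, not exact:

```python
# Rough GGUF weight-file size: params x bits-per-weight / 8.
# bpw values are approximate; K-quants carry per-block scale overhead.
APPROX_BPW = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q2_K": 3.0}

def est_weights_gb(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 1e9

for quant, bpw in APPROX_BPW.items():
    print(f"70B @ {quant:7s}: ~{est_weights_gb(70, bpw):.0f} GB")
# 70B @ Q8_0   : ~74 GB
# 70B @ Q4_K_M : ~42 GB
# 70B @ Q2_K   : ~26 GB
```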
# 3. The "Offloading" Lie
The internet will tell you:
"Just offload the rest to your System RAM! It’s fine!"
It is not fine.
When you split a model between GPU (VRAM) and CPU (DDR5 RAM), you are bottlenecked by system RAM bandwidth and PCIe transfers, both of which are an order of magnitude slower than on-card VRAM.
The Speed Penalty:
- Full GPU offload: ~40-60 tokens/sec (Instant coding assistance)
- Mixed CPU/GPU: ~2-4 tokens/sec (Painfully slow reading speed)
If you are building a RAG app or an Agent loop, 3 tokens/second is useless. You will wait 5 minutes for a code refactor that GPT-4o does in 10 seconds.
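For reference, this is what the split looks like in practice. A minimal llama-cpp-python sketch; the GGUF path is hypothetical and the layer count is a guess at what fits in 24GB:

```python
from llama_cpp import Llama

# Hypothetical local file; a 70B Q4_K_M GGUF is ~42 GB on disk.
llm = Llama(
    model_path="./llama-3-70b-instruct.Q4_K_M.gguf",
    n_gpu_layers=40,   # roughly half of the 80 layers fit in 24 GB; the rest run on CPU
    n_ctx=4096,
)

# Every token has to cross the GPU/CPU boundary, which is where
# the 2-4 tokens/sec figure comes from.
out = llm("Refactor this function to be iterative:", max_tokens=256)
print(out["choices"][0]["text"])
```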
# 4. So... What Can One 4090 Actually Run?
If you stick to a single card, you have to choose: High IQ (Slow) or Medium IQ (Fast).
# The Sweet Spot: 30B - 35B Models
This is where the 4090 actually shines.
- Yi-34B (Q4): ~20GB. Fits entirely in VRAM (loading sketch after this list).
- Speed: 50+ tokens/sec.
- Quality: Better than GPT-3.5, slightly below GPT-4.
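Loading one of these is the happy path: everything on the card, nothing offloaded. A sketch with llama-cpp-python, assuming a hypothetical ~20GB Q4 GGUF of Yi-34B; `n_gpu_layers=-1` pushes every layer into VRAM:

```python
from llama_cpp import Llama

# Assumption: a ~20 GB Q4_K_M GGUF of Yi-34B already downloaded.
llm = Llama(
    model_path="./yi-34b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1 = offload all layers; the whole model lives in VRAM
    n_ctx=4096,
)

out = llm("Explain Python's GIL in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```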
# The "Mixture of Experts" (Mixtral 8x7B)
- Mixtral 8x7B (Q4): ~26GB.
- Hack: With a high context window, the Q4 version overflows 24GB. But with Q3_K_M (~20GB), it fits perfectly (rough KV-cache math after this list).
- Result: This is currently the best coding assistant you can run on a single card.
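The reason the context window matters: the KV cache sits in VRAM on top of the weights. A rough FP16 estimate for Mixtral 8x7B, using architecture numbers (32 layers, 8 KV heads via GQA, head_dim 128) taken as assumptions from the published config:

```python
# Rough FP16 KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes x tokens.
def kv_cache_gb(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1e9

for ctx in (4_096, 16_384, 32_768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
#   4096 tokens -> ~0.5 GB
#  16384 tokens -> ~2.1 GB
#  32768 tokens -> ~4.3 GB
```

Stack ~4GB of cache on top of a ~26GB Q4 model and the single-card budget is gone; drop to Q3_K_M and the same cache fits.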
# The "Lobotomy" Option (70B at IQ2_XXS)
You can run Llama-3 70B on one card if you use IQ2_XXS quantization (approx 2.0 bits per weight).
- Size: ~22GB.
- Result: It runs fast, but it's brain-damaged. It forgets instructions, hallucinates libraries, and fails logic tests that the 8B model passes.
Don't run a lobotomized 70B just to say you're running 70B.
# 5. The Hidden Money Pits (Hardware Edition)
API costs are visible. Hardware costs are invisible until you check the meter.
- Electricity: My 4090 rig pulls ~500W under load. Run it 24/7 as a server and, at roughly $0.15/kWh, that's ~$54/month in electricity alone (math after this list).
- The "Second Card" Trap: Once you realize 24GB isn't enough, you'll want a second card. But 4090s are huge. You'll need a new motherboard, a massive case, and a 1600W PSU. Suddenly your "$1,800 project" is a "$4,000 workstation."
# Conclusion: My Survival Strategy
I didn't sell the card. But I stopped trying to force Llama-3 70B into it.
My Daily Driver Stack:
- Coding: DeepSeek-Coder-33B (Q4). Fits perfectly. Fast completion.
- General Chat: Llama-3 8B (FP16). Lightning fast (100+ t/s).
- Complex Logic: API (Claude 3.5 Sonnet).
I use the GPU for the 90% of "dumb tasks" (autocomplete, simple refactors, summarization) and pay the API for the 10% of "genius tasks."
That cut my API bill from $200 -> $20.
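The routing itself is boring. A minimal sketch of the 90/10 split, assuming ollama is serving the local models through its OpenAI-compatible endpoint and an Anthropic key is in the environment; the model tags and the `genius` flag are placeholders, and in practice I mostly pick the tool by hand:

```python
import anthropic
from openai import OpenAI

# Local models via ollama's OpenAI-compatible endpoint (assumption: ollama is
# running and the model below has been pulled).
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, genius: bool = False) -> str:
    if genius:
        # The 10%: multi-step reasoning, gnarly architecture questions.
        msg = cloud.messages.create(
            model="claude-3-5-sonnet-20240620",  # assumption: current model tag
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    # The 90%: autocomplete, simple refactors, summarization -- free tokens.
    resp = local.chat.completions.create(
        model="deepseek-coder:33b",  # assumption: the tag you pulled into ollama
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Summarize this stack trace in one sentence: ..."))
```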
If you are browsing eBay for used 3090s right now, stop. Do the math first. Check if the specific model + quantization + context window you want actually fits in the VRAM you are buying.
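And if you want that check in one function, here is the whole thing stitched together, using the same rough formulas as above; every number is an estimate:

```python
# "Does it fit?" = quantized weights + FP16 KV cache + a little runtime overhead.
def fits(params_b, bpw, ctx, layers, kv_heads, head_dim, vram_gb=24.0, overhead_gb=1.5):
    weights_gb = params_b * 1e9 * bpw / 8 / 1e9
    kv_gb = 2 * layers * kv_heads * head_dim * 2 * ctx / 1e9
    total = weights_gb + kv_gb + overhead_gb
    return total <= vram_gb, total

# Llama-3 70B @ Q4_K_M (~4.8 bpw), 8k context; 80 layers, 8 KV heads, head_dim 128.
ok, total = fits(params_b=70, bpw=4.8, ctx=8192, layers=80, kv_heads=8, head_dim=128)
print(f"~{total:.0f} GB needed vs 24 GB -> {'fits' if ok else 'does not fit'}")   # ~46 GB, nope
```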