Shared from "Yt Trans" on Inkdown

════════════════════════════════════════

Optimize Your AI - Quantization Explained

Matt Williams · 12:09 · 2024-12-28


What This Is Actually About

AI models are massive collections of numbers (billions of them), traditionally stored at 32-bit precision, which demands enormous amounts of RAM. A 7B-parameter model needs 28GB just to store its weights. Quantization compresses these numbers into lower-precision formats (Q2, Q4, Q8), and context quantization compresses the conversation history, so multi-billion-parameter models can run on ordinary laptops with only a few gigabytes of memory.


Key Points

The Memory Math That Blocks Most Users

A 7-billion-parameter model stored at standard 32-bit precision requires 4 bytes per parameter: 7B × 4 = 28GB of RAM just for storage. This exceeds the memory of most consumer GPUs and requires $2,000 to $3,000 of hardware to run a single model.
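
The arithmetic above generalizes to any bit width. The following is a minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting only the raw weights; real quantized files carry some extra metadata, so actual sizes run a little higher. The function name `weight_memory_gb` is illustrative, not part of any tool mentioned here.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8  # 8 bits per byte
    return bytes_total / 1e9


if __name__ == "__main__":
    params = 7e9  # 7 billion parameters, as in the example above
    # Nominal bit widths only; real quantized formats add per-block overhead.
    for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
        print(f"{label:>4}: {weight_memory_gb(params, bits):6.2f} GB")
```

At 32 bits this reproduces the 28GB figure; at a nominal 4 bits the same model needs about 3.5GB.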

How Quantization Actually Works

Quantization reduces numeric precision. Think of it as switching rulers: 32-bit is micrometer-precision, Q8 measures in centimeters, Q4 marks every 5cm, and Q2 is a rough approximation. Each step reduces memory in proportion to the bits dropped: Q8 cuts storage by roughly 4× compared to 32-bit precision, and Q4 by roughly 8×.
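
The ruler analogy can be made concrete with a toy round trip. This is a minimal sketch of plain uniform (symmetric, round-to-nearest) quantization, assuming a single scale for the whole list; it is not the scheme used by any particular model format, and the sample weights are made up for illustration.

```python
def quantize_roundtrip(values, bits):
    """Quantize to signed integer codes of the given width, then map back to floats."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / levels              # width of one "tick" on the ruler
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return [c * scale for c in codes]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up sample weights, for illustration only.
weights = [0.82, -0.41, 0.07, -0.93, 0.26, 0.55, -0.12, 0.68]

if __name__ == "__main__":
    for bits in (8, 4, 2):
        restored = quantize_roundtrip(weights, bits)
        print(f"{bits}-bit mean abs error: {mean_abs_error(weights, restored):.4f}")
```

Fewer bits mean coarser ticks, so the restored values drift further from the originals as the width shrinks.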

K-Quants Use Smart Grouping, Not Uniform Compression

Models tagged with K (Q4_K_M, Q4_K_L) use k-quants, which create multiple "mail rooms" for different number ranges. Small numbers get precise slots; large numbers get appropriately sized spaces. This adapts to the data distribution rather than forcing everything into identical boxes. The K_S/K_M/K_L suffixes (small/medium/large) indicate how much detail is kept when tracking how values map.
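
The grouping idea can be illustrated by giving each block of weights its own scale instead of sharing one scale across everything. This is a deliberately simplified sketch, assuming fixed-size blocks and one scale per block; it is not the actual k-quant layout, and the sample weights and `block_size` are made up for illustration.

```python
def quantize_with_scale(values, bits):
    """Round-trip values through signed integer codes using one shared scale."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def quantize_blockwise(values, bits, block_size):
    """Round-trip values block by block, so each block gets its own scale."""
    restored = []
    for start in range(0, len(values), block_size):
        restored.extend(quantize_with_scale(values[start:start + block_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights: four small values followed by four large ones.
weights = [0.010, -0.020, 0.015, -0.005, 0.900, -0.700, 0.800, -0.600]

global_err = mean_abs_error(weights, quantize_with_scale(weights, 4))
block_err = mean_abs_error(weights, quantize_blockwise(weights, 4, block_size=4))

if __name__ == "__main__":
    print(f"single scale: {global_err:.4f}")
    print(f"per-block scale: {block_err:.4f}")
```

With a single scale the large values dictate the tick size and the small values all collapse to zero; with per-block scales the small values keep their own finer ruler.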

Context Quantization Slashes Conversation Memory

Modern models can handle 128K tokens of context, the equivalent of remembering entire books, and that conversation history consumes significant RAM. Context quantization compresses the KV cache (the key-value cache that stores prior conversation state). With flash attention enabled (OLLAMA_FLASH_ATTENTION=1) and the KV cache quantized to Q8 (OLLAMA_KV_CACHE_TYPE=q8_0; the default is f16), memory usage drops substantially.
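
One way to apply these settings is to build the environment in a small helper before starting the server. This is a sketch under assumptions: the two variable names come from the text above, the accepted cache types (f16, q8_0, q4_0) follow Ollama's documentation, and the helper `ollama_env` itself is hypothetical.

```python
import os

# Cache types accepted by OLLAMA_KV_CACHE_TYPE per Ollama's documentation.
ALLOWED_KV_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def ollama_env(kv_cache_type: str = "q8_0", flash_attention: bool = True) -> dict:
    """Return a copy of the current environment with the two settings applied."""
    if kv_cache_type not in ALLOWED_KV_CACHE_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type!r}")
    env = dict(os.environ)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


# Example launch, commented out so the sketch runs without Ollama installed:
# import subprocess
# subprocess.run(["ollama", "serve"], env=ollama_env("q8_0"))
```

The variables have to be set for the server process, not for the client that sends prompts, which is why the sketch passes them when launching `ollama serve`.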

Demo: 10GB+ Memory Savings on a 7B Model

Using Qwen 2.5 7B at 32K context window:

  • Default: peaks at ~40.9GB RAM
  • With flash attention: drops to ~33.7GB (saves ~7GB)
  • With flash attention + Q8 KV cache: drops to ~30.6GB (saves ~10GB total)

Results vary by model—one IBM model performed worse with Q8 KV cache than without it.

Practical Selection Guide

Start with Q4_K_M (Ollama's current default). If generation quality degrades, move up to Q8 or FP16. If quality holds, try Q2 for maximum memory savings. Combine quantized weights with quantized context to run large models on limited hardware.
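
The selection procedure above can be written as a short search over a ladder of quantization levels. This is a sketch, assuming you supply your own `quality_ok` check (a hypothetical callback that runs your prompts at a given level and returns True if the output is acceptable); the level labels are the ones used in this summary, not exact model tags.

```python
# Levels ordered from smallest memory footprint to largest.
LADDER = ["Q2", "Q4_K_M", "Q8", "FP16"]


def choose_quant(quality_ok, ladder=LADDER, start="Q4_K_M"):
    """Pick a level: start in the middle, step down while quality holds, up if it fails.

    quality_ok is a user-provided callable(level) -> bool (hypothetical).
    """
    idx = ladder.index(start)
    if quality_ok(ladder[idx]):
        # Quality holds: keep trying smaller levels while they stay acceptable.
        while idx > 0 and quality_ok(ladder[idx - 1]):
            idx -= 1
        return ladder[idx]
    # Quality degraded: move up until a level passes, ending at the largest.
    while idx < len(ladder) - 1:
        idx += 1
        if quality_ok(ladder[idx]):
            return ladder[idx]
    return ladder[-1]
```

For example, `choose_quant(lambda level: level in {"Q8", "FP16"})` returns `"Q8"`, because the starting level fails and Q8 is the first step up that passes.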


If You Remember Nothing Else

  • A 7B parameter model at full precision needs 28GB RAM; quantization makes it runnable on consumer hardware.
  • Q2/Q4/Q8 represent progressively more precise storage; Q4 strikes the best balance for most use cases.
  • Context quantization plus flash attention can cut memory usage by 10GB+ on large context windows.
  • Start at Q4, test your use case, then optimize: move up if quality suffers, down if it runs smoothly.

Watch Out For

  • Context quantization benefits vary by model: one IBM model that was tested used more memory with the Q8 KV cache enabled than without it.
  • Demo showed results on Mac with unified memory; Windows/Linux GPU behavior may differ.
  • Memory measurements were eyeball estimates read from asitop during runtime, not precise instrumentation.

════════════════════════════════════════