LLMs for the Regarded (A Survival Manual for the Locally Curious)

You Don't Talk to a Model — You Talk to Math
When you 'run' a model, you're doing inference — not magic, just prediction. It chops your words into chunks called tokens, turns them into numbers, and guesses what comes next. Then it appends that token and guesses again, one token at a time. Every sentence is just high-speed autocomplete with a caffeine problem.

Tokens Are Not Words
Models don't read words; they read tokens. 'Hello' might be one token, 'Internationalization' could be eight, and '\n' is one token of pure chaos. A model can only see a limited number of tokens at once (its context window). More context = slower model + hotter GPU.

The Brains Are Just Numbers
Inside every model live billions of tiny numbers called weights. They're patterns, not facts — knobs that help the model guess better. Loading those weights into memory is why your GPU screams for mercy.

Transformers — The Skeleton Crew
Transformers are stacks of repeating layers that decide which previous tokens matter (self-attention) and then process that information (feed-forward). Do that 30–100 times and boom — cat facts.

VRAM: The Real Boss
All those weights and calculations live in your GPU memory. A 7B model at 16-bit precision needs roughly 14 GB. Quantization shrinks it (8-bit ≈ 7 GB, 4-bit ≈ 3.5 GB). The KV cache grows as your chat does — like a goldfish that refuses to die.

Quantization: The Art of Making It Fit
Quantization means telling the model, 'You don't need 32-bit precision to tell me a frog joke.' You save memory and gain speed, but sometimes it forgets math or hallucinates history. 4-bit is the sweet spot for sanity.

Generation: The Token Factory
Once loaded, the model loops: predict the next token → append it → repeat. You can dial randomness up (temperature) or force determinism (greedy decoding). Each new token re-runs the whole model — your GPU's personal Groundhog Day.

Why Run Locally?
Control, privacy, speed. No API fees, no internet dependency. You tweak everything yourself — decoding, context, quantization. You also learn why your PC sounds like a jet engine.

The Gotchas Nobody Tells You
Out of memory? Quantize. Gibberish? Wrong chat template. Slow? You're probably offloading layers to the CPU. Unsafe? Avoid random .bin files; use safetensors.

The Big Picture
Running an LLM is simple: Text → Tokens → Numbers → Math → Probabilities → Token → Repeat. It's not magic; it's statistics with a side of silicon sweat.

The Gospel of Local LLMs
They're math, not magic. VRAM is your god. Quantization is your religion. FlashAttention is your prayer. Chat templates are your commandments. Temperature = chaos.

Congratulations — you now understand local LLMs better than half the internet. Next time someone says 'just run this script,' ask: "Cool, but what precision are those weights?" Then enjoy the silence.
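For the hands-on types, a few minimal sketches of the ideas above. They assume the Hugging Face transformers library and PyTorch (other stacks like llama.cpp work the same way conceptually). First, the token factory itself — text → tokens → numbers → math → probabilities → token → repeat — using tiny GPT-2 so it runs anywhere:

```python
# A minimal sketch of the "predict one token, append, repeat" loop,
# assuming the transformers library and the small GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Cats are"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # text -> tokens -> numbers

with torch.no_grad():
    for _ in range(20):                      # generate 20 new tokens
        logits = model(input_ids).logits     # one score per vocabulary entry
        next_id = logits[0, -1].argmax()     # greedy: take the most likely token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # append and go again

print(tokenizer.decode(input_ids[0]))        # numbers -> tokens -> text
```

That inner loop is the whole product; everything else (KV caching, batching, FlashAttention) exists to make it less slow.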
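The VRAM numbers above are just arithmetic. A back-of-the-envelope sketch — the KV cache figure assumes a Llama-2-7B-style layout (32 layers, 32 attention heads of dimension 128, 16-bit cache), so treat it as an estimate, not a spec:

```python
# Back-of-the-envelope VRAM math. Weight memory is parameters x bytes per weight;
# the KV cache stores keys and values for every token seen so far.

def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int = 32, heads: int = 32,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Keys + values (hence the factor of 2) cached per token of the chat."""
    return 2 * layers * heads * head_dim * bytes_per_value * tokens / 1e9

for bits in (16, 8, 4):
    print(f"7B weights @ {bits}-bit ~= {weight_gb(7, bits):.1f} GB")
# 16-bit ~= 14.0 GB, 8-bit ~= 7.0 GB, 4-bit ~= 3.5 GB

print(f"KV cache after 4096 tokens ~= {kv_cache_gb(4096):.1f} GB")  # ~= 2.1 GB
```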
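Loading in 4-bit is one line of config in transformers, assuming the bitsandbytes and accelerate packages are installed and you have a CUDA GPU; the model name below is a placeholder, not a recommendation:

```python
# A sketch of 4-bit loading via transformers + bitsandbytes (both assumed installed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # weights stored in 4-bit, math done in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder; any causal LM repo works
    quantization_config=quant_config,
    device_map="auto",                      # let accelerate place layers on GPU/CPU
)
```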
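Temperature versus greedy is also nothing mystical: temperature just rescales the scores before they become probabilities. A self-contained numpy sketch of that trade-off:

```python
# How temperature reshapes the next-token distribution. Higher temperature
# flattens the probabilities (more chaos); temperature -> 0 approaches greedy
# decoding (always the top token).
import numpy as np

def sample_next(logits: np.ndarray, temperature: float) -> int:
    if temperature == 0:                      # greedy: no randomness at all
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                    # numerical stability before exp
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(logits), p=probs))

logits = np.array([3.0, 1.5, 0.2])            # made-up scores for three tokens
print(sample_next(logits, temperature=0))     # always token 0
print(sample_next(logits, temperature=1.5))   # sometimes token 1 or 2
```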
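And the "wrong chat template" gotcha: instruct-tuned models expect your prompt wrapped in their own special tokens, and feeding them bare text is how you get gibberish. A sketch using the tokenizer's built-in template (model name again a placeholder; the printed format varies by model):

```python
# Applying a model's own chat template instead of hand-rolling the prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "Tell me a frog joke."}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string instead of token IDs
    add_generation_prompt=True,  # append the tokens that cue the assistant's turn
)
print(prompt)  # something like "<s>[INST] Tell me a frog joke. [/INST]" for Mistral-style models
```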