# You've used ChatGPT a thousand times. You still can't explain what happens be...
Canonical: https://social-archive.org/yena/96C8qHDfnk
Original URL: https://x.com/NainsiDwiv50980/status/2071855842966798824
Author: Nainsi Dwivedi
Platform: x
## Content
You've used ChatGPT a thousand times. You still can't explain what happens between hitting "enter" and the first word appearing.

That gap is exactly why most "AI Engineers" stay stuck at the API-call level, and why the ones who understand the pipeline get paid 3x more.

Here's what actually happens when an LLM generates a single word 👇

You type "What is gravity?"

1. Tokenizer
Your sentence isn't text to the model. It's broken into tokens (for example "grav" + "ity"), and each token is mapped to an integer ID, something like [26110, 879]. The exact splits and IDs depend on the tokenizer. The model never sees words. It sees numbers.

2. Embeddings
Each token ID becomes a vector with thousands of dimensions (4096 in Llama-2-7B, for example). This is where "meaning" starts to live: words become geometry.

3. Transformer Blocks (×96 in GPT-3; the count for GPT-4-class models isn't public)
This is the engine. Two parts:
→ Self-Attention (Q, K, V): each token looks at the tokens before it to figure out context. "It" knows what "it" refers to.
→ Feed-Forward Network: where most of the actual "thinking" and stored knowledge is believed to live.
Repeat this block dozens of times (96 in GPT-3). That's the depth.

4. The part nobody talks about → KV Cache
Here's the bottleneck that drives a large share of your infra bill. The model runs in two phases:
• Prefill: processes your whole prompt in parallel (compute-bound)
• Decode: generates ONE token at a time (memory-bandwidth-bound)
To avoid recomputing attention inputs for every new token, it caches the Key/Value tensors of all previous tokens. But that cache grows linearly with sequence length (and with batch size). Long context = a lot of GPU memory per request. This is a big part of why your tokens cost what they cost.

5. Linear + Softmax (LM Head)
The final vector gets projected onto the vocabulary (roughly 128K tokens in Llama 3, for example) to produce one score per token, and softmax turns those scores into probabilities. Now the model has a ranked guess for the next token.

6. Sampling Strategy
This is where "personality" comes from:
→ Greedy: always pick the top token (boring, deterministic)
→ Top-K / Top-P: sample only from the most likely candidates
→ Temperature: low = focused and repeatable, high = more varied (and more error-prone)
Same model. Wildly different output. Just from these knobs.

7. Speculative Decoding (the speed trick)
A small draft model guesses a few tokens ahead (say 4), and the big model verifies them in a single parallel pass. It keeps the drafted tokens up to the first one it rejects and discards the rest, so the output still follows the big model's distribution. This is how you get faster responses without a bigger GPU.

8. Detokenizer
Token IDs → back into readable text.

9. Streaming Output
Tokens appear one by one. That "typing" effect isn't just a UI animation. The model really does generate one token at a time.

Here's the uncomfortable truth: the people shipping real AI products in 2026 aren't memorizing this for interviews. They understand it because every one of these stages is a lever for cost, latency, quality, and scale.

Miss the pipeline, and you'll forever be guessing why your app is slow, expensive, or hallucinating. Understand it, and you stop being someone who *calls* the model and become someone who *engineers* it.

The window where this knowledge is rare is closing fast.
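
To make the KV cache point from step 4 concrete, here is a back-of-the-envelope sketch in Python. It assumes standard attention with one K and one V tensor per layer, stored in 16-bit precision. The function name `kv_cache_bytes` and the model configuration are hypothetical, picked only for illustration, not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    Each layer stores two tensors (K and V) per token, each with
    n_kv_heads * head_dim values. bytes_per_value=2 assumes fp16/bf16.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical configuration, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for seq_len in (1_024, 8_192, 32_768):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.2f} GiB")
# Prints 0.50, 4.00 and 16.00 GiB: the cache scales linearly with context.
```

Under these assumptions every extra token costs 0.5 MiB per request, which is why techniques that shrink the number of KV heads or the bytes per value translate directly into more concurrent users per GPU.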
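
Steps 5 and 6 (softmax, then sampling) can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not how any particular inference engine implements it: the logits are a toy list of four scores, and the helper names `softmax` and `sample_next_token` are made up for this example. Treating `temperature == 0` as greedy decoding is a common convention, assumed here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index using greedy, top-k and/or top-p (nucleus) sampling."""
    if temperature == 0:
        # Greedy: always return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    probs = softmax(logits, temperature)
    # Candidate indices, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set covering top_p of the mass
                break
        ranked = kept

    # random.choices renormalizes the weights of the surviving candidates.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(sample_next_token(toy_logits, temperature=0))           # greedy
print(sample_next_token(toy_logits, temperature=0.7, top_k=2))
print(sample_next_token(toy_logits, temperature=1.2, top_p=0.9))
```

Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution, which is the "knob" effect described in step 6.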
