You've used ChatGPT a thousand times.
You still can't explain what happens between hitting "enter" and the first word appearing.
That gap is exactly why most "AI Engineers" stay stuck at the API-call level, and why the ones who understand the pipeline get paid 3x more.
Here's what actually happens when an LLM generates a single word:
You type "What is gravity?"
- Tokenizer
Your sentence isn't text to the model. It's broken into tokens ("grav", "ity"), then mapped to IDs like [26110, 879]. The model never sees words. It sees numbers.
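To make that concrete, here is a minimal sketch of the text-to-IDs step. It assumes a tiny made-up vocabulary and a greedy longest-match rule; production tokenizers typically use learned BPE merges instead, and every piece and ID below other than the two quoted above is hypothetical.

```python
# Toy vocabulary. The pieces and IDs are made up for illustration,
# apart from the two IDs quoted in the post. Attaching the leading
# space to a piece is a common convention, assumed here.
VOCAB = {"What": 1, " is": 2, " grav": 26110, "ity": 879, "?": 3}

def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization over VOCAB (not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids

def decode(ids: list[int]) -> str:
    """Map IDs back to their pieces and join them (the detokenizer)."""
    id_to_piece = {i: p for p, i in VOCAB.items()}
    return "".join(id_to_piece[i] for i in ids)

print(encode("What is gravity?"))  # [1, 2, 26110, 879, 3]
```

The same table run in reverse is all a detokenizer does.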
- Embeddings
Each token ID becomes a high-dimensional vector (4096 dimensions in some models). This is where "meaning" starts to live: words become geometry.
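Under the hood this is just a table lookup. A rough sketch, assuming a toy vocabulary of 1,000 entries and 8 dimensions, with random values standing in for weights a real model would learn:

```python
import random

VOCAB_SIZE = 1000  # assumed toy size
DIM = 8            # assumed toy width; real models are far wider

rng = random.Random(0)
# One row of DIM floats per token ID. Random here, learned in practice.
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([17, 879])
print(len(vectors), len(vectors[0]))  # 2 8
```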
- Transformer Blocks (×96 in GPT-3; GPT-4's layer count is not public)
This is the engine. Two parts:
→ Self-Attention (Q, K, V): every token looks at the tokens that came before it to figure out context. This is how "it" gets linked to what it refers to.
→ Feed-Forward Network: where much of the stored knowledge is thought to live.
Repeat this 96 times. That's the depth.
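The attention half of a block can be sketched in a few lines. This is a single-head version in plain Python, assuming Q, K and V have already been computed (the learned projections are left out) and assuming the causal mask used by GPT-style models, where token i only sees tokens 0..i:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of equal-length vectors, one per token.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Score token i's query against the keys of tokens 0..i only.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Blend the value vectors using those weights.
        mixed = [
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ]
        out.append(mixed)
    return out

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(causal_self_attention(q, k, v))
```

The first token can only attend to itself, so its output is exactly its own value vector.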
- The part nobody talks about: KV Cache
Here's the bottleneck that decides your entire infra bill.
The model runs in two phases:
• Prefill: processes your whole prompt in parallel (compute-bound)
• Decode: generates ONE token at a time (memory-bound)
To avoid recomputing everything per token, it caches previous Key/Value pairs. But that cache grows linearly with sequence length, and it is multiplied by layer count, head count, and batch size. Long context = ballooning memory. That's a big part of why long prompts cost what they cost.
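A back-of-the-envelope calculation shows how fast this adds up. The shape below is an assumption for illustration (96 layers, 96 key/value heads of size 128, 2-byte fp16 values), not the configuration of any specific model:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to hold keys and values for every layer and position."""
    # The leading 2 = one tensor for keys plus one for values.
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)

# Assumed shape, see the note above.
for tokens in (1_000, 32_000, 128_000):
    gib = kv_cache_bytes(tokens, n_layers=96, n_kv_heads=96, head_dim=128) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```

Doubling the context doubles the cache, per request. Techniques such as sharing key/value heads across query heads shrink the `n_kv_heads` term.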
- Linear + Softmax (LM Head)
The final vector gets projected onto the vocabulary (around 128K tokens in some models; tokens, not words) and turned into probabilities. Now the model has a ranked guess for the next token.
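Here is a minimal sketch of that projection, assuming a 3-dimensional hidden vector and a 4-token vocabulary. The weight matrix is hypothetical and hand-picked so the numbers are easy to follow:

```python
import math

def lm_head(hidden, weight):
    """Project a hidden vector to one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [0.5, -1.0, 2.0]
weight = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # one row per token

logits = lm_head(hidden, weight)   # [0.5, -1.0, 2.0, 1.5]
probs = softmax(logits)
ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(ranked)  # [2, 3, 0, 1]
```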
- Sampling Strategy
This is where "personality" comes from:
→ Greedy: always pick the top token (boring, deterministic)
→ Top-K / Top-P: sample from the likely candidates
→ Temperature: low = focused and predictable, high = varied and more random
Same model. Wildly different output. Just from this knob.
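All three knobs fit in one small function. This is a sketch, not any particular library's sampler; the parameter names mirror common usage but are assumptions here, and treating temperature 0 as greedy is a convention chosen for the example:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token ID from raw logits."""
    if temperature == 0:
        # Greedy: always the single highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidates sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        # Keep the smallest prefix whose probability mass reaches top_p.
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    # random.choices normalizes the remaining weights for us.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

print(sample_next([1.0, 3.0, 2.0], temperature=0))  # 1
```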
- Speculative Decoding (the speed trick)
A small draft model guesses a few tokens ahead (say 4), and the big model verifies them in a single parallel pass. Accept the ones that match, discard the rest. This is how you get fast responses without a bigger GPU.
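The accept/reject loop can be sketched as follows. This is the simplified greedy variant, where a guess is accepted only if it equals the big model's own top choice; sampling-based versions use a probabilistic acceptance test instead. `draft_next` and `target_next` are hypothetical callables standing in for the two models:

```python
def speculative_step(prefix, draft_next, target_next, n_draft=4):
    """One round of greedy speculative decoding.

    draft_next / target_next: hypothetical callables mapping a list of
    token IDs to the next token ID under greedy decoding.
    """
    # 1. The cheap model proposes n_draft tokens, one after another.
    proposal = []
    for _ in range(n_draft):
        proposal.append(draft_next(prefix + proposal))

    # 2. The big model checks each position. On real hardware these
    #    checks come from one batched forward pass; here it is a loop.
    accepted = []
    for guess in proposal:
        expected = target_next(prefix + accepted)
        if guess != expected:
            accepted.append(expected)  # keep the big model's token, stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the same pass yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models: the draft agrees with the target for two tokens, then drifts.
target = lambda toks: len(toks) % 5
draft = lambda toks: len(toks) % 5 if len(toks) < 3 else 9
print(speculative_step([0], draft, target))  # [1, 2, 3]
```

The output always matches what the big model would have produced alone; the draft model only changes how many tokens you get per expensive pass.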
- Detokenizer
Numbers → back into readable text.
- Streaming Output
Tokens appear one by one. That "typing" effect isn't a UI animation. It's the model literally thinking one token at a time.
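In code, streaming falls out naturally from the decode loop. A minimal sketch using a Python generator, where `next_token` is a hypothetical stand-in for one decode step of the model and `eos_id` is an assumed end-of-sequence marker:

```python
def stream_tokens(prompt_ids, next_token, eos_id, max_new_tokens=50):
    """Yield each token ID as soon as it is decided."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)  # one decode step
        if tok == eos_id:
            return             # the model signalled it is done
        ids.append(tok)
        yield tok              # hand it to the caller immediately

# Toy "model": emits the current length until it reaches 4, then stops.
toy_model = lambda ids: len(ids) if len(ids) < 4 else 0
for tok in stream_tokens([1], toy_model, eos_id=0):
    print(tok)
```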
Here's the uncomfortable truth:
The people shipping real AI products in 2026 aren't memorizing this for interviews.
They understand it because every one of these stages is a lever for cost, latency, quality, and scale.
Miss the pipeline, and you'll forever be guessing why your app is slow, expensive, or hallucinating.
Understand it, and you stop being someone who calls the model and become someone who engineers it.
The window where this knowledge is rare is closing fast.