EP09 · 8 min

LLM mechanics: tokens, context window, next-token prediction, temperature/top-p

Understand how LLM text generation works and how decoding controls affect outputs.

Simple definition
LLMs generate text by repeatedly predicting the next token.
Precise definition
Autoregressive language models estimate a conditional distribution over the next token given a finite context and sample from it under decoding constraints.

Objective

You should be able to tune model behavior intentionally instead of by guesswork.

Generation loop

The model reads prompt tokens, predicts a probability distribution for the next token, samples one, appends it, and repeats.

This explains why outputs can drift: each chosen token changes the future context.
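A minimal sketch of that loop in Python, using a toy next_token_probs stand-in instead of a real model; the vocabulary and probabilities are illustrative only:

    import numpy as np

    # Toy vocabulary and a stand-in for the model's next-token distribution.
    # In a real LLM this comes from a forward pass over the whole context.
    VOCAB = ["refund", "policy", "ships", "in", "3", "days", "."]

    def next_token_probs(context_tokens):
        # Illustrative only: a real model returns logits conditioned on the context.
        rng = np.random.default_rng(len(context_tokens))
        logits = rng.normal(size=len(VOCAB))
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def generate(prompt_tokens, max_tokens=5):
        context = list(prompt_tokens)
        for _ in range(max_tokens):
            probs = next_token_probs(context)                  # 1. predict a distribution
            token_id = np.random.choice(len(VOCAB), p=probs)   # 2. sample one token
            context.append(VOCAB[token_id])                    # 3. append it and repeat
        return context

    print(generate(["customer", "asks", ":"]))

Each sampled token becomes part of the context for the next prediction, which is exactly where drift comes from.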

Key controls

  • Temperature: rescales the distribution; low values sharpen it toward the most likely tokens, high values flatten it.
  • Top-p (nucleus sampling): keeps only the smallest set of tokens whose cumulative probability reaches p and samples within that set (see the sketch after this list).
  • Max tokens: caps response length.
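A minimal sketch of how temperature and top-p reshape a next-token distribution before sampling; the logits here are made up for illustration:

    import numpy as np

    def sample(logits, temperature=1.0, top_p=1.0):
        # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()

        # Top-p (nucleus): keep the smallest set of tokens whose cumulative
        # probability reaches top_p, renormalize, and sample within it.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        kept_probs = probs[keep] / probs[keep].sum()
        return np.random.choice(keep, p=kept_probs)

    # Made-up logits for a 5-token vocabulary.
    logits = [2.0, 1.5, 0.3, -1.0, -2.0]
    print(sample(logits, temperature=0.2, top_p=0.9))  # conservative
    print(sample(logits, temperature=1.2, top_p=0.9))  # more varied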

Context window

The model can only condition on text inside its window. If crucial facts are outside, it cannot reliably use them.

For support automation, this means retrieval and summarization strategy matter as much as prompt wording.
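One common strategy is a token budget that always keeps the system prompt and then adds the most recent conversation turns until the budget runs out. This is a minimal sketch with a stand-in count_tokens helper; a production system would count with the model's own tokenizer:

    def count_tokens(text):
        # Stand-in: real systems use the model's tokenizer, not word counts.
        return len(text.split())

    def fit_to_window(system_prompt, history, budget=4000):
        # Keep the system prompt, then add the most recent turns until the
        # budget is exhausted. Older turns fall out of the window first.
        used = count_tokens(system_prompt)
        kept = []
        for turn in reversed(history):
            cost = count_tokens(turn)
            if used + cost > budget:
                break
            kept.append(turn)
            used += cost
        return [system_prompt] + list(reversed(kept))

    history = ["Customer: where is my order?",
               "Agent: it ships in 3 days.",
               "Customer: what is the refund policy?"]
    print(fit_to_window("You are a support assistant.", history, budget=20))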

Worked example (online store)

A customer asks about the refund policy, their shipment status, and a coupon exception in one message. If you paste too much conversation history, the policy text can fall out of the context window. The answer then sounds fluent but misses key policy details.

Practical defaulting

For deterministic business responses, start with a lower temperature and structured output constraints.

For brainstorming copy variants, increase temperature moderately.
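The presets below illustrate the kind of settings this implies. The parameter names follow common sampling APIs, and the values are assumptions to tune against your own evaluations, not provider defaults:

    # Illustrative decoding presets; names and values are assumptions to tune
    # against your own evaluation data, not provider-specific defaults.
    PRESETS = {
        "deterministic_support_reply": {
            "temperature": 0.2,   # near-greedy: consistent, policy-faithful wording
            "top_p": 0.9,
            "max_tokens": 300,
        },
        "brainstorm_copy_variants": {
            "temperature": 0.9,   # more spread: varied phrasing and ideas
            "top_p": 0.95,
            "max_tokens": 150,
        },
    }

The point is that one model can serve both tasks; only the decoding settings change.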

Three takeaways

  • Decoding settings are product controls.
  • Context management is reliability engineering.
  • Next-token generation explains both power and fragility.

Visual stage: token generation loop

Tap each step in the autoregressive cycle.

Step Insight

Prompt text is converted into token IDs, which are model-readable units.
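A minimal sketch of that conversion using the tiktoken library, assuming it is installed; cl100k_base is one commonly used encoding:

    import tiktoken

    # Convert prompt text into token IDs; the IDs, not the words, are what
    # the model actually conditions on.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("Where is my refund?")
    print(ids)              # a short list of integers
    print(enc.decode(ids))  # round-trips back to the original text
    print(len(ids))         # token count, not word count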

Common traps
  • Treating high-confidence tone as evidence of correctness.
  • Over-increasing temperature when precision is required.
  • Ignoring context window limits when prompts are long.
Three takeaways
  • Tokens are model units, not words.
  • Context window is a hard operational constraint.
  • Sampling settings trade creativity against consistency.