Objective
After this section, you should be able to tune model behavior deliberately rather than by guesswork.
Generation loop
The model reads prompt tokens, predicts a probability distribution for the next token, samples one, appends it, and repeats.
This explains why outputs can drift: each chosen token changes the future context.
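To make the loop concrete, here is a minimal sketch in Python. The `model.next_token_probs` interface is hypothetical, standing in for whatever model or API you actually call.

```python
# A minimal sketch of the generation loop. `model.next_token_probs` is a
# hypothetical interface returning a dict of token -> probability.
import random

def generate(model, prompt_tokens, max_tokens=50, eos_token=0):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        # The model conditions on everything generated so far.
        probs = model.next_token_probs(tokens)
        next_token = random.choices(
            list(probs.keys()), weights=list(probs.values())
        )[0]
        tokens.append(next_token)  # this choice changes all future context
        if next_token == eos_token:
            break
    return tokens
```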
Key controls
- Temperature: rescales the logits before sampling; values below 1 sharpen the distribution, values above 1 flatten it (see the sampling sketch after this list).
- Top-p (nucleus sampling): restricts sampling to the smallest set of tokens whose cumulative probability reaches p.
- Max tokens: caps response length.
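The sketch below shows how temperature and top-p reshape a next-token distribution before sampling, assuming you have raw logits to work with; the tokens and numbers are illustrative only.

```python
# Illustrative temperature + top-p sampling over raw logits.
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / total for tok, v in scaled.items()}

    # Top-p keeps the smallest set of tokens whose cumulative probability
    # reaches top_p, then samples only within that set.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, weights = zip(*kept.items())
    return random.choices(tokens, weights=weights)[0]

print(sample({"yes": 2.0, "no": 1.0, "maybe": 0.1}, temperature=0.7, top_p=0.9))
```

A low temperature with a tight top-p behaves close to greedy decoding; raising either produces more varied output.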
Context window
The model can only condition on text inside its window. If crucial facts fall outside the window, the model never sees them and cannot use them.
For support automation, this means retrieval and summarization strategy matter as much as prompt wording.
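A rough way to see the constraint is to treat the window as a fixed token budget and check what fits. In this sketch the whitespace word count is a crude stand-in for a real tokenizer, and the budget number is an assumption.

```python
# Context-window budgeting sketch; word counts approximate tokens.
CONTEXT_BUDGET = 4000  # assumed window size in tokens

def fits_in_context(documents, budget=CONTEXT_BUDGET):
    used = 0
    included, dropped = [], []
    for doc in documents:
        cost = len(doc.split())          # crude token estimate
        if used + cost <= budget:
            included.append(doc)
            used += cost
        else:
            dropped.append(doc)          # the model will never see these
    return included, dropped
```

Anything that ends up in `dropped` is invisible to the model, no matter how important it is.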
Worked example (online store)
A customer asks about the refund policy, their shipment status, and a coupon exception. If you paste too much conversation history into the prompt, the policy text can fall outside the context window. The answer then sounds fluent but misses key policy details.
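One sketch of a fix for this scenario: pack must-have facts (the policy text and the question) first, then fill the remaining space with the newest chat turns. Function and parameter names here are illustrative, not a specific framework's API.

```python
# Prioritized context packing: policy text first, then recent chat turns.
def build_prompt(policy_text, chat_history, question, budget=4000):
    def cost(text):
        return len(text.split())         # crude token estimate

    parts = [policy_text, question]      # must-have facts go in first
    remaining = budget - sum(cost(p) for p in parts)

    recent = []
    for turn in reversed(chat_history):  # newest turns first
        if cost(turn) > remaining:
            break
        recent.append(turn)
        remaining -= cost(turn)

    return "\n".join([policy_text] + list(reversed(recent)) + [question])
```

This keeps the policy text in the window even when the conversation is long; summarizing the dropped turns is the natural next refinement.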
Practical defaults
For consistent, policy-faithful business responses, start with a low temperature and constrain the output structure (for example, a fixed template or schema).
For brainstorming copy variants, increase temperature moderately.
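As a starting point, the presets below capture those two defaults. The parameter names (temperature, top_p, max_tokens) assume an API that exposes them, which is common but not universal; the numbers are illustrative and should be tuned against your own evaluations.

```python
# Illustrative decoding presets; parameter names and values are assumptions.
PRESETS = {
    "deterministic_support_reply": {
        "temperature": 0.2,   # near-greedy: consistent, policy-faithful wording
        "top_p": 0.9,
        "max_tokens": 300,
    },
    "brainstorm_copy_variants": {
        "temperature": 0.9,   # flatter distribution: more varied phrasing
        "top_p": 0.95,
        "max_tokens": 150,
    },
}
```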
Three takeaways
- Decoding settings are product controls.
- Context management is reliability engineering.
- Next-token generation explains both power and fragility.