Why this matters
Once you understand how a model “reads” your prompt and “writes” its response, every other lesson in this course makes more sense. This module gives you the mental model that powers good prompting.
Tokens — the AI’s unit of language
AI models don’t see words. They see tokens. A token is roughly:
- A short word (“cat” = 1 token)
- A piece of a longer word (for example, “understanding” might split into “under” + “stand” + “ing” — the exact split depends on the tokenizer)
- A punctuation mark or space
A rough rule: 1 token ≈ 0.75 words. So 1,000 tokens ≈ 750 words ≈ a long email.
Why this matters:
- Pricing is usually per-token (when you use the API)
- Context windows (how much the AI can “see”) are measured in tokens
- Every word in your prompt costs something — be clear and concise.
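The 0.75-words-per-token rule above can be turned into a quick estimator. This is only a ballpark heuristic, not a real tokenizer (tools like OpenAI's tiktoken give exact counts for a specific model):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from word count, using 1 token ~ 0.75 words.

    A heuristic for budgeting prompts, not an exact count.
    """
    word_count = len(text.split())
    return round(word_count / 0.75)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```

Nine words works out to roughly twelve tokens, which matches the rule in reverse: 1,000 tokens covers about 750 words.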
Context windows
The context window is the maximum amount of text a model can process at once — your prompt plus its response, plus any uploaded files or chat history.
Modern context windows (as of 2026):
- ChatGPT: ~128K-200K tokens (~100-150K words)
- Claude: 200K tokens (~150K words), with 1M for some models
- Gemini: 1M-2M tokens (~750K-1.5M words)
What this means practically:
- You can paste an entire book into Claude or Gemini and ask questions about it
- Once you exceed the window, the model starts “forgetting” the earliest parts of the conversation
- Long conversations eventually drift — start fresh when output quality drops
A practical habit: when a conversation gets long or you’ve finished a major task, either condense it or start fresh.
- Condense — ask the model to summarize the conversation so far in a few bullets, then paste that summary into a new chat as the starting context. Keeps continuity, drops the bloat.
- Clear — start a fresh conversation for each new task. Cheapest, sharpest output. Use this when the new task doesn’t need history.
Old context bloats every new prompt with the model’s earlier reasoning — which costs more tokens and increases drift. A clean (or condensed) conversation produces sharper output.
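To make the budgeting concrete, here's a minimal sketch of the arithmetic a long conversation runs up against. The window size and reply reserve are illustrative numbers, and `fits_in_window` is a hypothetical helper, not any provider's API:

```python
CONTEXT_WINDOW = 200_000  # illustrative: roughly Claude's window, in tokens

def fits_in_window(prompt_tokens: int, history_tokens: int,
                   reserve_for_reply: int = 4_000) -> bool:
    """Check whether prompt + chat history still leaves room for the reply.

    The window must hold everything: your prompt, the accumulated history,
    and the response the model is about to write.
    """
    return prompt_tokens + history_tokens + reserve_for_reply <= CONTEXT_WINDOW

print(fits_in_window(1_000, 2_000))      # plenty of room
print(fits_in_window(100_000, 99_000))   # over budget: time to condense or clear
```

When the second check fails, that's exactly the moment to condense or clear as described above.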
How models are trained (high level)
You don’t need to be technical, but a quick mental model helps. Training happens in stages — pretraining is the heavy lift, and the next two stages are layered on top of the pretrained model afterward to shape behavior:
- Pretraining (the main training) — The model reads a huge portion of the internet (books, articles, websites, code). It learns patterns: which words tend to follow which, how arguments are structured, how code is written. This takes months and millions of dollars. At the end of pretraining, you have a “base model” — knowledgeable but raw, not yet useful as a chatbot.
- Fine-tuning (post-training) — The base model is then refined for specific tasks (chat, coding, safety). This shapes its behavior — making it helpful, harmless, and honest. It happens after pretraining, on top of the existing model.
- RLHF (Reinforcement Learning from Human Feedback) (post-training) — Humans rate model outputs. The model learns to produce responses humans prefer. Also layered on after pretraining. This is why modern AI feels conversational instead of robotic.
The model you talk to is the result of all three stages. It doesn’t learn from what you tell it — its knowledge is frozen at its training cutoff date.
Why prompt structure changes output quality
Because the model predicts the next token based on everything before it, the way you structure your prompt directly shapes the answer.
A prompt like “tell me about marketing” gives the model almost no signal. Tens of thousands of marketing-related sentences could plausibly follow. The model picks something generic.
A prompt like “Explain the difference between content marketing and paid advertising for a small bakery in 200 words, in plain language, with one example each” narrows the possibility space dramatically. The model has clear constraints and produces a focused, useful answer.
Specificity is leverage. Every detail you add eliminates entire categories of bad responses.
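If you build prompts in code, one way to make that specificity a habit is a small template that forces you to fill in the constraints every time. Everything here is a hypothetical sketch, not any library's API:

```python
def build_prompt(task: str, audience: str, length_words: int, style: str) -> str:
    """Assemble a prompt from explicit constraints.

    Each field (audience, length, style) eliminates a whole category of
    vague responses the model could otherwise produce.
    """
    return (
        f"{task} for {audience}, "
        f"in about {length_words} words, "
        f"{style}."
    )

print(build_prompt(
    "Explain the difference between content marketing and paid advertising",
    "a small bakery owner",
    200,
    "in plain language with one example of each",
))
```

The template is trivial; the point is the discipline. Leaving a field blank is immediately visible, the same way a vague prompt should feel incomplete.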
The two modes — instant vs thinking
Modern models like ChatGPT-5.1, Claude Opus, and Gemini 3 have two modes:
Instant — Fast default response. Best for:
- Quick questions
- Drafting and brainstorming
- Casual conversation
- Tasks where speed matters more than depth
Thinking (also called “Reasoning” or “Extended Thinking”) — The model spends more compute internally before answering. Best for:
- Complex analysis
- Math and logic problems
- Multi-step planning
- Long documents that need careful reading
Important: More thinking isn’t always better. For simple tasks, thinking mode can produce worse results because the model overcomplicates things. Match the mode to the task.
Key takeaways
- AI sees tokens, not words — be efficient but clear
- Context windows are huge in 2026, but conversations still drift over time
- Models are frozen at their training cutoff — they don’t learn from you
- Prompt structure directly shapes output quality
- Use Instant for daily tasks, Thinking for big decisions
Quick Check
1. A token is best described as:
2. Roughly, 1,000 tokens equals about how many words?
3. Which stage of training happens first and is the heaviest lift?
4. Why does prompt structure change output quality so much?
5. When should you use “Thinking” / Reasoning mode?