Why this matters
Once you understand how a model “reads” your prompt and “writes” its response, every other lesson in this course makes more sense. This module gives you the mental model that powers good prompting.
Tokens — the AI’s unit of language
AI models don’t see words. They see tokens. A token is roughly:
- A short word (“cat” = 1 token)
- A piece of a longer word (for example, “understanding” might split into “under” + “stand” + “ing” — the exact split depends on the tokenizer)
- A punctuation mark or space
A rough rule: 1 token ≈ 0.75 words. So 1,000 tokens ≈ 750 words ≈ a long email.
Why this matters:
- Pricing is usually per-token (when you use the API)
- Context windows (how much the AI can “see”) are measured in tokens
- Every word in your prompt costs something — be clear and concise.
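The 0.75-words-per-token rule above can be turned into a quick estimator. This is only a ballpark heuristic, not a real tokenizer (tools like OpenAI's tiktoken give exact counts for a specific model):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from word count, using 1 token ~ 0.75 words.

    A heuristic for budgeting prompts, not an exact count.
    """
    word_count = len(text.split())
    return round(word_count / 0.75)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```

Nine words works out to roughly twelve tokens, which matches the rule in reverse: 1,000 tokens covers about 750 words.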
Context windows
The context window is the maximum amount of text a model can process at once — your prompt plus its response, plus any uploaded files or chat history.
Modern context windows (as of 2026):
- ChatGPT: ~128K-200K tokens (~100-150K words)
- Claude: 200K tokens (~150K words), with 1M for some models
- Gemini: 1M-2M tokens (~750K-1.5M words)
What this means practically:
- You can paste an entire book into Claude or Gemini and ask questions about it
- Once you exceed the window, the model starts “forgetting” the earliest parts of the conversation
- Long conversations eventually drift — start fresh when output quality drops
A practical habit: when a conversation gets long or you’ve finished a major task, either condense it or start fresh.
- Condense — ask the model to summarize the conversation so far in a few bullets, then paste that summary into a new chat as the starting context. Keeps continuity, drops the bloat.
- Clear — start a fresh conversation for each new task. Cheapest, sharpest output. Use this when the new task doesn’t need history.
Old context bloats every new prompt with the model’s earlier reasoning — which costs more tokens and increases drift. A clean (or condensed) conversation produces sharper output.
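To make the budgeting concrete, here's a minimal sketch of the arithmetic a long conversation runs up against. The window size and reply reserve are illustrative numbers, and `fits_in_window` is a hypothetical helper, not any provider's API:

```python
CONTEXT_WINDOW = 200_000  # illustrative: roughly Claude's window, in tokens

def fits_in_window(prompt_tokens: int, history_tokens: int,
                   reserve_for_reply: int = 4_000) -> bool:
    """Check whether prompt + chat history still leaves room for the reply.

    The window must hold everything: your prompt, the accumulated history,
    and the response the model is about to write.
    """
    return prompt_tokens + history_tokens + reserve_for_reply <= CONTEXT_WINDOW

print(fits_in_window(1_000, 2_000))      # plenty of room
print(fits_in_window(100_000, 99_000))   # over budget: time to condense or clear
```

When the second check fails, that's exactly the moment to condense or clear as described above.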
How models are trained (high level)
You don’t need to be technical, but a quick mental model helps. Training happens in stages — pretraining is the heavy lift, and the next two stages are layered on top of the pretrained model afterward to shape behavior:
- Pretraining (the main training) — The model reads a huge portion of the internet (books, articles, websites, code). It learns patterns: which words tend to follow which, how arguments are structured, how code is written. This takes months and millions of dollars. At the end of pretraining, you have a “base model” — knowledgeable but raw, not yet useful as a chatbot.
- Fine-tuning (post-training) — The base model is then refined for specific tasks (chat, coding, safety). This shapes its behavior — making it helpful, harmless, and honest. It happens after pretraining, on top of the existing model.
- RLHF (Reinforcement Learning from Human Feedback) (post-training) — Humans rate model outputs. The model learns to produce responses humans prefer. Also layered on after pretraining. This is why modern AI feels conversational instead of robotic.
The model you talk to is the result of all three stages. It doesn’t learn from what you tell it — its knowledge is frozen at its training cutoff date.
Why prompt structure changes output quality
Because the model predicts the next token based on everything before it, the way you structure your prompt directly shapes the answer.
A prompt like “tell me about marketing” gives the model almost no signal. Tens of thousands of marketing-related sentences could plausibly follow. The model picks something generic.
A prompt like “Explain the difference between content marketing and paid advertising for a small bakery in 200 words, in plain language, with one example each” narrows the possibility space dramatically. The model has clear constraints and produces a focused, useful answer.
Specificity is leverage. Every detail you add eliminates entire categories of bad responses.
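If you build prompts in code, one way to make that specificity a habit is a small template that forces you to fill in the constraints every time. Everything here is a hypothetical sketch, not any library's API:

```python
def build_prompt(task: str, audience: str, length_words: int, style: str) -> str:
    """Assemble a prompt from explicit constraints.

    Each field (audience, length, style) eliminates a whole category of
    vague responses the model could otherwise produce.
    """
    return (
        f"{task} for {audience}, "
        f"in about {length_words} words, "
        f"{style}."
    )

print(build_prompt(
    "Explain the difference between content marketing and paid advertising",
    "a small bakery owner",
    200,
    "in plain language with one example of each",
))
```

The template is trivial; the point is the discipline. Leaving a field blank is immediately visible, the same way a vague prompt should feel incomplete.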
The two modes — instant vs thinking
Modern models like ChatGPT-5.1, Claude Opus, and Gemini 3 have two modes:
Instant — Fast default response. Best for:
- Quick questions
- Drafting and brainstorming
- Casual conversation
- Tasks where speed matters more than depth
Thinking (also called “Reasoning” or “Extended Thinking”) — The model spends more compute internally before answering. Best for:
- Complex analysis
- Math and logic problems
- Multi-step planning
- Long documents that need careful reading
Important: More thinking isn’t always better. For simple tasks, thinking mode can produce worse results because the model overcomplicates things. Match the mode to the task.
Key takeaways
- AI sees tokens, not words — be efficient but clear
- Context windows are huge in 2026, but conversations still drift over time
- Models are frozen at their training cutoff — they don’t learn from you
- Prompt structure directly shapes output quality
- Use Instant for daily tasks, Thinking for big decisions
Quick Check
1. A token is best described as:
2. Roughly, 1,000 tokens equals about how many words?
3. Which stage of training happens first and is the heaviest lift?
4. Why does prompt structure change output quality so much?
5. When should you use “Thinking” / Reasoning mode?