A long-form, animated walkthrough of how artificial intelligence evolved over seventy years — and how today's large language models actually think, call tools, and run on either a remote datacenter or the laptop on your desk.
70 Years of AI
4 Paradigm Shifts
1.8T+ Parameters (Frontier)
∞ Tools Reachable
Chapter 01
Seventy Years in Two Minutes
AI did not begin with ChatGPT. It is a slow accumulation of three ideas — symbolic reasoning, learning from data, and scale — punctuated by a few moments where everything changed at once.
1956
Dartmouth Workshop
McCarthy, Minsky, Shannon and Rochester coin the term "artificial intelligence" and set the field in motion.
1958
Perceptron
Rosenblatt builds the first learning neural network — a single-layer linear classifier on custom hardware.
1980s
Expert Systems
Hand-coded rule engines (XCON, MYCIN) automate niche professional reasoning. Brittle, but commercially real.
1986
Backpropagation
Rumelhart, Hinton and Williams popularize the algorithm that lets multi-layer networks actually learn.
1997
Deep Blue beats Kasparov
IBM's search-based engine wins a 6-game match. Brute force + heuristics, not learning — but a public watershed.
2012
AlexNet
Krizhevsky, Sutskever and Hinton crush ImageNet with a deep CNN on two GPUs. Modern deep learning era opens.
2014
GANs · Seq2Seq
Generative adversarial networks (Goodfellow) and encoder-decoder translation models redefine generation.
2017
Attention Is All You Need
Vaswani et al. publish the Transformer. Self-attention replaces recurrence — every modern LLM descends from this.
2018
BERT · GPT-1
Pretraining on raw text becomes the dominant recipe. Language models stop being task-specific.
2020
GPT-3
175B parameters. Few-shot prompting works. Scaling laws (Kaplan et al.) suggest the ride is far from over.
2022
ChatGPT
RLHF turns GPT-3.5 into a usable assistant. 100M users in 2 months — fastest consumer adoption in history.
2023
GPT-4 · Llama 2
Multimodal frontier closed models and the first competitive open-weights family ship within months of each other.
2024
Tool-Use & Agents
Function calling, MCP, computer-use. LLMs stop being chat boxes and start operating real software.
2025–26
Reasoning Models
o-series, Claude, Gemini reasoning variants spend inference compute on chain-of-thought. Local 70B-class models match 2023 frontier.
14 milestones
Chapter 02
What Is a Neural Network?
A neural network is a graph of weighted multiplications and non-linear squashing functions. That's it. Everything else — vision, language, reasoning — is what emerges when you stack enough of them and feed them enough data.
01
Weights
Each connection carries a number. Training adjusts those numbers — billions of them — so the output gets closer to the right answer.
02
Activation
Each node sums its inputs and squashes the result through a non-linear function (ReLU, GELU). Without that step, the whole network collapses to a line.
03
Backprop
The error at the output flows backward through the graph. Each weight learns how much it contributed and nudges itself in the right direction.
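The weighted-sum-then-squash step is small enough to write out. A toy neuron as a sketch, with invented weights and no library API assumed:

```python
# A single artificial neuron: weighted sum of inputs, then a non-linear
# "squash" (here ReLU). All numbers are made up for illustration.
def relu(x: float) -> float:
    return max(0.0, x)

def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(s)

# Without relu() this is just a linear function; stacking linear layers
# still gives a line, which is why the non-linearity matters.
print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))  # 0.5 - 0.5 + 0.1 = 0.1
```

Training is then a matter of nudging `weights` and `bias` so the printed number moves toward the target.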
Chapter 03
How an LLM Actually Thinks
A large language model does exactly one thing: it predicts the next token. Everything else — answering questions, writing code, holding a conversation — is a side effect of doing that very, very well.
01
Tokenization
Text → numbers
Your sentence is sliced into ~50,000-piece vocabulary chunks called tokens — sometimes whole words, often sub-words. Each token becomes an integer ID the model can address.
input: "The cat sat on the mat" → The · cat · sat · on · the · mat
ids: 464 · 3797 · 3332 · 319 · 262 · 2603
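A sketch of that lookup with a tiny hand-made vocabulary. A real BPE tokenizer learns its ~50,000 pieces from data; the IDs below are just the ones from the example above:

```python
# Toy tokenizer sketch: greedy longest-prefix lookup over a hand-made
# vocabulary. Real tokenizers learn these pieces (and their merges) from data.
VOCAB = {"The": 464, " cat": 3797, " sat": 3332, " on": 319, " the": 262, " mat": 2603}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    ids, rest = [], text
    while rest:
        # longest matching piece wins, like BPE's final lookup step
        piece = max((p for p in vocab if rest.startswith(p)), key=len)
        ids.append(vocab[piece])
        rest = rest[len(piece):]
    return ids

print(encode("The cat sat on the mat", VOCAB))  # [464, 3797, 3332, 319, 262, 2603]
```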
02
Embeddings
Numbers → meaning vectors
Each token ID is looked up in a giant matrix and becomes a vector — typically 4,096 to 16,384 numbers. Positions in that high-dimensional space encode meaning: 'king' and 'queen' end up near each other, and far from 'sandwich'.
The · cat · sat · on · the · mat — each token becomes a vector of dim 4096
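A minimal sketch of "positions encode meaning", using invented 4-dimensional vectors instead of 4,096:

```python
import math

# Toy "embeddings": invented so that related words point the same way.
emb = {
    "king":     [0.9, 0.8, 0.1, 0.3],
    "queen":    [0.8, 0.9, 0.1, 0.2],
    "sandwich": [0.0, 0.1, 0.9, 0.7],
}

def cosine(a: list[float], b: list[float]) -> float:
    # cosine similarity: dot product divided by the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["sandwich"])
```

The assertion is the whole idea: "king" sits closer to "queen" than to "sandwich", and everything built on embeddings reduces to comparisons like this.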
03
Self-Attention
Every token reads every other token
The Transformer's core trick. For each token, the model computes how much attention to pay to every previous token. That's how it knows which 'it' refers to which 'cat', or that a function argument relates to a return type 200 lines away.
attention heatmap · row = current token · column = what it looks at
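The look-back-and-weigh idea behind that heatmap can be written out directly. A minimal causal self-attention sketch in plain Python, skipping the learned query/key/value projections a real Transformer has:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(vectors: list[list[float]]) -> list[list[float]]:
    # Single head, no projections: each vector is its own query, key, value.
    d = len(vectors[0])
    out = []
    for i, q in enumerate(vectors):
        # score the current token against itself and every earlier token (causal)
        scores = [sum(qx * kx for qx, kx in zip(q, k)) / math.sqrt(d)
                  for k in vectors[: i + 1]]
        weights = softmax(scores)
        # the output is a weighted average of the attended vectors
        out.append([sum(w * v[j] for w, v in zip(weights, vectors[: i + 1]))
                    for j in range(d)])
    return out

toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(toy)
# the first token can only attend to itself, so its output is unchanged
```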
04
Sampling
A probability distribution → one word
The final layer produces a probability for every token in the vocabulary. The model either picks the most likely (greedy), or samples with a temperature setting that controls how 'creative' it is. Then it loops — that one new token becomes input to the next prediction.
mat 42% · floor 21% · chair 13% · roof 4%
→ next token: "mat" · loop and continue
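A sketch of that sampling step, with invented logits for the candidates above:

```python
import math, random

candidates = ["mat", "floor", "chair", "roof"]
logits     = [2.0,   1.3,     0.8,     -0.4]   # invented raw scores

def sample(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    if temperature == 0:
        # greedy: always the most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # draw one index according to the distribution
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

print(candidates[sample(logits, temperature=0)])  # greedy always picks "mat"
```

Higher temperatures flatten the distribution, so "floor" and "chair" start winning draws; temperature 0 is deterministic.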
Chapter 04 · Deep Dive
Words Become Coordinates
Inside a model, every word, sentence, image, and snippet of code is a point in a high-dimensional space. Similar things land near each other. Most of what makes AI feel smart is just clever geometry on those points.
embedding space · 2D projection of ~3,072 dimensions · t-SNE / UMAP
The classic Word2Vec demonstration. The direction from man to king is roughly the same as the direction from woman to queen — the model has learned an axis for "royalty" without ever being told the word.
cosine similarity · −1 to 1
king ↔ queen
0.86
king ↔ pizza
0.07
dog ↔ wolf
0.74
paris ↔ tokyo
0.62
code ↔ compiler
0.81
01
Hundreds of dimensions
Real embeddings live in 1,024–3,072 dimensions, not two. Each axis encodes some learned aspect of meaning — gender, formality, animacy, intent. We can only draw two of them, but the model uses all of them at once.
02
Distance = meaning
Cosine similarity between two vectors is how a model judges relatedness. Nearest-neighbor search over millions of embeddings is how RAG, semantic search, recommendation, and de-duplication all work under the hood.
03
Same trick for everything
Embed images, audio, code, even DNA — the same vector space lets you search across modalities. CLIP famously embedded text and pictures into a shared space, which is why you can search photos with a sentence.
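That nearest-neighbour search is a few lines once embeddings exist. A toy sketch of the retrieval step behind RAG and semantic search, with invented 3-dimensional document vectors:

```python
import math

# Invented document embeddings; real ones come from an embedding model.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.8, 0.2, 0.1],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query: list[float], docs: dict, k: int = 2) -> list[str]:
    # rank every document by similarity to the query, keep the best k
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.0, 0.0], docs))  # ['doc_a', 'doc_c']
```

Production systems swap the linear scan for an approximate index, but the ranking criterion is the same.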
Chapter 05
How the Model Became Smart
A frontier LLM is not trained in one pass — it is four training stages stacked on top of each other. Most of the cost lives in the first stage; most of the personality lives in the last three.
01
Pretraining
A trillion tokens of the internet
Predict the next word — over and over — across books, code, papers, and web pages. After ~10²⁵ FLOPs the model picks up grammar, facts, reasoning patterns, and the structure of dozens of languages without ever being told what any of them are.
~15T tokens
~6 months on 25,000 GPUs
02
Supervised fine-tuning
Show, don’t tell
Hand-written instruction → response pairs teach the base model what a helpful answer looks like. Now it stops auto-completing the prompt and starts addressing it.
~100K – 1M pairs
humans + curated demonstrations
03
RLHF
Humans rank, the model learns the ranking
For each prompt, generate two answers. Ask a human which is better. Train a reward model on those preferences, then use reinforcement learning to push the LLM toward higher-rewarded outputs.
preference data
reward model + PPO / DPO
04
Constitutional / RLAIF
The model critiques itself
Replace most of the human raters with another AI guided by a written constitution — a list of principles the model should respect. Faster, cheaper, and the rules are auditable text instead of a frozen reward model.
+ written principles
Anthropic’s approach
pretraining loss · loss vs tokens seen
Loss falls fast in the first trillion tokens, then slows to a grind. The last few percent of capability cost more compute than everything before them.
RLHF · one preference round
response A
Sure! Here's a list of three reasons, with citations and a short summary at the end.
▲ chosen
response B
Yeah whatever, here you go.
▼ rejected
reward model
+0.78
Multiply this round by millions of preference comparisons and you have a numerical model of taste — strong enough to steer a 100B+ parameter network.
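The round above is typically turned into a training signal with a Bradley–Terry style loss: the reward model is pushed to score the chosen response above the rejected one. A sketch with invented scores:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): small when the chosen response
    # already out-scores the rejected one, large when it doesn't
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

close = preference_loss(0.78, 0.60)   # reward model barely prefers the winner
wide  = preference_loss(0.78, -2.0)   # reward model strongly prefers the winner
assert wide < close                   # a bigger score gap means a smaller loss
```

Averaged over millions of comparisons, minimising this loss is what turns human rankings into a numerical model of taste.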
Scale vs alignment
Bigger pretraining gives a smarter base model — but a smarter model with no alignment is a worse product, not a better one. The frontier is in the post-training, not the parameter count.
Helpfulness vs safety
Push too hard for helpful and the model will help with anything. Push too hard for safe and it refuses to write a poem. Modern training treats this as a multi-objective optimisation, not a single dial.
Why models still hallucinate
Pretraining rewards plausible text, not true text. Post-training reduces hallucinations but never eliminates them — the model still has no built-in notion of uncertainty unless it was trained explicitly to express one.
Chapter 06
Thinking, Out Loud
A reasoning model is a regular LLM that has been trained to generate a long internal monologue before its final answer. It buys correctness with tokens — and sometimes that trade is the only one that works.
vanilla LLM0.4s
prompt: Three boxes are each mislabeled. One has apples, one oranges, one mixed. How many fruits must you draw to label them all correctly?
answer
Two — one from each labeled box.
wrong
reasoning model~12s · 4,200 thinking tokens
prompt: Three boxes are each mislabeled. One has apples, one oranges, one mixed. How many fruits must you draw to label them all correctly?
thinking trace
›Let me re-read the puzzle.
›There are 3 boxes. Each is mislabeled.
›I pick from the box labeled "mixed".
›If I draw an apple, that box must be "apples".
›Then the "apples" label is wrong, so it’s "oranges".
›And the "oranges" label is wrong, so it’s "mixed".
›One draw is enough. Verifying… consistent.
answer
One. Drawing from the box labeled "mixed" is enough.
correct
tree of thought · explore · prune · commit
Some training recipes encourage the model to fan out — try several short hypotheses, evaluate each, prune the bad ones, then commit. The visible answer is the survivor of a tournament the user never sees.
explored · kept · committed
Where reasoning helps
Math, formal logic, multi-step coding, debugging, planning. Anything where one wrong sub-step poisons the rest of the answer benefits from being able to backtrack.
The latency tax
A reasoning model can spend 10–60 seconds (and 5–20× the tokens) before its first visible output. Worth it for a hard answer; pure overhead for "what time is it in Paris".
The 2026 lineup
OpenAI o-series, Claude with extended thinking, Gemini Thinking, DeepSeek R1, Qwen QwQ. Each gives you a knob for how long the model is allowed to deliberate.
Chapter 07
Functions, Tools & Agents
A model on its own only knows what was in its training data. To do anything useful in the real world — read a database, send an invoice, search the web today — it needs to call code. That mechanism is called function calling (or tool use), and it's the difference between a chatbot and an agent.
web_search() · fetch live information
calculator() · arithmetic, units, finance
sql_query() · read your database
send_email() · trigger notifications
create_invoice() · business actions
browser_use() · click, type, navigate
⟶ tool-use loop
01 User · "Email John the Q3 numbers"
02 LLM · I need data, then to send mail
03 sql_query · SELECT revenue FROM q3
04 LLM · Got $4.2M. Compose email.
05 send_email · to: john@…
06 Done · Sent. Confirmed to user.
The loop is simple: the model emits a structured request to call a function, your runtime executes the actual code, the result is handed back as a new message, and the model decides what to do next. The cycle continues — sometimes for dozens of steps — until there's nothing left to call. That's an agent.
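The loop is short enough to write down. A minimal sketch in which `call_model` is a hard-coded stand-in for a real LLM API, and the tools, addresses, and data are all placeholders:

```python
import json

# Placeholder tools: real ones would hit a database and a mail server.
TOOLS = {
    "sql_query":  lambda q: {"revenue": "4.2M"},
    "send_email": lambda to, body: {"status": "sent"},
}

def call_model(messages: list[dict]) -> dict:
    # Stand-in for an LLM call: returns either a tool request or an answer.
    done = sum(1 for m in messages if m["role"] == "tool")
    if done == 0:
        return {"tool": "sql_query", "args": {"q": "SELECT revenue FROM q3"}}
    if done == 1:
        return {"tool": "send_email",
                "args": {"to": "john@example.com", "body": "Q3 revenue: 4.2M"}}
    return {"answer": "Sent. Q3 revenue emailed to John."}

def agent_loop(user_request: str) -> str:
    messages = [{"role": "user", "content": user_request}]
    while True:
        reply = call_model(messages)
        if "answer" in reply:                              # nothing left to call
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])     # your runtime runs it
        messages.append({"role": "tool", "content": json.dumps(result)})

print(agent_loop("Email John the Q3 numbers"))
```

The shape is the whole point: model proposes, runtime executes, result goes back in, repeat until there is an answer instead of a tool call.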
Your code receives the model's structured function call, executes the real API, and returns the result. The model continues from there.
CHATBOT
No tools
Replies from training data only. Can be brilliant at language, useless at facts that change after the cutoff date.
AGENT
Tools + loop
Reads your CRM, runs SQL, sends Slack messages, schedules a call. Capability scales with the toolbelt you give it.
Chapter 08 · Practice
Prompts Are an Interface
Most of the gap between a useless answer and a useful one is in the prompt, not the model. The good news: prompting follows recognisable patterns, and almost all of them are about being specific in the right places.
anatomy of a working prompt · stack from outside in
system
You are a senior code reviewer. Be terse. Always cite line numbers.
You are an editor at The Economist. Cut filler. Replace abstract nouns with concrete verbs. Keep length within ±10%.
[paragraph]
before: 38% → after: 78%
Freeform → JSON schema
before
Extract the date and amount.
after
Return only valid JSON matching:
{"date": "YYYY-MM-DD", "amount_usd": number}
If either is missing, return null for that field.
before: 55% → after: 96%
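One way to hold the model to that contract is to validate its reply before anything downstream touches it. A minimal guard, assuming the field names from the schema above:

```python
import json

def parse_extraction(reply: str) -> dict:
    # Raises if the reply is not JSON or doesn't match the expected shape.
    data = json.loads(reply)
    assert set(data) == {"date", "amount_usd"}, "unexpected keys"
    assert data["date"] is None or isinstance(data["date"], str)
    assert data["amount_usd"] is None or isinstance(data["amount_usd"], (int, float))
    return data

ok = parse_extraction('{"date": "2026-03-31", "amount_usd": 1250.0}')
print(ok["amount_usd"])  # 1250.0
```

Validation failures become retries or fallbacks instead of silent corruption further down the pipeline.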
Treat prompts like code. Version them, test them on a real eval set, and never edit a production prompt without a diff.
Chapter 09
Beyond Text
Text was just the first modality to fall. Today’s frontier models take any combination of words, pictures, sound, and video — and handle them as different views of the same shared embedding space.
four streams · one shared space · unified token sequence
TEXT
words → tokens
IMAGE
16×16 patches → tokens
AUDIO
spectrogram strips → tokens
VIDEO
frames + time → tokens
shared space
unified tokens
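The patch arithmetic for images is concrete. A sketch using the common ViT-style convention of 16×16-pixel patches, not any specific model's numbers:

```python
def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    # A ViT-style encoder slices the image into fixed patches and embeds
    # each patch as one token in the sequence.
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

print(patch_token_count(224, 224))  # a 224x224 image -> 196 image tokens
```

Those 196 tokens then sit in the same sequence as text tokens, which is what "unified token sequence" means in practice.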
frontier models · 2026 · what they accept natively

model            TEXT   IMAGE   AUDIO   VIDEO
Claude 4.x        ●      ●       –       –
GPT · o-series    ●      ●       ●       ●
Gemini 2.x        ●      ●       ●       ●
Llama 4           ●      ●       –       –
Qwen 3            ●      ●       ●       ●
Document understanding
Drop a 100-page contract PDF in. The model treats every page as an image, every paragraph as text, and answers questions across both — no OCR pipeline required.
Voice mode
Speech-in, speech-out, end to end. The same network plans the answer and shapes the prosody. Latencies have dropped from seconds to ~300 ms.
Video Q&A
Sample a few hundred frames + the audio track, embed them in the same space as text, and the model can answer "at what minute does the speaker contradict herself?".
Image generation
A diffusion or autoregressive head turns the same shared embeddings back into pixels. Edit a photo by describing the change in plain English.
Chapter 10
AI in the Real World
Tool use isn't hypothetical. Right now, AI models are folding proteins, designing entirely new molecules, controlling fusion reactors, and forecasting the weather better than the supercomputers they replaced. The most consequential application — and the most expensive problem AI has ever been pointed at — is the discovery of new medicines.
$2.6B
Cost per approved drug
10–15 yr
Discovery → market
90%
Clinical-trial failure rate
10⁶⁰
Drug-like molecules possible
⟶ traditional pipeline · attrition by stage
The 10,000-to-1 funnel
10,000 candidate compounds → 250 hits in screening → 10 lead molecules → 5 preclinical → 1 approved drug
Of every 10,000 candidate molecules a chemist synthesises, roughly one survives clinical trials. AI is reshaping every stage of this funnel — narrowing the search space, designing molecules that don't exist yet, and predicting failure before a single test tube is filled.
How scientists actually use AI
Five concrete stages where models are now part of the lab — not as assistants, but as the engine doing the work.
01
Target identification
Which protein, mutation, or pathway should the drug attack?
Models read every paper, patent, and trial registry ever published, plus genomic and proteomic data from millions of patients. They build a knowledge graph of disease causality — and rank the proteins most likely to be druggable. Insilico's PandaOmics and BenevolentAI's graph engine do exactly this.
02
Structure prediction
From amino-acid sequence to a 3D shape you can dock molecules into
A protein's function is determined by how it folds — and folding was an unsolved problem for 50 years. AlphaFold 2 (2020) and AlphaFold 3 (2024) collapsed it. Predictions accurate to within an atom, in seconds, for any sequence on Earth. The full structural proteome — over 200 million proteins — is now public.
seq: MKTAYIAKQRQISFVKSHFSRQLEERLG… → 3D structure
03
Generative chemistry
Design new molecules that don't exist yet
Diffusion models and graph VAEs trained on tens of millions of known compounds learn the latent space of valid chemistry. Given a binding pocket, they generate novel molecules optimised for affinity, solubility, and synthesisability — searching a space of 10⁶⁰ possible drugs that no human team could enumerate.
generated · sample 0341 / 12,800
Binding affinity · 82%
Solubility · 71%
Synthesisability · 64%
Toxicity · 18%
Each generated candidate is scored on multiple objectives in parallel — the model learns to optimise all of them at once.
04
Automated wet labs · phenomics
Robots run the experiments, vision models read the results
Recursion's labs image millions of human cells per week, perturbed by thousands of compounds and gene knockouts. Self-supervised CNNs convert each image into an embedding — phenotypes that look the same end up in the same neighbourhood, revealing molecules that 'rescue' diseased cells without anyone needing to know the mechanism.
confirmed rescueneighbour in embeddingbaseline / no effect
05
Clinical-trial acceleration
Predict failure before the patient enrolment opens
Models trained on decades of historical trials predict which compounds will fail Phase 2 toxicity, suggest patient-stratification cohorts, and even generate digital twin control arms. Less wasted money, fewer wasted years, less human risk.
MOL-091 · 18% · pass
MOL-092 · 74% · flag
MOL-093 · 92% · fail
MOL-094 · 32% · pass
MOL-095 · 61% · flag
Predicted Phase-2 failure risk · trained on 30+ years of historical trial outcomes. Flagged compounds are re-engineered before a single patient is recruited.
The toolbelt of a 2026 computational scientist
Six platforms doing the heaviest lifting today. Some are open weights you can run on a workstation; some are commercial pipelines worth multi-billion-dollar deals.
AlphaFold 3
structure
Google DeepMind / Isomorphic
Predicts the 3D structure of proteins, DNA, RNA, and ligand complexes from sequence alone. Solved 200M+ structures publicly.
RFdiffusion
design
Baker Lab, U. of Washington
A diffusion model that designs entirely new proteins from scratch — binders, enzymes, scaffolds. Its creator, David Baker, shared the 2024 Nobel Prize in Chemistry for computational protein design.
Boltz-1 / Chai-1
docking
MIT · Chai Discovery
Open-weights successors to AlphaFold for protein–ligand docking. Lab-runnable, no API gatekeeping.
GNoME
materials
Google DeepMind
2.2 million new crystal structures predicted — 380,000 of them stable. An 800-year leap in materials science in one model.
Pharma.AI
pipeline
Insilico Medicine
End-to-end pipeline: target discovery (PandaOmics) + generative chemistry (Chemistry42). First AI-designed drug now in Phase 2.
Recursion OS
phenomics
Recursion Pharmaceuticals
Robotic labs run millions of cell-imaging experiments per week; CNNs cluster phenotypes to find drug candidates by visual similarity.
Companies already shipping
Not research papers — actual molecules in actual humans, or partnerships where Big Pharma is paying real money for AI-designed candidates.
Isomorphic Labs
Alphabet · DeepMind spin-out
$3B+
partnership value
AlphaFold-powered drug design
Founded 2021. Partnerships with Eli Lilly and Novartis worth $3B+ in milestones. Uses AlphaFold 3 to model how candidate molecules interact with disease-causing proteins — collapsing months of crystallography into minutes.
Insilico Medicine
Hong Kong · NYC
30 mo.
discovery → clinic
First end-to-end AI-designed drug in human trials
INS018_055 — a treatment for idiopathic pulmonary fibrosis (IPF) — was discovered, designed, and brought to Phase 1 in under 30 months for ~$3M. Now in Phase 2 trials, the first drug where both the target and the molecule came from AI.
Recursion
Salt Lake City · NASDAQ: RXRX
~50M
experiments / week
Phenomics + automated wet labs
Robotic labs image millions of human cells under thousands of perturbations every week. Self-supervised vision models cluster phenotypes; matches reveal which molecules rescue diseased cells. 10+ programs in or near the clinic.
BenevolentAI
London · LSE: BAI
1B+
graph relations
Knowledge-graph reasoning over biomedical literature
A graph of 1B+ relationships from papers, patents, and clinical data. In 2020 their model proposed baricitinib for COVID-19 within 48 hours; the FDA later approved it. Now applied to ALS, ulcerative colitis, and chronic kidney disease.
Beyond medicine — six other frontiers
The same recipe — train a large model on a domain's data, then let it generate or predict what experiments would have taken decades to find. It works almost everywhere it's been tried.
Weather
GraphCast · Aurora
A graph neural network that beats the European supercomputer model on 10-day forecasts — and runs in under a minute on a single TPU instead of hours on a cluster.
Fusion
DeepMind × EPFL
Reinforcement learning controls the magnetic coils of a tokamak in real time, holding plasma in shapes humans never managed to stabilise — a step toward commercial fusion.
Mathematics
AlphaProof · AlphaGeometry 2
Solved 4 of 6 problems at the 2024 International Math Olympiad — silver-medal performance. Geometry was solved by combining a language model with a symbolic deduction engine.
Astronomy
LIGO · Vera Rubin pipelines
CNNs scan gravitational-wave streams for black-hole mergers in real time, and triage tens of millions of nightly transient detections from new sky surveys.
Climate
NeuralGCM
Hybrid neural / physics climate model from Google. Atmospheric simulations 100,000× cheaper than the legacy spectral solvers used by national weather services.
Robotics
RT-2 · Optimus · Figure
Vision-language-action models map a camera frame and a sentence (“pick up the red mug”) directly to motor torques — a generalist policy instead of bespoke per-task code.
None of these systems are general intelligence. They are narrow, domain-specific function approximators trained on data nobody could sift through manually. The shift is that the bottleneck in science used to be human imagination over a tiny search space; the bottleneck is now wet-lab validation of a search space the models can canvass in an afternoon.
Chapter 11
How "Smart" Gets Measured
Every model release ships with a battery of benchmark scores. They're a microscope, not a mirror — useful for comparing neighbours, dangerous if you confuse them with the territory. Numbers below are illustrative of where the frontier sits in 2026.
MMLU
/ 100
undergraduate-level general knowledge across 57 subjects
Claude Opus 4.x · 89
GPT-5 · 91
Gemini 2.x Pro · 90
Llama 4 · 84
DeepSeek V3 · 86
HumanEval
/ 100
164 Python coding problems with hidden unit tests
Claude Opus 4.x · 95
GPT-5 · 94
Gemini 2.x Pro · 90
Qwen 3-Coder · 92
DeepSeek V3 · 89
GPQA Diamond
/ 100
PhD-level multiple choice in biology, physics, chemistry
o-series reasoning · 78
Claude (extended thinking) · 75
Gemini 2.x Thinking · 73
DeepSeek R1 · 71
Vanilla Claude Sonnet · 58
SWE-Bench Verified
/ 100
real GitHub issues — patch must pass the existing test suite
Claude Opus 4.x · agent · 72
GPT-5 · agent · 68
Gemini 2.x · agent · 60
Qwen 3-Coder · agent · 55
Best 2024 model · 42
A benchmark is a microscope, not a mirror.
The only number that matters is how the model performs on your data and your task. Build a 50-example eval set on day one — every model decision after that gets easier.
What each one measures
MMLU = knowledge. HumanEval = isolated coding. GPQA = reasoning under uncertainty. SWE-Bench = doing real engineering work end to end. None of them measures whether the model is actually useful for your job.
The saturation problem
Top models now sit within a few points of each other on MMLU and HumanEval. The benchmarks have stopped discriminating — newer ones like FrontierMath, ARC-AGI-2, and SWE-Bench Verified are taking their place.
The gap to real work
A model that scores 95 on HumanEval can still fail at fixing your bug. Synthetic tasks reward narrow skill; production code rewards reading 10 files, running tests, and arguing with the linter.
Chapter 12
The 2026 Model Landscape
Eight families dominate production traffic. Half are closed APIs run by their creators; half are open weights you can download, fine-tune, and host yourself. Pick by task, cost, and where the data is allowed to live.
Anthropic
Claude
closed
Opus 4.x · Sonnet 4.x
context
200K – 1M tokens
hosting
cloud
long-context reasoning · coding · tool use · safety
Constitutional AI alignment. Strong at multi-step agent loops.
OpenAI
GPT / o-series
closed
GPT-4.x · o3 · o4
context
128K – 1M tokens
hosting
cloud
general purpose · reasoning · voice · images
Reasoning variants spend inference compute on chain-of-thought.
Google
Gemini
closed
Gemini 2.x Pro / Flash
context
up to 2M tokens
hosting
cloud
multimodal · huge context · native video
Tightly integrated with Google services. Strong on image + video.
Meta
Llama
open weights
Llama 3.x · 4
context
128K tokens
hosting
cloud or local
open weights · fine-tunable · strong base
The de-facto open foundation. Runs on your hardware if you have the RAM.
Mistral
Mistral / Mixtral
open weights
Large 2 · Mixtral 8×22B
context
32K – 128K tokens
hosting
cloud or local
MoE efficiency · multilingual · compact
European, Apache-licensed. Mixture-of-experts gives big-model quality at small-model cost.
xAI
Grok
closed
Grok 3 / 4
context
128K+ tokens
hosting
cloud
real-time data · long reasoning
Tight integration with X. Trained on a very large compute cluster.
DeepSeek
DeepSeek
open weights
V3 · R1
context
128K tokens
hosting
cloud or local
cost · reasoning · open R1 weights
Open reasoning model that rattled the market in early 2025.
Alibaba
Qwen
open weights
Qwen 3 / 3-Coder
context
128K – 1M tokens
hosting
cloud or local
multilingual · coding · small + large variants
Extremely strong open family across many sizes. Great at Asian languages.
Benchmarks shift week to week and rarely match real-world performance on your task. The right answer is usually: pick two candidates from different vendors, build the same eval set on your own data, and let the numbers decide.
Chapter 13
The Economics of Inference
A frontier model costs ~400× as much per token as a local one. The hard part of building with AI is no longer "can it do this?" — it's "which tier should I be paying for, and where?".
price ladder · USD per 1M tokens · illustrative · 2026
Frontier · Opus 4.x · GPT-5 · Gemini 2.x Pro
$15 in / $75 out
Mid · Sonnet 4.x · GPT-5 mini · Gemini Flash
$3 in / $15 out
Cheap · Haiku · GPT-5 nano · open weights
$0.25 in / $1.25 out
Local · Llama 4 / Qwen 3 · Mac Studio / RTX
~$0 · electricity
Output tokens cost 4–5× input tokens because they're generated one at a time, with full GPU memory pressure each step. That's why "answer in JSON, not prose" can quietly halve your bill.
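At these prices, a cost estimate is one multiplication per direction. A back-of-envelope sketch using the illustrative ladder above:

```python
# Illustrative prices from the ladder above: USD per 1M (input, output) tokens.
PRICES = {
    "frontier": (15.00, 75.00),
    "mid":      (3.00, 15.00),
    "cheap":    (0.25, 1.25),
    "local":    (0.00, 0.00),   # electricity excluded
}

def cost_usd(tier: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES[tier]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# 1.2M tokens in, 200K out: $18 of input + $15 of output on the frontier tier.
print(cost_usd("frontier", 1_200_000, 200_000))  # 33.0
```

Because output is priced 4–5× higher, halving output tokens (terse JSON instead of prose) cuts the expensive term of that sum in half.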
same task · different tier
Summarise 100 PDFs · ~1.2M in / 200K out
Frontier
$33.00
Mid
$6.60
Cheap
$0.55
Local
~$0.05
At the list prices above, the frontier run is $18 of input plus $15 of output. That makes the frontier tier roughly 660× the local tier on paper. Whether it's worth that gap depends entirely on whether you can tell the difference in the output.
scaling laws · the Chinchilla insightlog-log
DeepMind's 2022 result: at any given compute budget, the best model is smaller than people thought, but trained on far more tokens. The race for parameter count was partly a misallocation — and that's why a well-trained 70B can beat an under-trained 500B.
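The Chinchilla result is often quoted as a rule of thumb of roughly 20 training tokens per parameter. A sketch of that approximation; the constant is a heuristic, not an exact law:

```python
def chinchilla_tokens(n_params: float) -> float:
    # Compute-optimal token budget, using the commonly cited ~20:1 heuristic
    # from DeepMind's 2022 Chinchilla paper.
    return 20.0 * n_params

# A 70B-parameter model is compute-optimal at roughly 1.4T tokens.
# Modern open models deliberately train far past this point (e.g. ~15T)
# to buy extra quality per parameter at inference time.
print(f"{chinchilla_tokens(70e9):.2e}")  # 1.40e+12
```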
Pick a tier per task, not per app
Classification, extraction, simple summaries — the cheap tier is enough. Save the frontier for the steps that actually need it: planning, debugging, novel reasoning.
Caching is a 60–90% discount
Provider-side prompt caching reuses the system prompt and long context across calls. For a stable agent loop, the difference is measured in orders of magnitude on the bill.
Local cost ≈ electricity
Once the GPU is bought, running a local 70B model is a few cents per million tokens — but only if you have steady throughput to amortise the hardware.
Chapter 14
Local vs Cloud Models
The single most consequential architectural decision in any AI system: does the model run inside your infrastructure, or do you send every prompt to someone else's GPUs? Both are valid — they trade different things.
Compliance · local: easier (HIPAA, GDPR, on-prem) · cloud: depends on provider DPA
Updates · local: you re-pull weights · cloud: automatic
Multimodal · local: limited (image/audio LLMs growing) · cloud: native voice + video
Scaling · local: add GPUs → linear · cloud: elastic, instant
The hybrid answer
Most production systems we build at Vorcl are hybrid: frontier closed models do the hardest reasoning, a fine-tuned local model handles the high-volume, sensitive, or repetitive work, and a router decides per request. Privacy and cost stay bounded; quality stays at the ceiling where it matters.
Reference
Twenty Words That Cover the Field
A short, pin-this-to-the-fridge glossary. If you remember these, you can read almost any AI paper, blog post, or release note without getting lost.
token
01
A chunk of text the model sees as one unit. ~4 chars on average. ~50K of them in the vocabulary.
parameter
02
A learned number inside the network. Frontier models have 100B – 2T of them.
context window
03
How much the model can read at once, measured in tokens. 200K is comfortable; 1M is the new ceiling.
embedding
04
A list of numbers that represents meaning. Words, images, and audio can all become embeddings.
attention
05
The Transformer trick that lets every token decide how much each other token matters.
transformer
06
The neural net architecture, introduced in 2017, that almost every modern LLM is built on.
fine-tuning
07
Continuing training on a smaller, task-specific dataset to nudge a base model toward a behaviour.
RLHF
08
Reinforcement Learning from Human Feedback. Humans rank outputs; the model learns to prefer the winners.
temperature
09
A sampling knob. 0 = always pick the most likely token. Higher = more creative, more chaotic.
top-p
10
Nucleus sampling. Keep only tokens whose cumulative probability sums to p, then pick from those.
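The nucleus filter is a few lines. A sketch reusing the probabilities from the sampling example in Chapter 3:

```python
def top_p_filter(probs: dict[str, float], p: float = 0.9) -> dict[str, float]:
    # Keep the highest-probability tokens until their cumulative mass
    # reaches p, then renormalise what's left.
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept.items()}

probs = {"mat": 0.42, "floor": 0.21, "chair": 0.13, "roof": 0.04}
print(top_p_filter(probs, p=0.6))  # only "mat" and "floor" survive
```

In practice the filter runs on the full vocabulary each step, and sampling then happens inside the surviving nucleus.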
hallucination
11
When a model confidently states something untrue. A side effect of being trained to sound right, not be right.
RAG
12
Retrieval-Augmented Generation. Look up relevant docs first, paste them into the prompt, then answer.
agent
13
An LLM in a loop that can call tools, observe their output, and decide what to do next.
tool use
14
The mechanism that lets a model emit a JSON function call instead of free-form text.
MCP
15
Model Context Protocol. A standard for letting any model connect to any tool or data source.
reasoning model
16
A model trained to generate a long internal chain of thought before its final answer.
multimodal
17
Accepts more than one kind of input — text, images, audio, video — and reasons across them.
quantization
18
Compressing model weights from 16-bit floats to 8 or 4 bits. Smaller, faster, slightly dumber.
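A sketch of the int8 round-trip, using simple symmetric per-tensor scaling; real schemes quantize per block and keep the scales in higher precision:

```python
def quantize(weights: list[float]) -> tuple[list[int], float]:
    # Map floats in [-max|w|, max|w|] onto integers in [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

w = [0.12, -0.5, 0.033, 0.25]
q, s = quantize(w)
approx = dequantize(q, s)
# The round-trip is close but not exact: that rounding error is the
# "slightly dumber" part of the trade.
print(max(abs(a - b) for a, b in zip(w, approx)))
```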
MoE
19
Mixture of Experts. Only a subset of parameters fire per token, so a 400B model runs like a 40B.
distillation
20
Training a small model to imitate a big one. Cheap inference, most of the quality.
End of lesson
Now Put It to Work.
You've seen how the model works. The hard part is choosing the right one for your data, wiring it into your stack, and making it safe to put in front of customers. That's the job we do.