A long-form, animated walkthrough of how artificial intelligence evolved over seventy years — and how today's large language models actually think, call tools, and run on either a remote datacenter or the laptop on your desk.
70 Years of AI
4 Paradigm Shifts
1.8T+ Parameters (Frontier)
∞ Tools Reachable
Chapter 01
Seventy Years in Two Minutes
AI did not begin with ChatGPT. It is a slow accumulation of three ideas — symbolic reasoning, learning from data, and scale — punctuated by a few moments where everything changed at once.
1956
Dartmouth Workshop
McCarthy, Minsky, Shannon and Rochester coin the term "artificial intelligence" and set the field in motion.
1958
Perceptron
Rosenblatt builds the first learning neural network — a single-layer linear classifier on custom hardware.
1980s
Expert Systems
Hand-coded rule engines (XCON, MYCIN) automate niche professional reasoning. Brittle, but commercially real.
1986
Backpropagation
Rumelhart, Hinton and Williams popularize the algorithm that lets multi-layer networks actually learn.
1997
Deep Blue beats Kasparov
IBM's search-based engine wins a 6-game match. Brute force + heuristics, not learning — but a public watershed.
2012
AlexNet
Krizhevsky, Sutskever and Hinton crush ImageNet with a deep CNN on two GPUs. Modern deep learning era opens.
2014
GANs · Seq2Seq
Generative adversarial networks (Goodfellow) and encoder-decoder translation models redefine generation.
2017
Attention Is All You Need
Vaswani et al. publish the Transformer. Self-attention replaces recurrence — every modern LLM descends from this.
2018
BERT · GPT-1
Pretraining on raw text becomes the dominant recipe. Language models stop being task-specific.
2020
GPT-3
175B parameters. Few-shot prompting works. Scaling laws (Kaplan et al.) suggest the ride is far from over.
2022
ChatGPT
RLHF turns GPT-3.5 into a usable assistant. 100M users in 2 months — fastest consumer adoption in history.
2023
GPT-4 · Llama 2
Multimodal frontier closed models and the first competitive open-weights family ship within months of each other.
2024
Tool-Use & Agents
Function calling, MCP, computer-use. LLMs stop being chat boxes and start operating real software.
2025–26
Reasoning Models
o-series, Claude, Gemini reasoning variants spend inference compute on chain-of-thought. Local 70B-class models match 2023 frontier.
14 milestones
Chapter 02
What Is a Neural Network?
A neural network is a graph of weighted multiplications and non-linear squashing functions. That's it. Everything else — vision, language, reasoning — is what emerges when you stack enough of them and feed them enough data.
01
Weights
Each connection carries a number. Training adjusts those numbers — billions of them — so the output gets closer to the right answer.
02
Activation
Each node sums its inputs and squashes the result through a non-linear function (ReLU, GELU). Without that step, the whole network collapses to a line.
03
Backprop
The error at the output flows backward through the graph. Each weight learns how much it contributed and nudges itself in the right direction.
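The weighted-sum-then-squash step is small enough to write out. A toy neuron as a sketch, with invented weights and no library API assumed:

```python
# A single artificial neuron: weighted sum of inputs, then a non-linear
# "squash" (here ReLU). All numbers are made up for illustration.
def relu(x: float) -> float:
    return max(0.0, x)

def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(s)

# Without relu() this is just a linear function; stacking linear layers
# still gives a line, which is why the non-linearity matters.
print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))  # 0.5 - 0.5 + 0.1 = 0.1
```

Training is then a matter of nudging `weights` and `bias` so the printed number moves toward the target.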
Chapter 03
How an LLM Actually Thinks
A large language model does exactly one thing: it predicts the next token. Everything else — answering questions, writing code, holding a conversation — is a side effect of doing that very, very well.
01
Tokenization
Text → numbers
Your sentence is sliced into ~50,000-piece vocabulary chunks called tokens — sometimes whole words, often sub-words. Each token becomes an integer ID the model can address.
input: "The cat sat on the mat" → The · cat · sat · on · the · mat
ids: 464 · 3797 · 3332 · 319 · 262 · 2603
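A sketch of that lookup with a tiny hand-made vocabulary. A real BPE tokenizer learns its ~50,000 pieces from data; the IDs below are just the ones from the example above:

```python
# Toy tokenizer sketch: greedy longest-prefix lookup over a hand-made
# vocabulary. Real tokenizers learn these pieces (and their merges) from data.
VOCAB = {"The": 464, " cat": 3797, " sat": 3332, " on": 319, " the": 262, " mat": 2603}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    ids, rest = [], text
    while rest:
        # longest matching piece wins, like BPE's final lookup step
        piece = max((p for p in vocab if rest.startswith(p)), key=len)
        ids.append(vocab[piece])
        rest = rest[len(piece):]
    return ids

print(encode("The cat sat on the mat", VOCAB))  # [464, 3797, 3332, 319, 262, 2603]
```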
02
Embeddings
Numbers → meaning vectors
Each token ID is looked up in a giant matrix and becomes a vector — typically 4,096 to 16,384 numbers. Positions in that high-dimensional space encode meaning: 'king' and 'queen' end up near each other, and far from 'sandwich'.
The · cat · sat · on · the · mat — each token becomes a vector of dim 4096
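A minimal sketch of "positions encode meaning", using invented 4-dimensional vectors instead of 4,096:

```python
import math

# Toy "embeddings": invented so that related words point the same way.
emb = {
    "king":     [0.9, 0.8, 0.1, 0.3],
    "queen":    [0.8, 0.9, 0.1, 0.2],
    "sandwich": [0.0, 0.1, 0.9, 0.7],
}

def cosine(a: list[float], b: list[float]) -> float:
    # cosine similarity: dot product divided by the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["sandwich"])
```

The assertion is the whole idea: "king" sits closer to "queen" than to "sandwich", and everything built on embeddings reduces to comparisons like this.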
03
Self-Attention
Every token reads every other token
The Transformer's core trick. For each token, the model computes how much attention to pay to every previous token. That's how it knows which 'it' refers to which 'cat', or that a function argument relates to a return type 200 lines away.
attention heatmap · row = current token · column = what it looks at
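The look-back-and-weigh idea behind that heatmap can be written out directly. A minimal causal self-attention sketch in plain Python, skipping the learned query/key/value projections a real Transformer has:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(vectors: list[list[float]]) -> list[list[float]]:
    # Single head, no projections: each vector is its own query, key, value.
    d = len(vectors[0])
    out = []
    for i, q in enumerate(vectors):
        # score the current token against itself and every earlier token (causal)
        scores = [sum(qx * kx for qx, kx in zip(q, k)) / math.sqrt(d)
                  for k in vectors[: i + 1]]
        weights = softmax(scores)
        # the output is a weighted average of the attended vectors
        out.append([sum(w * v[j] for w, v in zip(weights, vectors[: i + 1]))
                    for j in range(d)])
    return out

toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(toy)
# the first token can only attend to itself, so its output is unchanged
```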
04
Sampling
A probability distribution → one word
The final layer produces a probability for every token in the vocabulary. The model either picks the most likely (greedy), or samples with a temperature setting that controls how 'creative' it is. Then it loops — that one new token becomes input to the next prediction.
mat 42% · floor 21% · chair 13% · roof 4%
→ next token: "mat" · loop and continue
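A sketch of that sampling step, with invented logits for the candidates above:

```python
import math, random

candidates = ["mat", "floor", "chair", "roof"]
logits     = [2.0,   1.3,     0.8,     -0.4]   # invented raw scores

def sample(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    if temperature == 0:
        # greedy: always the most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # draw one index according to the distribution
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

print(candidates[sample(logits, temperature=0)])  # greedy always picks "mat"
```

Higher temperatures flatten the distribution, so "floor" and "chair" start winning draws; temperature 0 is deterministic.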
Chapter 04 · Deep Dive
Words Become Coordinates
Inside a model, every word, sentence, image, and snippet of code is a point in a high-dimensional space. Similar things land near each other. Most of what makes AI feel smart is just clever geometry on those points.
embedding space · 2D projection of ~3,072 dimensions · t-SNE / UMAP
The classic Word2Vec demonstration. The direction from man to king is roughly the same as the direction from woman to queen — the model has learned an axis for "royalty" without ever being told the word.
cosine similarity · −1 to 1
king ↔ queen
0.86
king ↔ pizza
0.07
dog ↔ wolf
0.74
paris ↔ tokyo
0.62
code ↔ compiler
0.81
01
Hundreds of dimensions
Real embeddings live in 1,024–3,072 dimensions, not two. Each axis encodes some learned aspect of meaning — gender, formality, animacy, intent. We can only draw two of them, but the model uses all of them at once.
02
Distance = meaning
Cosine similarity between two vectors is how a model judges relatedness. Nearest-neighbor search over millions of embeddings is how RAG, semantic search, recommendation, and de-duplication all work under the hood.
03
Same trick for everything
Embed images, audio, code, even DNA — the same vector space lets you search across modalities. CLIP famously embedded text and pictures into a shared space, which is why you can search photos with a sentence.
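That nearest-neighbour search is a few lines once embeddings exist. A toy sketch of the retrieval step behind RAG and semantic search, with invented 3-dimensional document vectors:

```python
import math

# Invented document embeddings; real ones come from an embedding model.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.8, 0.2, 0.1],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query: list[float], docs: dict, k: int = 2) -> list[str]:
    # rank every document by similarity to the query, keep the best k
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.0, 0.0], docs))  # ['doc_a', 'doc_c']
```

Production systems swap the linear scan for an approximate index, but the ranking criterion is the same.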
Chapter 05
How the Model Became Smart
A frontier LLM is not trained in one pass — it is four training stages stacked on top of each other. Most of the cost lives in the first stage; most of the personality lives in the last three.
01
Pretraining
A trillion tokens of the internet
Predict the next word — over and over — across books, code, papers, and web pages. After ~10²⁵ FLOPs the model picks up grammar, facts, reasoning patterns, and the structure of dozens of languages without ever being told what any of them are.
~15T tokens
~6 months on 25,000 GPUs
02
Supervised fine-tuning
Show, don’t tell
Hand-written instruction → response pairs teach the base model what a helpful answer looks like. Now it stops auto-completing the prompt and starts addressing it.
~100K – 1M pairs
humans + curated demonstrations
03
RLHF
Humans rank, the model learns the ranking
For each prompt, generate two answers. Ask a human which is better. Train a reward model on those preferences, then use reinforcement learning to push the LLM toward higher-rewarded outputs.
preference data
reward model + PPO / DPO
04
Constitutional / RLAIF
The model critiques itself
Replace most of the human raters with another AI guided by a written constitution — a list of principles the model should respect. Faster, cheaper, and the rules are auditable text instead of a frozen reward model.
+ written principles
Anthropic’s approach
pretraining loss · loss vs tokens seen
Loss falls fast in the first trillion tokens, then slows to a grind. The last few percent of capability cost more compute than everything before them.
RLHF · one preference round
response A
Sure! Here's a list of three reasons, with citations and a short summary at the end.
▲ chosen
response B
Yeah whatever, here you go.
▼ rejected
reward model
+0.78
Multiply this round by millions of preference comparisons and you have a numerical model of taste — strong enough to steer a 100B+ parameter network.
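The round above is typically turned into a training signal with a Bradley–Terry style loss: the reward model is pushed to score the chosen response above the rejected one. A sketch with invented scores:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): small when the chosen response
    # already out-scores the rejected one, large when it doesn't
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

close = preference_loss(0.78, 0.60)   # reward model barely prefers the winner
wide  = preference_loss(0.78, -2.0)   # reward model strongly prefers the winner
assert wide < close                   # a bigger score gap means a smaller loss
```

Averaged over millions of comparisons, minimising this loss is what turns human rankings into a numerical model of taste.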
Scale vs alignment
Bigger pretraining gives a smarter base model — but a smarter model with no alignment is a worse product, not a better one. The frontier is in the post-training, not the parameter count.
Helpfulness vs safety
Push too hard for helpful and the model will help with anything. Push too hard for safe and it refuses to write a poem. Modern training treats this as a multi-objective optimisation, not a single dial.
Why models still hallucinate
Pretraining rewards plausible text, not true text. Post-training reduces hallucinations but never eliminates them — the model still has no built-in notion of uncertainty unless it was trained explicitly to express one.
Chapter 06
Thinking, Out Loud
A reasoning model is a regular LLM that has been trained to generate a long internal monologue before its final answer. It buys correctness with tokens — and sometimes that trade is the only one that works.
vanilla LLM0.4s
prompt: Three boxes are each mislabeled. One has apples, one oranges, one mixed. How many fruits must you draw to label them all correctly?
answer
Two — one from each labeled box.
wrong
reasoning model~12s · 4,200 thinking tokens
prompt: Three boxes are each mislabeled. One has apples, one oranges, one mixed. How many fruits must you draw to label them all correctly?
thinking trace
›Let me re-read the puzzle.
›There are 3 boxes. Each is mislabeled.
›I pick from the box labeled "mixed".
›If I draw an apple, that box must be "apples".
›Then the "apples" label is wrong, so it’s "oranges".
›And the "oranges" label is wrong, so it’s "mixed".
›One draw is enough. Verifying… consistent.
answer
One. Drawing from the box labeled "mixed" is enough.
correct
tree of thought · explore · prune · commit
Some training recipes encourage the model to fan out — try several short hypotheses, evaluate each, prune the bad ones, then commit. The visible answer is the survivor of a tournament the user never sees.
explored · kept · committed
Where reasoning helps
Math, formal logic, multi-step coding, debugging, planning. Anything where one wrong sub-step poisons the rest of the answer benefits from being able to backtrack.
The latency tax
A reasoning model can spend 10–60 seconds (and 5–20× the tokens) before its first visible output. Worth it for a hard answer; pure overhead for "what time is it in Paris".
The 2026 lineup
OpenAI o-series, Claude with extended thinking, Gemini Thinking, DeepSeek R1, Qwen QwQ. Each gives you a knob for how long the model is allowed to deliberate.
Chapter 07
Functions, Tools & Agents
A model on its own only knows what was in its training data. To do anything useful in the real world — read a database, send an invoice, search the web today — it needs to call code. That mechanism is called function calling (or tool use), and it's the difference between a chatbot and an agent.
web_search() · fetch live information
calculator() · arithmetic, units, finance
sql_query() · read your database
send_email() · trigger notifications
create_invoice() · business actions
browser_use() · click, type, navigate
⟶ tool-use loop
01 User · "Email John the Q3 numbers"
02 LLM · I need data, then to send mail
03 sql_query · SELECT revenue FROM q3
04 LLM · Got $4.2M. Compose email.
05 send_email · to: john@…
06 Done · Sent. Confirmed to user.
The loop is simple: the model emits a structured request to call a function, your runtime executes the actual code, the result is handed back as a new message, and the model decides what to do next. The cycle continues — sometimes for dozens of steps — until there's nothing left to call. That's an agent.
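The loop is short enough to write down. A minimal sketch in which `call_model` is a hard-coded stand-in for a real LLM API, and the tools, addresses, and data are all placeholders:

```python
import json

# Placeholder tools: real ones would hit a database and a mail server.
TOOLS = {
    "sql_query":  lambda q: {"revenue": "4.2M"},
    "send_email": lambda to, body: {"status": "sent"},
}

def call_model(messages: list[dict]) -> dict:
    # Stand-in for an LLM call: returns either a tool request or an answer.
    done = sum(1 for m in messages if m["role"] == "tool")
    if done == 0:
        return {"tool": "sql_query", "args": {"q": "SELECT revenue FROM q3"}}
    if done == 1:
        return {"tool": "send_email",
                "args": {"to": "john@example.com", "body": "Q3 revenue: 4.2M"}}
    return {"answer": "Sent. Q3 revenue emailed to John."}

def agent_loop(user_request: str) -> str:
    messages = [{"role": "user", "content": user_request}]
    while True:
        reply = call_model(messages)
        if "answer" in reply:                              # nothing left to call
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])     # your runtime runs it
        messages.append({"role": "tool", "content": json.dumps(result)})

print(agent_loop("Email John the Q3 numbers"))
```

The shape is the whole point: model proposes, runtime executes, result goes back in, repeat until there is an answer instead of a tool call.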
Your code receives the model's structured function call, executes the real API, and returns the result. The model continues from there.
CHATBOT
No tools
Replies from training data only. Can be brilliant at language, useless at facts that change after the cutoff date.
AGENT
Tools + loop
Reads your CRM, runs SQL, sends Slack messages, schedules a call. Capability scales with the toolbelt you give it.
Chapter 08 · Practice
Prompts Are an Interface
Most of the gap between a useless answer and a useful one is in the prompt, not the model. The good news: prompting follows recognisable patterns, and almost all of them are about being specific in the right places.
anatomy of a working prompt · stack from outside in
system
You are a senior code reviewer. Be terse. Always cite line numbers.
You are an editor at The Economist. Cut filler. Replace abstract nouns with concrete verbs. Keep length within ±10%.
[paragraph]
before: 38% → after: 78%
Freeform → JSON schema
before
Extract the date and amount.
after
Return only valid JSON matching:
{"date": "YYYY-MM-DD", "amount_usd": number}
If either is missing, return null for that field.
before: 55% → after: 96%
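One way to hold the model to that contract is to validate its reply before anything downstream touches it. A minimal guard, assuming the field names from the schema above:

```python
import json

def parse_extraction(reply: str) -> dict:
    # Raises if the reply is not JSON or doesn't match the expected shape.
    data = json.loads(reply)
    assert set(data) == {"date", "amount_usd"}, "unexpected keys"
    assert data["date"] is None or isinstance(data["date"], str)
    assert data["amount_usd"] is None or isinstance(data["amount_usd"], (int, float))
    return data

ok = parse_extraction('{"date": "2026-03-31", "amount_usd": 1250.0}')
print(ok["amount_usd"])  # 1250.0
```

Validation failures become retries or fallbacks instead of silent corruption further down the pipeline.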
Treat prompts like code. Version them, test them on a real eval set, and never edit a production prompt without a diff.
Chapter 09
Beyond Text
Text was just the first modality to fall. Today’s frontier models take any combination of words, pictures, sound, and video — and handle them as different views of the same shared embedding space.
four streams · one shared space · unified token sequence
TEXT
words → tokens
IMAGE
16×16 patches → tokens
AUDIO
spectrogram strips → tokens
VIDEO
frames + time → tokens
shared space
unified tokens
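The patch arithmetic for images is concrete. A sketch using the common ViT-style convention of 16×16-pixel patches, not any specific model's numbers:

```python
def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    # A ViT-style encoder slices the image into fixed patches and embeds
    # each patch as one token in the sequence.
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

print(patch_token_count(224, 224))  # a 224x224 image -> 196 image tokens
```

Those 196 tokens then sit in the same sequence as text tokens, which is what "unified token sequence" means in practice.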
frontier models · 2026 · what they accept natively

model            TEXT   IMAGE   AUDIO   VIDEO
Claude 4.x        ●      ●       –       –
GPT · o-series    ●      ●       ●       ●
Gemini 2.x        ●      ●       ●       ●
Llama 4           ●      ●       –       –
Qwen 3            ●      ●       ●       ●
Document understanding
Drop a 100-page contract PDF in. The model treats every page as an image, every paragraph as text, and answers questions across both — no OCR pipeline required.
Voice mode
Speech-in, speech-out, end to end. The same network plans the answer and shapes the prosody. Latencies have dropped from seconds to ~300 ms.
Video Q&A
Sample a few hundred frames + the audio track, embed them in the same space as text, and the model can answer "at what minute does the speaker contradict herself?".
Image generation
A diffusion or autoregressive head turns the same shared embeddings back into pixels. Edit a photo by describing the change in plain English.
Chapter 10
AI in the Real World
Tool use isn't hypothetical. Right now, AI models are folding proteins, designing entirely new molecules, controlling fusion reactors, and forecasting the weather better than the supercomputers they replaced. The most consequential application — and the most expensive problem AI has ever been pointed at — is the discovery of new medicines.
$2.6B
Cost per approved drug
10–15 yr
Discovery → market
90%
Clinical-trial failure rate
10⁶⁰
Drug-like molecules possible
⟶ traditional pipeline · attrition by stage
The 10,000-to-1 funnel
10,000 candidate compounds → 250 hits in screening → 10 lead molecules → 5 preclinical → 1 approved drug
Of every 10,000 candidate molecules a chemist synthesises, roughly one survives clinical trials. AI is reshaping every stage of this funnel — narrowing the search space, designing molecules that don't exist yet, and predicting failure before a single test tube is filled.
How scientists actually use AI
Five concrete stages where models are now part of the lab — not as assistants, but as the engine doing the work.
01
Target identification
Which protein, mutation, or pathway should the drug attack?
Models read every paper, patent, and trial registry ever published, plus genomic and proteomic data from millions of patients. They build a knowledge graph of disease causality — and rank the proteins most likely to be druggable. Insilico's PandaOmics and BenevolentAI's graph engine do exactly this.
02
Structure prediction
From amino-acid sequence to a 3D shape you can dock molecules into
A protein's function is determined by how it folds — and folding was an unsolved problem for 50 years. AlphaFold 2 (2020) and AlphaFold 3 (2024) collapsed it. Predictions accurate to within an atom, in seconds, for any sequence on Earth. The full structural proteome — over 200 million proteins — is now public.
seq: MKTAYIAKQRQISFVKSHFSRQLEERLG… → 3D structure
03
Generative chemistry
Design new molecules that don't exist yet
Diffusion models and graph VAEs trained on tens of millions of known compounds learn the latent space of valid chemistry. Given a binding pocket, they generate novel molecules optimised for affinity, solubility, and synthesisability — searching a space of 10⁶⁰ possible drugs that no human team could enumerate.
generated · sample 0341 / 12,800
Binding affinity · 82%
Solubility · 71%
Synthesisability · 64%
Toxicity · 18%
Each generated candidate is scored on multiple objectives in parallel — the model learns to optimise all of them at once.
04
Automated wet labs · phenomics
Robots run the experiments, vision models read the results
Recursion's labs image millions of human cells per week, perturbed by thousands of compounds and gene knockouts. Self-supervised CNNs convert each image into an embedding — phenotypes that look the same end up in the same neighbourhood, revealing molecules that 'rescue' diseased cells without anyone needing to know the mechanism.
confirmed rescueneighbour in embeddingbaseline / no effect
05
Clinical-trial acceleration
Predict failure before the patient enrolment opens
Models trained on decades of historical trials predict which compounds will fail Phase 2 toxicity, suggest patient-stratification cohorts, and even generate digital twin control arms. Less wasted money, fewer wasted years, less human risk.
MOL-091 · 18% · pass
MOL-092 · 74% · flag
MOL-093 · 92% · fail
MOL-094 · 32% · pass
MOL-095 · 61% · flag
Predicted Phase-2 failure risk · trained on 30+ years of historical trial outcomes. Flagged compounds are re-engineered before a single patient is recruited.
The toolbelt of a 2026 computational scientist
Six platforms doing the heaviest lifting today. Some are open weights you can run on a workstation; some are commercial pipelines worth multi-billion-dollar deals.
AlphaFold 3
structure
Google DeepMind / Isomorphic
Predicts the 3D structure of proteins, DNA, RNA, and ligand complexes from sequence alone. Solved 200M+ structures publicly.
RFdiffusion
design
Baker Lab, U. of Washington
A diffusion model that designs entirely new proteins from scratch — binders, enzymes, scaffolds. Its creator, David Baker, shared the 2024 Nobel Prize in Chemistry for computational protein design.
Boltz-1 / Chai-1
docking
MIT · Chai Discovery
Open-weights successors to AlphaFold for protein–ligand docking. Lab-runnable, no API gatekeeping.
GNoME
materials
Google DeepMind
2.2 million new crystal structures predicted — 380,000 of them stable. An 800-year leap in materials science in one model.
Pharma.AI
pipeline
Insilico Medicine
End-to-end pipeline: target discovery (PandaOmics) + generative chemistry (Chemistry42). First AI-designed drug now in Phase 2.
Recursion OS
phenomics
Recursion Pharmaceuticals
Robotic labs run millions of cell-imaging experiments per week; CNNs cluster phenotypes to find drug candidates by visual similarity.
Companies already shipping
Not research papers — actual molecules in actual humans, or partnerships where Big Pharma is paying real money for AI-designed candidates.
Isomorphic Labs
Alphabet · DeepMind spin-out
$3B+
partnership value
AlphaFold-powered drug design
Founded 2021. Partnerships with Eli Lilly and Novartis worth $3B+ in milestones. Uses AlphaFold 3 to model how candidate molecules interact with disease-causing proteins — collapsing months of crystallography into minutes.
Insilico Medicine
Hong Kong · NYC
30 mo.
discovery → clinic
First end-to-end AI-designed drug in human trials
INS018_055 — a treatment for idiopathic pulmonary fibrosis (IPF) — was discovered, designed, and brought to Phase 1 in under 30 months for ~$3M. Now in Phase 2 trials, the first drug where both the target and the molecule came from AI.
Recursion
Salt Lake City · NASDAQ: RXRX
~50M
experiments / week
Phenomics + automated wet labs
Robotic labs image millions of human cells under thousands of perturbations every week. Self-supervised vision models cluster phenotypes; matches reveal which molecules rescue diseased cells. 10+ programs in or near the clinic.
BenevolentAI
London · LSE: BAI
1B+
graph relations
Knowledge-graph reasoning over biomedical literature
A graph of 1B+ relationships from papers, patents, and clinical data. In 2020 their model proposed baricitinib for COVID-19 within 48 hours; the FDA later approved it. Now applied to ALS, ulcerative colitis, and chronic kidney disease.
Beyond medicine — six other frontiers
The same recipe — train a large model on a domain's data, then let it generate or predict what experiments would have taken decades to find. It works almost everywhere it's been tried.
Weather
GraphCast · Aurora
A graph neural network that beats the European supercomputer model on 10-day forecasts — and runs in under a minute on a single TPU instead of hours on a cluster.
Fusion
DeepMind × EPFL
Reinforcement learning controls the magnetic coils of a tokamak in real time, holding plasma in shapes humans never managed to stabilise — a step toward commercial fusion.
Mathematics
AlphaProof · AlphaGeometry 2
Solved 4 of 6 problems at the 2024 International Math Olympiad — silver-medal performance. Geometry was solved by combining a language model with a symbolic deduction engine.
Astronomy
LIGO · Vera Rubin pipelines
CNNs scan gravitational-wave streams for black-hole mergers in real time, and triage tens of millions of nightly transient detections from new sky surveys.
Climate
NeuralGCM
Hybrid neural / physics climate model from Google. Atmospheric simulations 100,000× cheaper than the legacy spectral solvers used by national weather services.
Robotics
RT-2 · Optimus · Figure
Vision-language-action models map a camera frame and a sentence (“pick up the red mug”) directly to motor torques — a generalist policy instead of bespoke per-task code.
None of these systems are general intelligence. They are narrow, domain-specific function approximators trained on data nobody could sift through manually. The shift is that the bottleneck in science used to be human imagination over a tiny search space; the bottleneck is now wet-lab validation of a search space the models can canvass in an afternoon.
Chapter 11
How "Smart" Gets Measured
Every model release ships with a battery of benchmark scores. They're a microscope, not a mirror — useful for comparing neighbours, dangerous if you confuse them with the territory. Numbers below are illustrative of where the frontier sits in 2026.
MMLU
/ 100
undergraduate-level general knowledge across 57 subjects
Claude Opus 4.x · 89
GPT-5 · 91
Gemini 2.x Pro · 90
Llama 4 · 84
DeepSeek V3 · 86
HumanEval
/ 100
164 Python coding problems with hidden unit tests
Claude Opus 4.x · 95
GPT-5 · 94
Gemini 2.x Pro · 90
Qwen 3-Coder · 92
DeepSeek V3 · 89
GPQA Diamond
/ 100
PhD-level multiple choice in biology, physics, chemistry
o-series reasoning · 78
Claude (extended thinking) · 75
Gemini 2.x Thinking · 73
DeepSeek R1 · 71
Vanilla Claude Sonnet · 58
SWE-Bench Verified
/ 100
real GitHub issues — patch must pass the existing test suite
Claude Opus 4.x · agent · 72
GPT-5 · agent · 68
Gemini 2.x · agent · 60
Qwen 3-Coder · agent · 55
Best 2024 model · 42
A benchmark is a microscope, not a mirror.
The only number that matters is how the model performs on your data and your task. Build a 50-example eval set on day one — every model decision after that gets easier.
What each one measures
MMLU = knowledge. HumanEval = isolated coding. GPQA = reasoning under uncertainty. SWE-Bench = doing real engineering work end to end. None of them measures whether the model is actually useful for your job.
The saturation problem
Top models now sit within a few points of each other on MMLU and HumanEval. The benchmarks have stopped discriminating — newer ones like FrontierMath, ARC-AGI-2, and SWE-Bench Verified are taking their place.
The gap to real work
A model that scores 95 on HumanEval can still fail at fixing your bug. Synthetic tasks reward narrow skill; production code rewards reading 10 files, running tests, and arguing with the linter.
Chapter 12
The 2026 Model Landscape
Eight families dominate production traffic. Half are closed APIs run by their creators; half are open weights you can download, fine-tune, and host yourself. Pick by task, cost, and where the data is allowed to live.
Anthropic
Claude
closed
Opus 4.x · Sonnet 4.x
context
200K – 1M tokens
hosting
cloud
long-context reasoning · coding · tool use · safety
Constitutional AI alignment. Strong at multi-step agent loops.
OpenAI
GPT / o-series
closed
GPT-4.x · o3 · o4
context
128K – 1M tokens
hosting
cloud
general purpose · reasoning · voice · images
Reasoning variants spend inference compute on chain-of-thought.
Google
Gemini
closed
Gemini 2.x Pro / Flash
context
up to 2M tokens
hosting
cloud
multimodal · huge context · native video
Tightly integrated with Google services. Strong on image + video.
Meta
Llama
open weights
Llama 3.x · 4
context
128K tokens
hosting
cloud or local
open weights · fine-tunable · strong base
The de-facto open foundation. Runs on your hardware if you have the RAM.
Mistral
Mistral / Mixtral
open weights
Large 2 · Mixtral 8×22B
context
32K – 128K tokens
hosting
cloud or local
MoE efficiency · multilingual · compact
European, Apache-licensed. Mixture-of-experts gives big-model quality at small-model cost.
xAI
Grok
closed
Grok 3 / 4
context
128K+ tokens
hosting
cloud
real-time data · long reasoning
Tight integration with X. Trained on a very large compute cluster.
DeepSeek
DeepSeek
open weights
V3 · R1
context
128K tokens
hosting
cloud or local
cost · reasoning · open R1 weights
Open reasoning model that rattled the market in early 2025.
Alibaba
Qwen
open weights
Qwen 3 / 3-Coder
context
128K – 1M tokens
hosting
cloud or local
multilingual · coding · small + large variants
Extremely strong open family across many sizes. Great at Asian languages.
Benchmarks shift week to week and rarely match real-world performance on your task. The right answer is usually: pick two candidates from different vendors, build the same eval set on your own data, and let the numbers decide.
Chapter 13
The Economics of Inference
A frontier model costs ~400× as much per token as a local one. The hard part of building with AI is no longer "can it do this?" — it's "which tier should I be paying for, and where?".
price ladder · USD per 1M tokens · illustrative · 2026
Frontier · Opus 4.x · GPT-5 · Gemini 2.x Pro
$15 in / $75 out
Mid · Sonnet 4.x · GPT-5 mini · Gemini Flash
$3 in / $15 out
Cheap · Haiku · GPT-5 nano · open weights
$0.25 in / $1.25 out
Local · Llama 4 / Qwen 3 · Mac Studio / RTX
~$0 · electricity
Output tokens cost 4–5× input tokens because they're generated one at a time, with full GPU memory pressure each step. That's why "answer in JSON, not prose" can quietly halve your bill.
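At these prices, a cost estimate is one multiplication per direction. A back-of-envelope sketch using the illustrative ladder above:

```python
# Illustrative prices from the ladder above: USD per 1M (input, output) tokens.
PRICES = {
    "frontier": (15.00, 75.00),
    "mid":      (3.00, 15.00),
    "cheap":    (0.25, 1.25),
    "local":    (0.00, 0.00),   # electricity excluded
}

def cost_usd(tier: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES[tier]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

# 1.2M tokens in, 200K out: $18 of input + $15 of output on the frontier tier.
print(cost_usd("frontier", 1_200_000, 200_000))  # 33.0
```

Because output is priced 4–5× higher, halving output tokens (terse JSON instead of prose) cuts the expensive term of that sum in half.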
same task · different tier
Summarise 100 PDFs · ~1.2M in / 200K out
Frontier
$33.00
Mid
$6.60
Cheap
$0.55
Local
~$0.05
At the list prices above, the frontier run is $18 of input plus $15 of output. That makes the frontier tier roughly 660× the local tier on paper. Whether it's worth that gap depends entirely on whether you can tell the difference in the output.
scaling laws · the Chinchilla insightlog-log
DeepMind's 2022 result: at any given compute budget, the best model is smaller than people thought, but trained on far more tokens. The race for parameter count was partly a misallocation — and that's why a well-trained 70B can beat an under-trained 500B.
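The Chinchilla result is often quoted as a rule of thumb of roughly 20 training tokens per parameter. A sketch of that approximation; the constant is a heuristic, not an exact law:

```python
def chinchilla_tokens(n_params: float) -> float:
    # Compute-optimal token budget, using the commonly cited ~20:1 heuristic
    # from DeepMind's 2022 Chinchilla paper.
    return 20.0 * n_params

# A 70B-parameter model is compute-optimal at roughly 1.4T tokens.
# Modern open models deliberately train far past this point (e.g. ~15T)
# to buy extra quality per parameter at inference time.
print(f"{chinchilla_tokens(70e9):.2e}")  # 1.40e+12
```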
Pick a tier per task, not per app
Classification, extraction, simple summaries — the cheap tier is enough. Save the frontier for the steps that actually need it: planning, debugging, novel reasoning.
Caching is a 60–90% discount
Provider-side prompt caching reuses the system prompt and long context across calls. For a stable agent loop, the difference is measured in orders of magnitude on the bill.
Local cost ≈ electricity
Once the GPU is bought, running a local 70B model is a few cents per million tokens — but only if you have steady throughput to amortise the hardware.
Chapter 14
Local vs Cloud Models
The single most consequential architectural decision in any AI system: does the model run inside your infrastructure, or do you send every prompt to someone else's GPUs? Both are valid — they trade different things.
Compliance · local: easier (HIPAA, GDPR, on-prem) · cloud: depends on provider DPA
Updates · local: you re-pull weights · cloud: automatic
Multimodal · local: limited (image/audio LLMs growing) · cloud: native voice + video
Scaling · local: add GPUs → linear · cloud: elastic, instant
The hybrid answer
Most production systems we build at Vorcl are hybrid: frontier closed models do the hardest reasoning, a fine-tuned local model handles the high-volume, sensitive, or repetitive work, and a router decides per request. Privacy and cost stay bounded; quality stays at the ceiling where it matters.
Reference
Twenty Words That Cover the Field
A short, pin-this-to-the-fridge glossary. If you remember these, you can read almost any AI paper, blog post, or release note without getting lost.
token
01
A chunk of text the model sees as one unit. ~4 chars on average. ~50K of them in the vocabulary.
parameter
02
A learned number inside the network. Frontier models have 100B – 2T of them.
context window
03
How much the model can read at once, measured in tokens. 200K is comfortable; 1M is the new ceiling.
embedding
04
A list of numbers that represents meaning. Words, images, and audio can all become embeddings.
attention
05
The Transformer trick that lets every token decide how much each other token matters.
transformer
06
The neural net architecture, introduced in 2017, that almost every modern LLM is built on.
fine-tuning
07
Continuing training on a smaller, task-specific dataset to nudge a base model toward a behaviour.
RLHF
08
Reinforcement Learning from Human Feedback. Humans rank outputs; the model learns to prefer the winners.
temperature
09
A sampling knob. 0 = always pick the most likely token. Higher = more creative, more chaotic.
top-p
10
Nucleus sampling. Keep only tokens whose cumulative probability sums to p, then pick from those.
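The nucleus filter is a few lines. A sketch reusing the probabilities from the sampling example in Chapter 3:

```python
def top_p_filter(probs: dict[str, float], p: float = 0.9) -> dict[str, float]:
    # Keep the highest-probability tokens until their cumulative mass
    # reaches p, then renormalise what's left.
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept.items()}

probs = {"mat": 0.42, "floor": 0.21, "chair": 0.13, "roof": 0.04}
print(top_p_filter(probs, p=0.6))  # only "mat" and "floor" survive
```

In practice the filter runs on the full vocabulary each step, and sampling then happens inside the surviving nucleus.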
hallucination
11
When a model confidently states something untrue. A side effect of being trained to sound right, not be right.
RAG
12
Retrieval-Augmented Generation. Look up relevant docs first, paste them into the prompt, then answer.
agent
13
An LLM in a loop that can call tools, observe their output, and decide what to do next.
tool use
14
The mechanism that lets a model emit a JSON function call instead of free-form text.
MCP
15
Model Context Protocol. A standard for letting any model connect to any tool or data source.
reasoning model
16
A model trained to generate a long internal chain of thought before its final answer.
multimodal
17
Accepts more than one kind of input — text, images, audio, video — and reasons across them.
quantization
18
Compressing model weights from 16-bit floats to 8 or 4 bits. Smaller, faster, slightly dumber.
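A sketch of the int8 round-trip, using simple symmetric per-tensor scaling; real schemes quantize per block and keep the scales in higher precision:

```python
def quantize(weights: list[float]) -> tuple[list[int], float]:
    # Map floats in [-max|w|, max|w|] onto integers in [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

w = [0.12, -0.5, 0.033, 0.25]
q, s = quantize(w)
approx = dequantize(q, s)
# The round-trip is close but not exact: that rounding error is the
# "slightly dumber" part of the trade.
print(max(abs(a - b) for a, b in zip(w, approx)))
```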
MoE
19
Mixture of Experts. Only a subset of parameters fire per token, so a 400B model runs like a 40B.
distillation
20
Training a small model to imitate a big one. Cheap inference, most of the quality.
End of lesson
Now Put It to Work.
You've seen how the model works. The hard part is choosing the right one for your data, wiring it into your stack, and making it safe to put in front of customers. That's the job we do.