The Channel Between Them

The von Neumann bottleneck, and why AI is arriving at it from the other direction.
June 2026

For most of the last decade, the recipe for a better model was a bigger one: more parameters, more data, more compute, lower loss. It worked well enough to become the whole strategy, and it carried an assumption nobody had to defend. That a model should hold everything it knows inside itself, reasoning and facts together, in one set of weights.

We made that choice once before, in hardware, and it took fifty years to work back out of it.

In 1945 John von Neumann circulated a First Draft of a Report on the EDVAC. Its idea, the stored program, was to keep a machine's instructions and its data in the same memory, so you could change what the machine did by loading new words instead of rewiring it by hand. Every computer since has inherited the arrangement underneath it: a part that computes, a part that holds what it works on, and a single channel that carries data between them.

John von Neumann, First Draft of a Report on the EDVAC, June 30, 1945 — Samuel N. Alexander's copy, the basis for the NBS SEAC. Internet Archive scan via Wikimedia Commons (public domain).
The machine the report described: the EDVAC in Building 328 at the U.S. Army's Ballistic Research Laboratory, Aberdeen Proving Ground, delivered 1949. U.S. Army photo, via Wikimedia Commons (public domain).

That channel is the catch. Follow one instruction through. The processor puts an address on the channel and waits for the instruction to come back. It reads the instruction, finds it needs a number, puts another address on the same channel, and waits again. It computes, then sends the result back across the channel. Instructions and data travel the same channel, one word at a time, and the processor can do nothing until the word it asked for arrives.

[Figure: The von Neumann machine, 1945. A fast processor and a single memory holding both the program and its data, joined by one bus (address out, one word back). Every cycle crosses the bus: ① fetch the instruction → ② fetch its data → ③ write the result. The processor idles whenever the bus is busy. As processors outran memory, that wait, not the arithmetic, set the speed of the machine. Backus named it the von Neumann bottleneck in 1977.]
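The three-trip cycle can be made concrete with a toy machine. This is a minimal sketch, not a model of the EDVAC: the `Bus` class, the two-instruction program, and the `INC`/`DBL` opcodes are all invented for illustration. It only counts how often a single shared channel is crossed.

```python
# Toy single-bus machine: instructions and data share one memory and one channel.
# The instruction format and the tiny program are hypothetical, for illustration only.

class Bus:
    """The single channel: every read or write is one trip, one word at a time."""

    def __init__(self, memory):
        self.memory = memory
        self.trips = 0

    def read(self, address):
        self.trips += 1
        return self.memory[address]

    def write(self, address, value):
        self.trips += 1
        self.memory[address] = value


def run(memory, program_length):
    """Execute the program stored at addresses 0..program_length-1; return bus trips."""
    bus = Bus(memory)
    for pc in range(program_length):
        op, src, dst = bus.read(pc)      # 1. fetch the instruction
        operand = bus.read(src)          # 2. fetch its data
        result = operand + 1 if op == "INC" else operand * 2
        bus.write(dst, result)           # 3. write the result
    return bus.trips


# Addresses 0-1 hold the program, addresses 10-12 hold data, all in one store.
memory = {0: ("INC", 10, 11), 1: ("DBL", 11, 12), 10: 20, 11: 0, 12: 0}
trips = run(memory, program_length=2)
print(trips, memory[12])  # prints: 6 42
```

Two instructions cost six trips, and the arithmetic itself never appears in the count. That is the whole complaint in miniature.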

This was fine while processors and memory ran at similar speeds. They did not stay that way. Through the 1990s processors got faster by roughly fifty percent a year while the time to reach memory improved by about seven percent, so a fast chip spent more and more of its life waiting; engineers called the widening gap the memory wall. John Backus had already named the deeper problem in his 1977 Turing Award lecture: the von Neumann bottleneck, the channel that throttles the whole machine and forces the world through it one word at a time. The answer was never a faster adder. It was a hierarchy of memory built to hide the distance to the data.
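Those two rates compound quickly. As a rough sketch, treat both figures as annual speed multipliers (a simplifying assumption, as is the ten-year horizon) and see how far the processor pulls ahead:

```python
# Compound the two growth rates quoted above: roughly 50% a year for processors,
# roughly 7% a year for memory access. Treating both as simple annual speed
# multipliers, and the choice of horizons, are assumptions made for illustration.

def relative_gap(cpu_growth, memory_growth, years):
    """How many times further ahead the processor is after `years`."""
    return ((1 + cpu_growth) / (1 + memory_growth)) ** years


for years in (1, 5, 10):
    print(years, round(relative_gap(0.50, 0.07, years), 1))
```

At these rates the gap grows about 40 percent a year, which works out to more than a twentyfold mismatch over a decade. A faster adder does nothing for a number like that.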

I think AI is now meeting the same wall from the other side.

Where we actually are

It is tempting to say AI is still chasing scale, but that stopped being true a while ago. The single dial broke into several, and most of them have already been turned. Reinforcement learning on problems with checkable answers taught models to reason in long chains and to use tools, which is what turns a chatbot into an agent that can run on its own for minutes or hours.[1] Test-time compute lets a model spend more thought on a hard problem instead of answering in one pass. KV caches and longer context windows made those long runs affordable, by holding the work so far in fast memory rather than recomputing it from scratch. By 2026 the frontier is not a single number. It is an allocation problem: for a given task and budget, how much to spend on pre-training, on reinforcement learning, on thinking longer, and on the one lever still mostly untouched: what the model can read that it never learned.

[Figure: Four levers, not one.
Pre-training buys broad priors. Limited by cost: slow and capital-heavy, it moves about once a generation.
Reinforcement learning buys checkable skills (math, code, agent tasks). Limited by verifiable answers: it stalls where correctness is hard to verify.
Test-time compute buys depth by thinking longer at inference. Limited by latency: no help when the answer has to come back now.
Retrieval buys fresh knowledge from an external store. The under-pulled lever, limited mainly by production infrastructure (latency and cost).
The frontier question is allocation across the four, per task and per cost, not the size of any one.]

The lever still on the table

That last lever is retrieval, and it behaves unlike the others. Pre-training is slow and expensive and moves about once a generation. Reinforcement learning and longer thinking are fast, and they own the next year of visible gains. Retrieval is the one you can change continuously: refresh the store while the model sleeps and its knowledge updates without touching a single weight.

It also scales differently. Adding training compute follows a power law (each tenfold increase buys about the same proportional drop in loss, at ten times the cost), and the supply of fresh, high-quality text it runs on is starting to run out.[2][3] Growing the store a model reads from is cheaper, and in the one careful study of it the returns had not run out: as the datastore grew toward a trillion tokens, loss kept falling with no saturation in the range they could test, and a small model reading from a large store beat a much larger one that had to remember everything.[4] That result comes with an asterisk worth saying out loud: it is measured on knowledge-heavy question answering, the work retrieval most naturally helps, not on hard multi-step reasoning. The headline is real; its scope is narrower than it sounds.
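The power-law claim is worth a line of arithmetic. Under a pure power law, each tenfold step in compute removes the same fraction of whatever loss remains, so the absolute gain shrinks while the bill grows tenfold. The sketch below assumes an exponent of 0.05, roughly the size of the compute exponent in the scaling-law literature; both it and the reference constant `c_ref` are illustrative, not fitted values.

```python
# Power-law loss curve: loss = (c_ref / compute) ** alpha.
# alpha = 0.05 is an illustrative exponent and c_ref an arbitrary reference
# constant; both are assumptions, not values fitted to any particular model.

def loss(compute, alpha=0.05, c_ref=1.0):
    return (c_ref / compute) ** alpha


for exponent in range(4):
    compute = 10.0 ** exponent
    ratio = loss(compute * 10) / loss(compute)
    print(f"x10 compute from 1e{exponent}: loss falls to {ratio:.3f} of its value")
```

Every row prints the same ratio, about 0.891: an eleven percent cut in loss for ten times the spend, at every scale.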

That changes what knowledge is. Baked into weights, it is a stock, fixed the day training ends. Held in a store you can read, it is a flow. One is a capital expense that starts depreciating the moment the run finishes; the other is an operating cost that stays current.

And the store is not one thing. It can be the files on a laptop, a company's private corpus, a model's own running memory, or the open web — the largest of them, humanity's continually updating record. The reasoning-versus-memory split holds whichever it is, which is what makes it more than a pitch for any one index. The kind of store does matter when you lean on a result, though: the trillion-token study above used a fixed, curated corpus, and "no saturation yet" there is no promise that pointing a model at the open, adversarial web makes it smarter. A bounded library and the live internet are different animals.

The same problem, older names

None of this is new. The gap between a fast part and a large part, and the work of managing the channel between them, is most of computing history under other names. Each version already holds the answer.

The memory wall, again. The cure was the cache: keep the most-used words beside the processor so it rarely waits for the slow trip to main memory. In an LLM the context window is that cache, text pulled in close so the model does not have to reach further for it, and the KV cache is a literal one. When the right thing is not in the cache a processor stalls; a model, missing the fact, guesses. Not every hallucination is a cache miss — a model also misstates things it does hold, and a retriever can serve something stale or contradictory — but the missing-fact kind is the one a store is built to fix.

Virtual memory. The Atlas machine in 1962 ran programs too large for its memory by paging pieces in from a drum as they were needed and keeping only the active set close. It worked because a program uses a small working set at a time; page in too much and the machine thrashes. Retrieval is paging, and the lesson carries over intact: fetch the few passages that matter. Stuffing the context with everything within reach makes the model slower and worse, not better, because attention cost grows with the square of what you put in, and the few useful tokens get lost among the rest.

[Figure (schematic): Stuff the context, or fetch what you need. Axes: answer quality against tokens loaded into context (log scale). Dump everything in: quality rises, then falls, because attention cost grows with the square of length (cost ∝ n²) and the few useful tokens get buried (the “lost in the middle” effect). Retrieve the right few passages: quality climbs and holds, because the signal stays dense. An inference-time choice, not a verdict on pre-training.]
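The paging analogy can be sketched in a few lines. Everything here is a stand-in: the word-overlap `score` function substitutes for a real retriever, whitespace-split words substitute for tokens, and the four-passage corpus is made up. The point is only the shape of the trade: fetch a small working set, and the quadratic term shrinks with it.

```python
# Toy comparison of "stuff everything" versus "fetch the few passages that matter".
# The word-overlap scorer, the corpus, and counting words as tokens are all
# hypothetical simplifications standing in for a real retriever and tokenizer.

def score(query, passage):
    """Number of distinct query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query, corpus, k):
    """Return the k highest-scoring passages: the working set to page in."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]


def attention_cost(passages):
    """Relative self-attention cost: the square of the total token count."""
    tokens = sum(len(p.split()) for p in passages)
    return tokens ** 2


corpus = [
    "the atlas machine paged memory in from a drum",
    "system r chose query plans automatically",
    "flashattention reduced memory reads and writes",
    "caches keep hot words near the processor",
]
query = "how did atlas handle paged memory"

top = retrieve(query, corpus, k=1)
print(top[0])
print(attention_cost(corpus), attention_cost(top))  # prints: 784 81
```

Loading all four passages costs nearly ten times what the one relevant passage does, and three of the four add nothing but noise for the model to read past.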

The query optimizer. By the late 1970s data had outgrown the ability to scan it, and IBM's System R learned to choose the cheapest path to an answer on its own; you said what you wanted, not how to fetch it. This is the part AI has barely built. Something has to decide whether to answer from the weights or go to the store, which store, how many passages to pull, and when to stop. That decision is a query planner for knowledge, and today it is crude, a fixed number of nearest neighbors by similarity. History is blunt about what comes next: the system that plans the fetch well beats the one with the bigger store.
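To make the planner's decisions concrete, here is a deliberately crude hand-written version. Every name in it is hypothetical: the confidence signal, the thresholds, and the `private_corpus` and `web_index` stores are invented for illustration, and a real planner would presumably be learned rather than coded as rules. It shows only the four decisions named above: weights or store, which store, how many passages, and when to stop.

```python
# A crude "query planner for knowledge". The confidence signal, thresholds, and
# store names are hypothetical; this sketches the decisions, not a real system.
from dataclasses import dataclass


@dataclass
class Plan:
    source: str        # "weights", or the name of a store
    max_passages: int  # upper bound on how many passages to pull


def make_plan(model_confidence, needs_fresh_data, is_private):
    """Choose where an answer should come from and how much to fetch."""
    if model_confidence >= 0.9 and not needs_fresh_data:
        return Plan(source="weights", max_passages=0)  # no fetch at all
    store = "private_corpus" if is_private else "web_index"
    # Lower confidence buys a larger fetch budget, capped so the context is not stuffed.
    budget = min(8, 2 + round((1 - model_confidence) * 10))
    return Plan(source=store, max_passages=budget)


def fetch_until_enough(ranked_passages, plan, min_relevance=0.5):
    """Walk passages in rank order and stop early once relevance drops off."""
    kept = []
    for text, relevance in ranked_passages[: plan.max_passages]:
        if relevance < min_relevance:
            break
        kept.append(text)
    return kept


ranked = [("a", 0.9), ("b", 0.7), ("c", 0.2), ("d", 0.8)]
plan = make_plan(model_confidence=0.5, needs_fresh_data=True, is_private=True)
print(plan, fetch_until_enough(ranked, plan))
```

Set against this, a fixed number of nearest neighbors is a planner with one branch and no stopping rule, which is roughly where most systems sit today.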

The bandwidth wall, now. A modern GPU can do far more arithmetic than it can feed itself; attention is limited by memory traffic rather than math, which is why FlashAttention won in 2022 by cutting reads and writes instead of operations. One more trick about the channel.

[Figure: The same problem, older names: the fix is always in the channel.
Era | The fix, back then | The same fix, in AI
Memory wall, 1990s | SRAM caches keep hot words an inch from the processor | The context window is that cache; a miss can read as a hallucination
Virtual memory, Atlas, 1962 | Page in the working set; page in too much and it thrashes | RAG pages passages into context; a stuffed prompt thrashes too
Query planner, System R, 1979 | The database picks the fetch plan; you say what, not how | Answer from weights, or go fetch? The planner AI hasn't built yet
Bandwidth wall, now | FlashAttention: fewer memory reads, not less math | The GPU starves between fetches; the trip, not the math, is the cost
Six decades, one lesson: the engineering goes into the channel, not the parts it connects.]

That last one is worth pinning down, because it is easy to charge retrieval to the wrong account. Inference is already bandwidth-bound at the level closest to the chip: moving the weights and the cache in and out of fast memory dominates each step, and a reasoning model multiplies that traffic by thinking for many steps. Retrieval lives on a different, slower channel (a network or disk trip, milliseconds rather than nanoseconds), so a fetch on the path of every step does not crowd that inner budget; it adds an outer bottleneck on top of it. These are two channels at two scales, and what carries across them is the lesson, not the budget. Still, the outer trip is real cost, so the cheap version wins for now: pasting fetched text into the prompt, ordinary retrieval-augmented generation, works well enough that the deeper designs, proven in research years ago,[5] keep losing on latency and stay unshipped.

There is a twist, though, and it is the same trick that beat the memory wall. A processor hides a slow fetch by guessing what it will need and overlapping the trip with work it can already do. An agent can do exactly that: predict the next lookup, fetch while it reasons, reuse the result across many steps with prompt caching making the re-reads cheap. The latency objection bites hardest on a single quick answer and nearly vanishes on a long agent run — which means deep retrieval looks worst precisely where the value is lowest, and best where it is highest.
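The overlap is easy to demonstrate. In this sketch the `fetch` and `reason` coroutines are hypothetical stand-ins that only sleep, and the 0.1-second delays are arbitrary. What matters is the difference between waiting for the fetch before thinking and starting the fetch, then thinking while it is in flight.

```python
# Overlap a slow fetch with reasoning, the way a processor prefetches.
# The delays, the lookup key, and both coroutines are hypothetical stand-ins.
import asyncio
import time


async def fetch(key, delay=0.1):
    await asyncio.sleep(delay)  # stands in for a network or disk trip
    return f"passage for {key}"


async def reason(step, delay=0.1):
    await asyncio.sleep(delay)  # stands in for model compute
    return f"thought {step}"


async def serial():
    passage = await fetch("next lookup")  # wait for the trip...
    thought = await reason(1)             # ...then think
    return passage, thought


async def overlapped():
    pending = asyncio.create_task(fetch("next lookup"))  # predict and start the fetch
    thought = await reason(1)                            # keep working meanwhile
    passage = await pending                              # usually ready by now
    return passage, thought


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


serial_result, serial_time = timed(serial)
overlap_result, overlap_time = timed(overlapped)
print(f"serial {serial_time:.2f}s, overlapped {overlap_time:.2f}s")
```

Both versions return the same passage and the same thought; the overlapped one finishes in roughly the time of the longer of the two waits rather than their sum. Over a long agent run with many lookups, that difference is most of the latency objection.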

Knowledge leaves the weights

Andrej Karpathy has a clean way to see what we have built. The LLM, he argues, is becoming the CPU of a new kind of computer: the model is the processor, the context window is its RAM, tools are its peripherals. In that picture retrieval is the disk, and today the disk is barely there. Almost everything the model knows is held inside the processor, in the weights, the way a 1945 machine held its program and data fused in a single store. The disk it does have, ordinary retrieval, is a thin bolt-on. We pulled processor and memory apart in hardware decades ago and never looked back; we are only now, and only partway, doing the same for models.

The cost of leaving knowledge in the weights is the cost of memorizing a phone number instead of looking it up. The memorized number is instant and yours, until the day it changes and you dial a stranger without a flicker of doubt. Weights are the memorized number: fixed at training, blurring the rare case into the average, unable to say where a fact came from. A store is the lookup, half a second slower, current, and able to show its source.

There are two ways to put knowledge and computation in the same place: bring the facts to the model, or bring the model to the facts. Training does the second, once, at great cost, and freezes the result; retrieval does the first, continuously, and stays fresh. But the split worth making is not small model versus large store. The model may keep getting larger, and more reasoning may keep emerging from scale; betting against that is a good way to be wrong. The split is between two jobs one set of weights is forced to do at once (reasoning, and serving as the place facts are kept), and the waste is in conflating them. You can run the model for what it is good at, thinking, and push the high-churn factual layer out to a store you can refresh and audit. And you can route: match the size and cost of the model to the difficulty of the task, instead of paying frontier prices to recall a date a lookup would serve for a fraction of a cent. The payoff is already measurable. Give a model a dedicated memory component and a 1.3-billion-parameter version can approach a 7-billion-parameter one trained on roughly ten times the compute, on factual recall; most of that gap was the larger model spending parameters to memorize.[4] How much of what a model knows should live outside it is the open question. That it is doing two jobs with one organ is not.

[Figure: The LLM, as a computer (after Karpathy). Today: the weights act as a CPU with nearly all the model's knowledge crammed inside; the disk it has is a thin bolt-on. The design we already use everywhere else: a reasoning core (the weights as CPU, any size) runs the reasoning; the context window is RAM, the hot working set the model can see right now, small and expensive, with cost growing with the square of what you load; the knowledge store is the disk, large, fresh, and read on demand, today only thinly used. The disk is the part still barely built. Retrieval is the controller between them.]

Set the two machines side by side. The processor is the part that reasons. The memory is the store of facts. The channel between them is retrieval. And the piece that decides what to fetch and when — which store, how much, when to stop, when to trust the weights instead — is what a computer calls a memory controller and a database calls a query planner: the part an AI system has barely named, though it is already doing the job. However large the reasoning side grows, a model that reads from a fresh, auditable store answers from current, sourced facts where one recalling everything from inside answers from a frozen, averaged copy. The value is not in either end. It is in the connection, and the thing that drives it.

That deciding layer is also why this round may not just rhyme with the last. When to retrieve, from where, how much, when to stop — that is a control problem with checkable outcomes, the regime reinforcement learning has lately begun to crack, so the training signal for it exists at least in principle. But it is a hard signal. Crediting a final answer back to one fetch buried in a long chain is the nasty end of credit assignment, and it collides with a finding of the field's own: that outcome-only rewards tend to generalize better than rewards shaped step by step — exactly the regime where scoring a single mid-trajectory decision is hardest. So the planner is learnable in principle, and building the reward that teaches it well is the live frontier, not a part left unbuilt out of inertia.

This is a bet, and it can lose, so here is the world where it does. The strongest case against pulling knowledge out of the model is the bitter lesson: general methods that ride raw compute tend to beat hand-built structure, and a retriever, with its ranking and gates and policy, is structure. Two moves would settle it against the split. Context windows could get cheap enough that reading the relevant slice of the world on every pass is just a feature of the model. Or memory layers could fold the store back inside the weights as a trained-in lookup at no extra compute, and the disk moves inside the case, value staying where the labs already sit. I put the disaggregation bet at maybe sixty-forty, not at destiny. For it: a retriever aimed at a live, changing corpus draws on far more than any frozen weights can hold, and freshness is not something you can pre-train. Against it: the bitter lesson has swallowed cleaner abstractions than this one.

Step back far enough and this looks less like a one-way march to disaggregation than like an oscillation. Computing keeps swinging between integrating its parts and pulling them apart, each swing chasing wherever the bottleneck sat — monoliths to microservices and part of the way back, separate memory to caches folded onto the die. Memory layers are not a refutation of the argument so much as its integrating swing. The durable claim is the smaller one: value collects at whatever the binding constraint is, and the constraint is migrating from the parts to the channel between them.

And one argument survives even if the efficiency bet loses outright. Suppose memory layers win, context gets cheap, and every margin for keeping knowledge outside the model disappears. A system anyone has to trust still has to say where a claim came from — which document, which date, which source — and a pile of weights cannot do that at any size. Provenance is not a capability you scale your way into; it is a property of keeping knowledge as something you can point at. The store may lose on speed and cost and still be the only place an answer can carry its receipts.

Von Neumann's lesson was not that logic and knowledge belong together. It was that once you pull them apart, the connection between them sets how fast the whole machine can run — and the rest of the system arranges itself around it. The first time we drew that line it shaped fifty years of computing: the memory hierarchy, the tower of abstractions, the whole habit of building to hide the distance to the data. How we program, what we optimize, where the hard engineering goes — all of it organized around one channel. For ten years we built models as if their version of that connection were free. It never was.

Value collects at the constraint, and right now the constraint is the connection — along with whatever learns to decide what crosses it. And a connection does more than move facts. Andy Clark argued that the mind reaches past the skull into the tools it thinks with: the notebook, the search bar, now the store a model reads from. What a system can reach for becomes part of what it knows, and then part of what it thinks. Which makes the deciding layer a good deal more than an efficiency. A flow you can control is a flow you can gate, meter, and shape, and whoever runs the planner draws the edges of what every system downstream of it can know — and, in time, what it can even think to ask. That is power, not plumbing. So the open question is not only whether the layer lives out where anyone can build on it or folds quietly back into the weights, but who gets to decide what crosses it. We are early enough that it is still ours to settle — which is the thing worth settling on purpose.

[1] The clearest demonstration that inference-time compute can substitute for parameters is Snell et al., Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters (2024), the line of work behind reasoning models that spend more compute on harder problems.

[2] The compute scaling law was set out in Kaplan et al., Scaling Laws for Neural Language Models (2020), and refined by the Chinchilla work, Hoffmann et al., Training Compute-Optimal Large Language Models (2022). The bend is now visible: Epoch AI estimates GPT-5 used less total training compute than GPT-4.5, the first generational decrease, attributing it to a shift toward post-training while expecting GPT-6 to resume scaling. A bend, not yet a break.

[3] Epoch AI puts the effective stock of quality-adjusted human-generated public text at roughly 300 trillion tokens, and projects that frontier models will fully use it sometime between about 2026 and 2032 (Villalobos et al., Will We Run Out of Data?, 2024). Synthetic and multimodal sources push the wall out but do not remove it, which is part of why freshly sourced external knowledge gains value as the training-token supply tightens.

[4] Shao et al., Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (2024), built a 1.4-trillion-token datastore (MassiveDS) and found that datastore size "monotonically improves" performance "without obvious saturation," so that a smaller model with a large enough store outperforms a larger model alone on knowledge-intensive tasks, at lower training cost. Its internalized cousin, Berges et al., Memory Layers at Scale (2024), gives a dedicated lookup component its own parameters and reports beating dense models trained with more than twice the compute. Both point the same way: knowledge wants its own container, whether outside the model or as a separate layer inside it.

[5] Retrieval-augmented models were shown to work well before the current wave: Khandelwal et al., kNN-LM (2019), and Borgeaud et al., RETRO (2021), which improved a model by retrieving from a database of trillions of tokens. The idea is settled; shipping it at frontier latency is not.