Understanding Is Compression That Transfers

June 2026

In my earlier essay, The Compression Paradox, I argued that human progress depends on a strange loop.

We compress the world into words, symbols, theories, models, stories, and tools. That compression lets us think and communicate. It also loses something. The loss creates gaps. The gaps create new intuitions. Those intuitions get compressed again into better language, better models, and better tools.

That is the paradox. Compression makes understanding possible. Compression also guarantees that understanding is incomplete.

I now think there is a simpler version of the same idea.

The simpler idea is this: understanding is compression that preserves what matters.

More precisely: understanding is compression that transfers.

Reality is too large to store directly. Every intelligent system survives by compressing it into useful models. A mind, a society, a market, and an AI system are all doing versions of the same thing: turning overwhelming detail into compact predictions that can guide action.

Learning means finding the short rule behind many examples. Memorization means storing exceptions. Understanding means replacing a large table of cases with a smaller generative model. Creativity means recompressing known material into a new representation that explains or produces more with less.

This is true in people. It is true in science. It is true in software. It may also be true in AI.

The hierarchy is simple. Compression is universal. Useful compression is task-relative. Understanding is compression that transfers to the changed example. Memorization is compression that helps only on the seen case. AI gives us a way to measure the boundary. The work ahead is deciding what belongs in the model, what belongs in memory, and what belongs in tools and institutions.

But the important part is not compression as a slogan. The important part is the constraint. What gets preserved? What gets thrown away? What is treated as signal? What is mistaken for noise?

A child does not learn language by memorizing every sentence she will ever hear. She learns rules, patterns, and abstractions that let her produce sentences she has never encountered. A physicist does not understand falling objects by keeping a catalog of every fall. She writes down a law. A good programmer does not solve a class of problems by writing every answer. He writes a function.

The same distinction is now becoming central in machine learning. A model can fit data in two ways. It can find the pattern that transfers. Or it can memorize the surface details that do not.

Both reduce training error. Only one is understanding.
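A toy version makes the difference visible. The sketch below is only an illustration, not an experiment: the target rule (the remainder of x divided by 3), the split into seen and unseen inputs, and the names `table_model` and `rule_model` are all hypothetical choices made for the example.

```python
# Hypothetical toy task: the label is x mod 3.
def target(x):
    return x % 3

seen = list(range(0, 30))      # inputs available during training
unseen = list(range(30, 60))   # inputs that only appear afterwards

# Learner 1: store every seen case. For unseen inputs it has nothing
# better than a fixed guess (0 here, an arbitrary assumption).
table = {x: target(x) for x in seen}

def table_model(x):
    return table.get(x, 0)

# Learner 2: carry the short rule instead of the cases.
def rule_model(x):
    return x % 3

def accuracy(model, xs):
    # Fraction of inputs where the model matches the target.
    return sum(model(x) == target(x) for x in xs) / len(xs)

table_seen = accuracy(table_model, seen)
rule_seen = accuracy(rule_model, seen)
table_unseen = accuracy(table_model, unseen)
rule_unseen = accuracy(rule_model, unseen)

print("seen:  ", table_seen, rule_seen)
print("unseen:", table_unseen, rule_unseen)
```

Both learners are perfect on the seen inputs. Only the rule stays perfect on the unseen ones; the table falls back to its fixed guess and is right only when that guess happens to match.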

This matters because modern AI has made the old philosophical question measurable. If prediction is compression, then a better predictor is a better compressor of the stream it is modeling.¹ This is not a metaphor. It is an information-theoretic identity. But the interesting question is not whether prediction and compression are related. They are. The interesting question is what kind of compression the system has learned.

Has it found the rule, or has it stored the table?
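The identity between prediction and compression can be checked by hand. Under an ideal arithmetic coder, the code length for a stream is the sum of -log2 of the probability the predictor assigned to each symbol that actually occurred, up to a small constant overhead. The sketch below assumes that idealized coder rather than implementing one, and the stream and the two predictors are hypothetical.

```python
import math

# Hypothetical stream: 1000 bits, 90 percent of them ones.
stream = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0] * 100

def code_length_bits(prob_of_one, bits):
    # Ideal code length: sum of -log2 p(symbol) over the stream.
    # This is the predictor's total log loss measured in bits.
    total = 0.0
    for b in bits:
        p = prob_of_one if b == 1 else 1.0 - prob_of_one
        total += -math.log2(p)
    return total

# Predictor A knows nothing: every bit is a fair coin.
uniform_bits = code_length_bits(0.5, stream)

# Predictor B has learned the bias of the stream.
biased_bits = code_length_bits(0.9, stream)

print(uniform_bits)  # 1000.0: no compression at all
print(biased_bits)   # roughly 469 bits for the same stream
```

The better predictor produces the shorter code, by exactly the amount its log loss improved. A predictor that noticed the repeating pattern in this particular stream would do better still, which is the point: what the predictor has learned decides how short the code can get.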

This question keeps pulling me in because it is simple and still seems under-examined. One way to test it is to create tiny controlled worlds where the true rule is known. Then we add random noise to some labels. This gives the data two kinds of information.

There is pattern: the rule that works on new examples.

There is residue: arbitrary detail that only helps on the exact training examples where it appeared.

Because we inject the noise ourselves, we know the entropy floor. No model can, in expectation, predict fresh random noise better than that floor. But a model can beat the floor on the training set if it has memorized the particular noise it saw. So below-floor training fit becomes a certificate. It says: the model is storing arbitrary detail.

That gives us a clean experimental handle on a vague word. Memorization becomes bits.
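Here is a minimal sketch of that setup, under assumptions I have picked only for illustration: a parity rule, a 20 percent label-flip rate, a lookup table that is 99 percent confident in whatever label it saw, and a fixed random seed. None of these values come from a real experiment.

```python
import math
import random

def binary_entropy(p):
    # Entropy in bits of a coin that lands heads with probability p.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def true_rule(x):
    # Hypothetical known rule: the label is 1 when x is even.
    return int(x % 2 == 0)

def sample_labels(xs, flip_prob, rng):
    # Apply the rule, then flip each label independently.
    return [true_rule(x) ^ int(rng.random() < flip_prob) for x in xs]

def mean_log_loss(predict, xs, ys):
    # Average bits per label; predict(x) returns P(y = 1 | x).
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = predict(x)
        total += -math.log2(p1 if y == 1 else 1.0 - p1)
    return total / len(xs)

FLIP = 0.2
rng = random.Random(0)
xs = list(range(2000))
train_ys = sample_labels(xs, FLIP, rng)
fresh_ys = sample_labels(xs, FLIP, rng)  # same inputs, new noise

floor = binary_entropy(FLIP)  # bits per label nobody can beat on fresh noise

def rule_model(x):
    # Knows the rule and the noise rate, nothing else.
    return 1 - FLIP if true_rule(x) == 1 else FLIP

table = dict(zip(xs, train_ys))

def table_model(x):
    # Stores the exact label it saw, noise included.
    return 0.99 if table[x] == 1 else 0.01

rule_train = mean_log_loss(rule_model, xs, train_ys)
rule_fresh = mean_log_loss(rule_model, xs, fresh_ys)
table_train = mean_log_loss(table_model, xs, train_ys)
table_fresh = mean_log_loss(table_model, xs, fresh_ys)

# Below-floor training fit, converted to a rough count of stored noise bits.
stored_bits = len(xs) * (floor - table_train)

print(f"floor       {floor:.3f} bits/label")
print(f"rule  train {rule_train:.3f}  fresh {rule_fresh:.3f}")
print(f"table train {table_train:.3f}  fresh {table_fresh:.3f}")
print(f"stored residue, roughly {stored_bits:.0f} bits")
```

The rule model sits near the floor on both label sets. The table dips far below the floor on the labels it saw and pays for it on the fresh ones. The gap between the floor and the training loss, multiplied by the number of examples, is one rough way to put a number on the stored residue.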

The bigger bet is about where memorization turns on. Small models often cannot afford to store everything. They may be forced to find the rule. Larger models can learn the rule and also store the residue. Modern neural networks are already known to be capable of fitting random labels, which is exactly why the line between generalization and memorization matters.² The hypothesis is that there is a critical capacity where arbitrary fit begins, and that this capacity increases with the description length of the task.

There is an important caveat. Memorization is not always failure. Bad memorization preserves arbitrary residue. Good memorization preserves rare-but-real structure. In long-tailed natural data, the rare case can be the signal; preserving it can improve generalization rather than weaken it.³ The distinction is not memory versus no memory. It is whether the stored detail helps outside the exact case where it was seen.

In plain English: harder rules require bigger models before memorization becomes cheap.

If this is true, it reframes part of what we call emergence. Some sudden changes in model behavior may not be magic. They may be threshold crossings in a storage budget. The model stops carrying a giant lookup table and starts carrying the rule, or it crosses the opposite threshold and starts carrying useless details as if they were structure.

This does not prove that compression is intelligence. That sentence is too broad to be useful.

The better claim is narrower and stronger: intelligence depends on compressions that preserve transfer-relevant structure. Bad compression throws away the signal. Bad memorization preserves arbitrary residue. Understanding is the compression that still works when the example changes.

The broader claim should follow from this, not compete with it. The same hierarchy appears at larger scales. That is why I think it deserves broader attention.

A mind is a compression engine. It turns overwhelming sensory data into a usable world model. Consciousness may be the live interface of that compression: a compact representation of the world, the body, the self, and the next possible actions. The self is not a ghost in the machine. It is a persistent model that binds memory, agency, body, social identity, and future planning.

Science is collective compression. A theory replaces a pile of observations with a shorter rule. Theories fail when the compression drops something important. Progress often begins as discomfort with what the old compression cannot explain.

Society is distributed compression. Language compresses experience. Law compresses conflict into rules. Science compresses observations into theories. Culture compresses survival knowledge into stories, rituals, norms, and institutions. Education compresses centuries into a curriculum. Bureaucracy compresses judgment into procedure. Good societies preserve useful compressions while staying able to update them. Bad societies mistake old compressions for reality itself.

The economy is also a compression system. Prices compress distributed information about scarcity, demand, labor, risk, and opportunity. Hayek's famous argument for the price system was, in part, an argument about the use of dispersed knowledge.⁴ Firms exist because some coordination is cheaper to compress inside an organization than through market transactions. Money, contracts, accounting, brands, and credit are all compression formats. They are powerful because they compress. They are dangerous for the same reason. A price can carry a lot of information. It can also omit the thing that matters most.

The future of AI will depend on whether we learn to build systems that separate pattern from arbitrary detail. By pattern, I mean the reusable structure that transfers to new situations. By arbitrary detail, I mean facts, quirks, correlations, and residues that may matter in one context but should not become part of the model's deepest picture of the world.

Bigger models will matter. But size alone is not the end state. We will likely want systems that keep durable abstractions in the reasoning core and move contingent information into memory, retrieval, tools, databases, and verifiable workflows. This is also a familiar human pattern: our thinking already extends through notebooks, search, software, and other external supports.⁵ The point is not to store less. The point is to store the right things in the right places.

The risk is not that AI compresses. Everything intelligent compresses. The risk is that it compresses the wrong thing.

To compress the wrong thing is to preserve the wrong signal and discard the thing that mattered. An AI system may compress human values into a metric. It may compress truth into plausible text. It may compress judgment into optimization. It may compress a society into categories that are easy to measure but false to live inside. The danger is not abstraction. The danger is a bad abstraction becoming powerful.

This is why the compression paradox is not just an essay about cognition. It is becoming a deeper inquiry into how minds, institutions, markets, and machines decide what counts as signal.

The core belief is simple.

The world is too large to store. Intelligence is the search for the representations that make it usable. Progress comes from better compression.

But every compression has a remainder. A map leaves things out. A theory ignores variables. A price omits context. A model forgets texture. Sometimes the omitted material is harmless. Sometimes it is the load-bearing part.

That is why constraint matters. A good compression is not merely small. A good compression preserves what matters for the next prediction, the next action, the next person, or the next world it will be asked to explain.

The next step is to measure the difference between the rule and the residue. We need to know when a learner has discovered reusable structure, and when it has merely stored surface detail. The question is whether a system has preserved what transfers or merely stored what happened.

That would make understanding less mystical. Understanding is what happens when the short rule beats the lookup table.

Notes

[1] The prediction-compression connection is exact in the setting of lossless coding: a probabilistic predictor can be turned into a compressor through arithmetic coding, and better log loss means a shorter code for the same stream. See Delétang et al., Language Modeling Is Compression, ICLR 2024.

[2] Zhang, Bengio, Hardt, Recht, and Vinyals showed that standard neural networks can fit random labels, helping motivate the modern generalization puzzle: why do systems with enough capacity to memorize still generalize on real data? See Understanding Deep Learning Requires Rethinking Generalization, ICLR 2017.

[3] The pattern/arbitrary distinction should not be read as "all memorization is bad." Vitaly Feldman's Does Learning Require Memorization? A Short Tale about a Long Tail, STOC 2020, shows why memorization can be necessary for near-optimal learning on long-tailed natural distributions. The target here is structureless residue, not rare but real structure.

[4] Hayek's The Use of Knowledge in Society, American Economic Review, 1945, is the canonical statement of prices as a mechanism for coordinating dispersed local knowledge. This essay uses that idea in a broader information-theoretic sense, not as a blanket defense of any particular market outcome.

[5] Andy Clark and David Chalmers argued that tools can become part of a cognitive system when they reliably perform the same functional role as internal memory or reasoning. See The Extended Mind, Analysis, 1998.