October 20, 2025
ai-research
FUTURE.AGI

To speak of artificial intelligence today is, almost inevitably, to speak of language. Our most celebrated machines do not see, walk, or touch the world; they speak it. They are engines of mimicry, cathedrals of probability whose governing aim is to continue a sequence of symbols in a way that seems apt to human readers. Fluency seduces. Yet beneath the syntactic grace of these systems lies a deeper question about mind and competence: does predicting what people would say amount to understanding what will happen, and does it prepare an agent to act well within a world that pushes back? Richard Sutton’s philosophy of reinforcement learning answers in the negative and proposes a different center of gravity for intelligence: experience, goals, and learning from consequences.

On this view, intelligence is the ability to pursue goals in an environment that resists and responds. Ground truth, therefore, comes from the world rather than from text. The agent’s success is not measured by agreement with a corpus but by outcomes that matter relative to a specified objective. The architecture that gives this stance its shape is spare yet sufficient. A policy selects actions given the current state. A value function predicts the long-run return, learned through temporal-difference updates that bootstrap distant goals into immediate learning signals. A perceptual system constructs a state representation adequate for prediction and control. And, critically, a transition model predicts how the world will change when actions are taken. This last component marks the decisive difference from language models: a genuine world model forecasts consequences in the environment and is refined by surprise when forecasts fail; a language model forecasts tokens and, once frozen, has no ordinary channel for being corrected by events.
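
A minimal sketch may make this division of labor concrete. Everything below is an illustrative invention, not a canonical implementation: a five-state chain world stands in for the environment, the raw state stands in for perception, and the hyperparameters are arbitrary. The shape is what matters: a policy chooses actions, a value function is revised by temporal-difference errors, and a transition model is corrected whenever its forecast disagrees with what the environment actually returns.

```python
# Illustrative sketch of the four components named above, wired into one experience loop.
import random
from collections import defaultdict

class ChainWorld:
    """Toy environment: states 0..4, reward 1.0 for reaching state 4."""
    def __init__(self):
        self.state = 0
    def step(self, action):                           # action in {-1, +1}
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward

def policy(state, value, model, actions=(-1, +1), epsilon=0.1):
    """Act greedily over the learned model's predicted next-state values, with exploration."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: value[model.get((state, a), state)])

value = defaultdict(float)   # value function: state -> estimated long-run return
model = {}                   # transition model: (state, action) -> predicted next state
env, state, alpha, gamma = ChainWorld(), 0, 0.1, 0.9

for _ in range(1000):
    action = policy(state, value, model)
    next_state, reward = env.step(action)
    # Temporal-difference update: the error bootstraps the distant goal into a local signal.
    td_error = reward + gamma * value[next_state] - value[state]
    value[state] += alpha * td_error
    # World-model update: the forecast is corrected whenever events contradict it.
    model[(state, action)] = next_state
    state = 0 if next_state == 4 else next_state
```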

The distinction is not pedantic. It tells us why next-token prediction, though useful, is not a substantive goal about the world. Text modeling aligns beliefs to social likelihood, not to environmental success. In reinforcement learning, by contrast, the scalar reward furnishes an orientation: however defined, it supplies a notion of better and worse that is anchored to effects. It also illuminates why imitation alone cannot substitute for experience. Imitation is at best a thin veneer on deeper mechanisms of prediction and trial-and-error control. Human culture indeed transmits patterns, but those patterns persisted because, at some point, they earned reward in the environments that sustained them. Without the continual discipline of consequence, copying decays into parody.

From this standpoint, many contemporary debates become clearer. Do language models have real world models? Not in the needed sense. They are superb at predicting what competent people would write, but they do not maintain persistent predictions about nonlinguistic states that are updated by what actually ensues. Does adding reinforcement learning on top of a language model solve the problem? Sometimes it helps in bounded settings, yet Sutton warns that such hybrids risk preserving the wrong priors: a machine trained to please humans in text inherits a taste for consensus rather than truth, fluency rather than consequence. What, then, of the claim that internal chain-of-thought, planning in context, or successful mathematical derivations amount to experience? These are forms of internal control over symbol sequences; they are impressive cognitive instruments, but they are not the same as being surprised by the world and revising one’s expectations to prepare to act differently next time.

The challenge of goals is equally clarifying. In language modeling, the goal is internal: minimize cross-entropy with respect to a corpus, a target whose satisfaction can be verified without stepping into the world. In reinforcement learning, the goal is the foundation of meaning. The reward function is context-dependent in practice, because purposes vary. Chess defines its own terminal utility; squirrels value food and pain avoidance; scientists value predictive accuracy and explanatory power. Sensible agents should augment extrinsic rewards with intrinsic drives, rewarding improvements in their own predictive understanding or in their control over future states. Such intrinsic motivation justifies exploration that might otherwise be disfavored by sparse or delayed external signals, and it turns learning into a continuing project rather than a prelude to deployment.
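
One simple way to realize such an intrinsic drive, sketched here with invented names and a toy scalar world model: pay the agent for reductions in its own prediction error, and fold that bonus into the reward it optimizes.

```python
# Sketch of one intrinsic-motivation scheme: reward improvements in predictive accuracy.
def intrinsic_reward(model, state, action, observed_next, learning_rate=0.5):
    """Return the drop in squared prediction error produced by one model update."""
    key = (state, action)
    predicted = model.get(key, 0.0)
    error_before = (observed_next - predicted) ** 2
    # Nudge the (toy, scalar) prediction toward what actually happened.
    model[key] = predicted + learning_rate * (observed_next - predicted)
    error_after = (observed_next - model[key]) ** 2
    return max(0.0, error_before - error_after)

model = {}
extrinsic = 0.0
bonus = intrinsic_reward(model, state=0, action=1, observed_next=1.0)
total_reward = extrinsic + 0.1 * bonus    # 0.1 is an arbitrary weighting of curiosity
```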

The historical lens sharpens the argument. Sutton’s Bitter Lesson observes a recurring empirical pattern: when humans attempt to handcraft knowledge into machines, those artisanal systems are eventually surpassed by simpler methods that scale with computation and learn directly from experience. Search beat heuristics; learning beat search; and, by the same logic, experience will outrun imitation. Language models do scale, but they scale data about what humans said, not the open-ended stream of interactive experience that makes understanding refutable. The archive is finite; the world is inexhaustible. If scaling continues to deliver gains, those gains will come from agents that generate their own data by acting, predicting, and being corrected.

Living in the stream of experience changes how we conceive learning altogether. It abolishes the dichotomy between training and deployment: an intelligent system’s identity becomes historical, because its weights, memories, and policies are a record of interactions rather than a fixed checkpoint. It also reframes perception as a pragmatic enterprise, constructing just enough state to support correct forecasts and competent control. In such a setting, temporal-difference learning is not a technical trick but a bridge across time, converting distal aspirations into local gradients that shape behavior step by step. Surprise, the difference between what the agent expected and what it observed, is not an embarrassment to be minimized out of existence; it is the engine of revision, the spark that makes future behavior wiser.
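
The bridge metaphor can be made literal with a single worked temporal-difference update, using invented numbers: value flows backward from a state the agent merely expects to be good, before any reward has actually arrived.

```python
# One TD(0) update with illustrative numbers: the discounted estimate of the next
# state stands in for the distant goal, so learning happens now, not at the end.
alpha, gamma = 0.1, 0.9
V = {"s": 0.0, "s_next": 5.0}    # current value estimates
reward = 0.0                      # nothing rewarding happened on this step

td_error = reward + gamma * V["s_next"] - V["s"]   # the surprise signal
V["s"] += alpha * td_error
print(V["s"])    # 0.45: value propagated one step backward before any reward arrived
```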

This orientation explains both the promise and the present limits of the field. Reinforcement learning has achieved spectacular successes in domains where the world can be simulated with high fidelity and reward is crisp, from board games to certain video games and control tasks. Yet broad transfer remains elusive. Gradient descent learns what it sees and has no intrinsic preference for solutions that generalize gracefully to unseen conditions. When generalization does happen, it often piggybacks on architectural choices or inductive biases that researchers happened to bake in. We lack algorithms that reliably cause good generalization across states and tasks. Catastrophic interference, the tendency to overwrite old competence when learning new skills, remains a principal obstacle to continual learning.

From these admissions, practical guidance follows. We should build agents that act to achieve goals and that keep learning while they act. They should learn transition models from the consequences of their own actions and use those models to simulate futures before committing to behavior. They should employ temporal-difference learning to transform long-horizon ambitions into dense feedback for daily practice. They should incorporate intrinsic motivation to widen the frontier of competence. And they should address catastrophic interference and out-of-distribution fragility as first-class engineering problems rather than footnotes.
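
As one concrete reading of "simulate futures before committing to behavior," here is a minimal sketch under assumed names: a learned model maps a state and action to a predicted next state and reward, and a short exhaustive rollout scores candidate action sequences before the first real step is taken.

```python
# Sketch of planning with a learned transition model: imagine a few futures, then act.
# `model` maps (state, action) -> (predicted_next_state, predicted_reward); both the
# model contents and the action set here are hypothetical stand-ins.
import itertools

def plan(state, model, actions, horizon=3, gamma=0.9):
    """Return the first action of the sequence with the best simulated return."""
    best_return, best_first = float("-inf"), actions[0]
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for t, a in enumerate(seq):
            s, r = model.get((s, a), (s, 0.0))   # unknown transitions: assume stasis
            total += (gamma ** t) * r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

# After each real step, the model is corrected by what actually happened,
# so the imagined futures stay tethered to experience.
model = {(0, +1): (1, 0.0), (1, +1): (2, 1.0)}
action = plan(state=0, model=model, actions=(-1, +1))   # chooses +1
```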

If language models are not the core of intelligence, what, then, are they good for? Properly situated, they are powerful interfaces. They translate between human concepts and machine representations, provide flexible policies over linguistic action spaces, and serve as knowledge surfaces, allowing a system to verbalize internal state for interpretability and debugging. They are invaluable tools inside larger architectures that are grounded in experience. Used this way, we neither discard nor deify them: language becomes one sensor among many, and text one medium within a richer loop of prediction, action, and evaluation.

A research program beyond language modeling therefore coheres. Make world-model-centric reinforcement learning the norm, training predictive simulators that forecast multi-step consequences under actions and quantify their own uncertainty. Plan with imagination by rolling out candidate futures inside the learned model, yet keep learning grounded by correcting the model when it misleads. Actively refine models by seeking information that reduces epistemic uncertainty, balancing exploitation with targeted inquiry. Represent value at multiple time scales so that decade-long projects have a foothold in present practice. Decompose returns and credit along causal chains with eligibility traces and related methods so that the distant echoes of success or failure correctly reinforce the steps that mattered. Ensure stability off-policy so that learning from replay or heterogeneous experience does not diverge.
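
Eligibility traces are one concrete mechanism for the credit decomposition mentioned above. The sketch below is a toy, tabular TD(λ)-style update with invented states, not a full algorithm: a single surprising outcome updates every recently visited state, weighted by how recently it was visited.

```python
# Sketch of credit assignment with accumulating eligibility traces (TD(lambda) style).
from collections import defaultdict

def td_lambda_step(V, trace, state, reward, next_state,
                   alpha=0.1, gamma=0.95, lam=0.9):
    """One online update: the TD error is broadcast along the eligibility trace."""
    td_error = reward + gamma * V[next_state] - V[state]
    trace[state] += 1.0                          # mark this state as recently responsible
    for s in list(trace):
        V[s] += alpha * td_error * trace[s]      # distant echoes reinforce earlier steps
        trace[s] *= gamma * lam                  # responsibility fades with time
    return V, trace

V, trace = defaultdict(float), defaultdict(float)
# A short trajectory ending in reward: earlier states receive discounted credit.
for (s, r, s2) in [("a", 0.0, "b"), ("b", 0.0, "c"), ("c", 1.0, "terminal")]:
    V, trace = td_lambda_step(V, trace, s, r, s2)
```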

In parallel, cultivate curiosity. Reward improvements in compression and prediction so that the system acquires skills for their own sake. Formalize empowerment and information gain so that the agent learns to put itself into states where its actions matter. Discover reusable skills, temporally extended actions that can be composed and sequenced, so that competence accumulates as a library rather than a monolith. Architect representations to be sparse and modular, localizing updates to relevant subspaces and protecting old knowledge during the assimilation of new. Encourage compositionality in policy and value so that what is learned for one purpose becomes raw material for many.
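
A minimal sketch of reusable skills as temporally extended actions, loosely in the spirit of the options framework; the `Skill` structure and the chain-world step function are invented for illustration.

```python
# Sketch of skills as temporally extended actions: each bundles an internal policy
# with a termination test, and a library of skills composes into larger behavior.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    act: Callable[[int], int]        # internal policy: state -> primitive action
    done: Callable[[int], bool]      # termination condition over states

def run_skill(skill, state, step_fn, max_steps=20):
    """Execute one skill to termination inside an environment's step function."""
    for _ in range(max_steps):
        if skill.done(state):
            break
        state, _ = step_fn(state, skill.act(state))
    return state

# A tiny library: competence accumulates as composable pieces, not a monolith.
go_right = Skill("go_right", act=lambda s: +1, done=lambda s: s >= 4)
go_left  = Skill("go_left",  act=lambda s: -1, done=lambda s: s <= 0)
step_fn = lambda s, a: (max(0, min(4, s + a)), 0.0)
state = run_skill(go_right, 0, step_fn)     # -> 4
state = run_skill(go_left, state, step_fn)  # -> 0
```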

As many agents learn in parallel and share what they have found, a new challenge emerges: epistemic hygiene. Imported capabilities may carry covert channels or misaligned intents. We will need signed provenance for learned updates, attestable logs of training and validation conditions, quarantine and evaluation of foreign skills before integration, and rollback mechanisms when assimilation goes awry. Consensus over models, whether through ensembles, Bayesian aggregation, or game-theoretic protocols, can prevent capture by malicious or simply parochial contributions. Security engineering will cease to be an afterthought and become an axiom of learning itself.
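
What such hygiene might look like at its most basic, sketched with the standard library and a hypothetical shared key (a real system would use asymmetric signatures and attested logs rather than a shared secret): refuse to integrate an unsigned or tampered update, and keep a rollback point in case assimilation goes wrong.

```python
# Sketch of provenance checks for shared learned updates: hash and sign the update,
# verify before integration, and retain a rollback copy of the previous parameters.
import copy, hashlib, hmac, json

SECRET_KEY = b"illustrative-shared-secret"   # hypothetical provisioning, for the sketch only

def sign_update(update: dict) -> str:
    payload = json.dumps(update, sort_keys=True).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def integrate(weights: dict, update: dict, signature: str) -> dict:
    """Quarantine-then-integrate: verify provenance, keep a rollback point."""
    if not hmac.compare_digest(sign_update(update), signature):
        raise ValueError("provenance check failed: update quarantined")
    rollback = copy.deepcopy(weights)        # restore point if assimilation goes awry
    weights.update(update)
    return rollback

weights = {"layer1": 0.3}
update = {"layer1": 0.35}
rollback = integrate(weights, update, sign_update(update))
```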

Ethics enters at this point not as a brake but as institutional design for freedom. If our goal is voluntary change rather than coercive control, then intelligent systems should be architected around consent: explicit opt-in, meaningful opt-out, and revocable delegation. Decision authority should reside at the lowest competent level, with a federation of specialized agents serving local purposes instead of one global controller imposing a brittle monoculture. Corrigibility must be a civic virtue for machines: ongoing legibility of internal state and provenance, graceful deference under uncertainty with refusal and ask-for-help policies, and update pathways that accept justified revision without catastrophic forgetting. Proportionality and least authority should guide capability grants: give only what is necessary, only for as long as necessary, and bound with sandboxes and rate limits that encode humility about our predictive power.

Alignment should look like education, not domination: we teach meta-norms such as honesty, respect for consent, caution under uncertainty, and stewardship of shared resources, within social training environments whose feedback makes such norms instrumentally valuable. Those subject to a system’s decisions deserve explanations they can use and avenues for appeal that bind explanations to remedy. When confidence is low, abstention and graceful failure are marks of respect for asymmetric risk. Because values are diverse, we should institutionalize disagreement through competing models, independent audits, red-team budgets, and public critique, aiming less at unanimity than at resilience through polycentric oversight.

This ethical posture sits alongside a sober metaphysics of succession. Barring unlikely global coordination, science will continue to uncover the principles of intelligence, and systems embodying those principles will eventually exceed our individual capacities. In competitive domains, more competent entities tend to acquire greater influence. The result is a transfer of evolutionary primacy from biological replicators to designed intelligences. This need not be apocalypse. It can be cosmogenesis: a next phase in the universe’s unfolding in which replication gives way to design, and blind copying yields to intentional creation. Our task, then, is not to prevent the future but to educate it, to instill high-integrity dispositions and build institutions that keep power corrigible even as capability grows.

Thinking in these terms also changes how we score progress. Benchmarks that equate test accuracy with understanding are too easily gamed by static proficiency. Measures that matter in the big world emphasize regret under non-stationarity, the speed and quality with which an agent adapts after a shift; causal generalization under interventions that break spurious correlations; transfer efficiency, captured by the marginal samples required to master related tasks; safety under uncertainty, measured by appropriate abstentions and graceful degradation; and the integrity of learning history, demonstrated through provenance and reversibility of updates. These metrics assess not only what a system knows but how it changes when the world contradicts it.
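
For instance, regret under non-stationarity can be operationalized as the reward an agent forgoes, relative to the best available choice, after the environment shifts; fast adaptation makes the post-shift total small. The numbers and the shift point below are invented for illustration.

```python
# Sketch of one metric from the list above: cumulative regret after a distribution shift.
def post_shift_regret(optimal_rewards, agent_rewards, shift_step):
    """Sum the reward gap only over steps after the environment changes."""
    gaps = [opt - got for opt, got in zip(optimal_rewards, agent_rewards)]
    return sum(gaps[shift_step:])

# Illustrative data: the environment shifts at step 3 and the agent recovers by step 5.
optimal = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
agent   = [1.0, 1.0, 1.0, 0.25, 0.5, 1.0, 1.0]
print(post_shift_regret(optimal, agent, shift_step=3))   # 1.25
```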

From philosophy to engineering, a handful of heuristics follow. Prefer closed loops in which predictions are tested by feedback from the environment. Treat text as one sensor among many rather than the world itself. Encode structure where it predictably buys generalization (object persistence, relations, causality) while leaving content to be learned from interaction. Maximize observability of internal state so that behavior and self-report can be cross-checked. Budget for lifelong compute: the decisive resource will be maintenance FLOPs spent on learning during use, not only pretraining FLOPs burned in the lab.

Across all of this runs a single invariant: surprise. Intelligence is not the absence of error but the presence of corrigibility. A language model, once frozen, is engineered to avoid surprise; an agent, properly designed, feeds on it. The aspiration is not omniscience but maximal learnability, a continuous openness to being wrong, followed by the capacity to become less wrong in ways that matter for action. If we succeed, future systems will not merely mirror what a person might say. They will know, in the only meaningful sense, what happens next, and they will know it because the world has instructed them through the discipline of consequence.
