Introduction
Transformers are more than predictive engines: they are universal learners. Unlike traditional models that require explicit retraining through weight updates for every new task, transformers exhibit the capacity to learn dynamically in context. They adapt on the fly, performing translation, summarization, regression, or classification by processing examples directly in their input sequence. This perspective frames transformers not as static function approximators, but as architectures that implement in-context mappings: functions that depend simultaneously on a token and its surrounding context.
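Concretely, an in-context mapping takes two arguments: a query token and the context it sits in. A minimal Python sketch (a single softmax-weighted read-out, purely illustrative and not the paper's construction) shows how one fixed function can implement different rules depending on the context it is given:

```python
import math

def in_context_map(query, context):
    # Toy in-context mapping: the output depends jointly on the
    # query token and the surrounding (input, output) examples.
    # A single softmax-weighted read-out -- illustrative only.
    scores = [-(query - x) ** 2 for x, _ in context]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, context)) / total

# The same fixed function adapts to whatever rule the context encodes:
doubling = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
negating = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]
print(in_context_map(2.0, doubling))  # 4.0: inferred "double it"
print(in_context_map(2.0, negating))  # -2.0: same weights, new rule
```

The function's parameters never change; only the context does, and with it the rule the function implements.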
The research by Furuya, de Hoop, and Peyré (2024) places this capacity on firm mathematical ground. They prove that transformers are universal in-context learners: they can approximate any continuous in-context function with arbitrary precision. This universality holds even when the model’s embedding dimension and number of attention heads remain fixed, and it extends to settings with arbitrarily many, or even infinitely many, tokens.
A Measure-Theoretic View of Context
The key insight of the paper is that a context should not be seen as a fixed sequence of tokens, but as a probability distribution over tokens. Rather than working with discrete lists of symbols, the transformer processes the entire collection of examples as if they formed a probability measure. This shift in perspective makes it possible to study transformers with both finite and infinite contexts, while preserving the continuity needed for universal approximation.
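This measure-theoretic reading can be made concrete: if a context is a probability measure, attention becomes an expectation under a reweighted version of that measure. The toy sketch below (illustrative only) exhibits the resulting invariance: splitting one token into two half-mass copies leaves the output unchanged, which is what lets a single analysis cover finite and infinite contexts alike.

```python
import math

def measure_attention(query, atoms):
    # Context as a probability measure: a list of (mass, key, value)
    # atoms whose masses sum to 1.  The read-out is an expectation
    # under the softmax-tilted measure, so it depends only on the
    # distribution, not on any particular token list (toy sketch).
    tilts = [m * math.exp(query * k) for m, k, _ in atoms]
    z = sum(tilts)
    return sum(t * v for t, (_, _, v) in zip(tilts, atoms)) / z

# Splitting one atom into two half-mass copies leaves the output
# unchanged: the token list differs, the distribution does not.
a = [(0.5, 1.0, 2.0), (0.5, 3.0, 6.0)]
b = [(0.25, 1.0, 2.0), (0.25, 1.0, 2.0), (0.5, 3.0, 6.0)]
print(math.isclose(measure_attention(0.7, a),
                   measure_attention(0.7, b)))  # True
```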
By reformulating attention mechanisms in this way, the authors provide a unifying mathematical framework. It shows that transformers do not operate on isolated tokens but on structured collections of contextual information, which allows them to generalize across vastly different input sizes and types.
Unmasked Transformers and Universality
When transformers are not restricted by causal attention, as in vision transformers or bidirectional encoders, their universality is most transparent. The authors prove that such models can approximate any continuous mapping from a set of contextual examples to an output. Importantly, this capacity does not require scaling the embedding dimension or the number of attention heads indefinitely. Even with fixed architecture size, an unmasked transformer can represent arbitrarily complex relationships.
This result formalizes what practitioners already observe: unmasked transformers can adapt to a wide variety of tasks without retraining, learning directly from contextual cues. The authors also show that this property holds even when the number of tokens grows without bound, which means universality extends to settings such as large images or long documents.
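Nothing in an unmasked attention read-out depends on the number of tokens, which is why a fixed architecture can ingest a handful of examples or thousands. A toy sketch with scalar tokens (purely illustrative):

```python
import math
import random

def readout(query, tokens):
    # Unmasked attention read-out over a multiset of scalar tokens
    # (toy).  No parameter depends on len(tokens): the same fixed
    # function accepts three tokens or three thousand.
    ws = [math.exp(-(query - t) ** 2) for t in tokens]
    z = sum(ws)
    return sum(w * t for w, t in zip(ws, tokens)) / z

rng = random.Random(0)
short = [rng.uniform(-1.0, 1.0) for _ in range(3)]
long_ = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
print(readout(0.0, short))
print(readout(0.0, long_))
```

The read-out is also permutation invariant: the context enters only through the distribution of its tokens, not their order.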
Masked Transformers and Causal Learning
The situation is more constrained in autoregressive models, such as GPT, where each token can attend only to its predecessors. To capture this formally, the authors augment each token with its position in time, ensuring that causality is respected. Under this formulation, they prove that masked transformers are also universal learners, though under stronger assumptions.
Specifically, the distributions of examples must evolve smoothly with respect to time, and the learning process must be identifiable from the past alone. Within these constraints, masked transformers are shown to approximate any causal learning rule. This explains how autoregressive models can generate coherent reasoning patterns from a few examples, without storing or memorizing explicit solutions.
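The time-augmentation idea can be sketched directly: give each token a time stamp, and let a query at time t attend only to tokens with stamps at most t. The toy read-out below (an illustration of the masking, not the paper's construction) makes the causal constraint explicit:

```python
import math

def causal_readout(t_query, x_query, tokens):
    # Tokens augmented with time stamps: (time, key, value).
    # A query at time t attends only to tokens with time <= t,
    # so whatever rule is inferred must be identifiable from
    # the past alone (toy sketch of the masking).
    past = [(k, v) for t, k, v in tokens if t <= t_query]
    ws = [math.exp(x_query * k) for k, _ in past]
    z = sum(ws)
    return sum(w * v for w, (_, v) in zip(ws, past)) / z

stream = [(1, 0.2, 1.0), (2, -0.4, 2.0), (3, 0.9, 3.0)]
early = causal_readout(2, 0.5, stream)  # sees tokens 1 and 2 only
late = causal_readout(3, 0.5, stream)   # sees the whole stream
print(early, late)
```

Dropping the future tokens from the context gives exactly the same early output, confirming that the mask alone enforces causality.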
Compositional Structure of Transformers
Both masked and unmasked transformers achieve universality not in a single step, but through the composition of multiple layers. Attention mechanisms and feedforward networks act repeatedly on the contextual information, gradually building richer representations. The paper provides a precise description of how these compositions function, ensuring that the universal property is preserved as depth increases.
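A toy rendering of this composition (scalar tokens, residual connections, all choices illustrative) shows the shape of the argument: each block applies attention followed by a pointwise feedforward map, and depth is simply repeated composition.

```python
import math

def attn(tokens):
    # One unmasked self-attention pass over scalar tokens (toy).
    out = []
    for q in tokens:
        ws = [math.exp(q * k) for k in tokens]
        z = sum(ws)
        out.append(sum(w * v for w, v in zip(ws, tokens)) / z)
    return out

def ffn(tokens):
    # Pointwise feedforward step: the same scalar map applied to
    # every token independently (a two-piece ReLU, toy choice).
    return [max(0.0, t) - 0.5 * max(0.0, t - 1.0) for t in tokens]

def transformer(tokens, depth=3):
    # Composition of blocks: attention mixes contextual
    # information, the feedforward map transforms it, and residual
    # connections carry representations from layer to layer.
    for _ in range(depth):
        tokens = [t + a for t, a in zip(tokens, attn(tokens))]
        tokens = [t + f for t, f in zip(tokens, ffn(tokens))]
    return tokens

print(transformer([0.1, -0.2, 0.3]))
```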
Practical Example: Regression in Context
The theory is not purely abstract. The authors demonstrate how transformers can perform regression directly in context. Given a series of input-output examples, the model infers the underlying relationship and applies it to new inputs, all without adjusting its parameters. This confirms in a concrete case what the universality results predict: transformers are capable of learning rules dynamically during inference.
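The setup can be sketched as follows: the context is a list of input-output pairs drawn from an unknown linear rule, and a closed-form least-squares read-out stands in here for the inference the trained transformer performs inside its forward pass (the paper's experimental details differ).

```python
import random

def make_context(w, b, n=32, seed=0):
    # In-context regression task: n examples from an unknown
    # linear rule y = w*x + b (hypothetical toy setup).
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, w * x + b) for x in xs]

def infer_rule(context):
    # Recover (w, b) from the context alone via ordinary least
    # squares -- a stand-in for what the transformer infers during
    # inference, without any parameter updates.
    n = len(context)
    sx = sum(x for x, _ in context)
    sy = sum(y for _, y in context)
    sxx = sum(x * x for x, _ in context)
    sxy = sum(x * y for x, y in context)
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return w, b

context = make_context(2.0, -0.5)
print(infer_rule(context))  # close to (2.0, -0.5)
```

The rule is never stored in any weights; it exists only in the context, and the read-out reconstructs it on demand.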

Limitations and Open Questions
Despite the strength of the results, several limitations remain. The proofs guarantee universality but give no explicit estimates of how large or deep a transformer must be to reach a desired accuracy. The proof techniques also require the number of attention heads to scale with the output dimension, which may be an artifact of the analysis rather than a genuine requirement in practice.
Additionally, the analysis does not cover advanced positional encoding schemes, such as rotary embeddings, which are crucial in modern models. Extending the framework to include them remains an open challenge. Finally, universality describes what transformers can represent in theory, but it does not ensure that standard training methods will reliably find these representations. Issues such as convergence and efficiency in practice are left for future research.
Conclusion
The findings of Furuya and colleagues confirm that transformers are not just sophisticated sequence models but genuine universal learners. They can adapt to new tasks instantly, interpreting examples within context and generalizing without retraining. This redefines how we understand learning in artificial intelligence, moving beyond weight updates and towards models that learn continuously through context itself.
Reference
Furuya, Takashi, Maarten V. de Hoop, and Gabriel Peyré. "Transformers Are Universal In-context Learners." arXiv preprint arXiv:2402.09368 (2024). https://arxiv.org/pdf/2402.09368