Introduction
Can a neural network perform complex tasks without learning its weights at all? This counterintuitive idea lies at the heart of a recent theoretical breakthrough. While conventional training methods adjust both weights and biases to optimize performance, new research by Williams and colleagues proposes a radical simplification: freeze all the weights, assigning them randomly at initialization, and train only the biases. Surprisingly, this minimalist approach preserves much of the expressive power typically attributed to full training. In doing so, it challenges long-held assumptions in deep learning and opens up new paths for lightweight, efficient neural models.
This article unpacks the findings of the study, exploring how bias-only training with random weights can approximate complex functions, how depth enhances flexibility, and what this means for the future of neural network design.
The Core Hypothesis: Freezing Weights, Tuning Biases
Most neural networks rely on learned weights and biases—two parameters traditionally adjusted during training. The authors of this study ask: what happens if we remove weights from the learning process entirely? That is, suppose we sample all weights from a random distribution at initialization and never update them. Instead, we only train the biases.
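To make the setup concrete, here is a minimal PyTorch sketch of this training regime (the architecture, optimizer, and learning rate are our illustrative choices, not the paper's): the weights stay at their random initialization and only the bias vectors receive gradient updates.

```python
import torch
import torch.nn as nn

# A small multilayer perceptron; the sizes are illustrative, not from the paper.
model = nn.Sequential(
    nn.Linear(10, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

# Freeze every weight at its random initialization; leave only the biases trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

# The optimizer only ever sees the bias vectors.
bias_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(bias_params, lr=1e-2)
```

From here the usual training loop applies unchanged: backpropagation still flows through the frozen weights, but only the biases move.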
This question, which may seem purely theoretical, has profound practical implications. Reducing the number of trainable parameters simplifies optimization, reduces memory usage, and could help in hardware-constrained or privacy-preserving settings. But does it actually work?
The study answers with a resounding yes. Under this constrained regime, neural networks—both shallow and deep—retain an unexpectedly high level of expressivity, the mathematical term for their ability to represent a wide class of functions.
Shallow Networks: Universal Approximation Still Achievable
A single hidden-layer network (i.e., a shallow network) with random weights and trainable biases still retains universal approximation capabilities. The authors prove that such networks can approximate any continuous function on a compact domain to arbitrary precision—provided the number of neurons is sufficiently large.
This result builds upon classic universal approximation theorems, but with a twist: instead of adapting both weights and biases, the network learns only biases. These biases shift the activation functions, allowing the network to "slide" the fixed shapes created by the random weights into alignment with the target function. Theoretical analysis shows that with a dense enough hidden layer and a suitable activation function, the network can approximate smooth functions on compact subsets of ℝ^d.
In essence, while random weights lock in the "geometry" of the network, biases remain free to shift this geometry to fit the data.
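To make that sliding picture concrete for a single ReLU unit (our gloss, in our own notation): the fixed random weight vector sets the orientation of a ridge-shaped feature, and the bias alone decides where its kink sits.

```latex
h_b(x) = \operatorname{ReLU}\bigl(w^\top x + b\bigr) = \max\{0,\; w^\top x + b\}
```

The unit is inactive wherever w·x < -b, so its kink sits on the hyperplane w·x = -b; since w is frozen, adjusting b slides this fixed ridge along the direction of w, which is exactly the freedom the network uses to line random features up with the target.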
What Are "Smooth Functions on Compact Subsets of ℝ^d"?
To understand the kinds of functions that neural networks are learning to approximate, we need to unpack the phrase “smooth functions on compact subsets of ℝ^d.”
Think of ℝ^d as d-dimensional space: every point in it is a list of d real numbers, like coordinates describing a position.
A compact subset of this space is like a sealed, finite box—it contains all the points inside it, including the boundary, and it doesn’t stretch out infinitely. Imagine a closed interval on a number line, or a solid ball in 3D space. You can walk around it, but you’ll never fall off an edge or wander off forever.
Now, a smooth function is one that behaves exceptionally well—it’s the mathematical equivalent of a perfectly flowing curve. There are no kinks, jumps, or corners. If you were to draw it, your pen would never need to lift from the paper, and the curve would gently transition in any direction you like. Mathematically, this means the function has derivatives of all orders, and they’re all continuous.
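For readers who want the formal version (this is the standard definition, not notation specific to the paper): a function f is smooth on a compact set K ⊂ ℝ^d when all of its partial derivatives, of every order, exist and are continuous, i.e.

```latex
\frac{\partial^{\,k} f}{\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_k}} \in C(K)
\quad \text{for every order } k \ge 1 \text{ and every choice of indices } i_1, \dots, i_k .
```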
So, when researchers say that a neural network can approximate “smooth functions on compact subsets of ℝ^d,” they mean the network can match any such well-behaved function as closely as desired over a bounded region of input space.
This kind of approximation is foundational in understanding how even constrained neural architectures can still represent intricate behaviors seen in real-world data.
Deep Networks: Layers Compensate for Constraints
The authors go further to show that depth amplifies expressivity. Deep networks with randomly initialized weights and learned biases can simulate a broad family of compositional functions. By stacking layers, the model leverages nonlinearity at each stage, allowing biases to interact in increasingly complex ways.
In fact, the study proves a kind of depth separation: there exist functions that a two-layer network with bias-only learning can approximate with far fewer parameters than a single-layer counterpart. This depth advantage mirrors results in standard networks but is striking here given the strong constraint of fixed weights.
A key takeaway: depth acts as a compensatory mechanism. While the weights are fixed and randomly chosen, deeper architectures recover expressivity by offering more layers where biases can reshape the input at progressively abstract levels.
Practical Implications and Use Cases
Why does this matter? For one, bias-only training reduces the number of parameters that need optimization, offering faster convergence and lower hardware requirements. This makes the technique particularly attractive for low-resource environments—think mobile devices, edge computing, or federated learning systems, where communication and computation are at a premium.
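A quick back-of-the-envelope check (illustrative sizes, not the paper's experiments) shows how small the trainable fraction becomes: per layer, the weights scale with the square of the width while the biases scale only linearly.

```python
import torch.nn as nn

def bias_fraction(d_in: int, width: int, depth: int, d_out: int) -> float:
    """Fraction of parameters that remain trainable when only biases are learned."""
    layers = [nn.Linear(d_in, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, d_out))
    model = nn.Sequential(*layers)
    n_bias = sum(p.numel() for name, p in model.named_parameters() if name.endswith("bias"))
    n_total = sum(p.numel() for p in model.parameters())
    return n_bias / n_total

# Weights grow roughly with width**2 per layer, biases only with width,
# so the trainable share shrinks as the network gets wider.
print(f"{bias_fraction(d_in=64, width=1024, depth=4, d_out=10):.3%}")  # roughly 0.13%
```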
Furthermore, the fixed-weight architecture introduces a form of built-in regularization. Since the weights are not updated, the model is less prone to overfitting. This could be especially useful in scenarios with limited data or high noise, such as sensor networks or real-time medical diagnostics.
The authors also note that this approach aligns naturally with privacy-preserving machine learning. Because the only data-dependent updates occur in biases, smaller parameter deltas need to be shared across devices in collaborative learning environments.
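In a federated setting this suggests a very light update payload. The sketch below is our illustration of the idea (not an API from the paper): clients share only their bias vectors, assuming every participant regenerates the same frozen random weights, for example from a common seed.

```python
import torch
import torch.nn as nn

def bias_update(model: nn.Module) -> dict[str, torch.Tensor]:
    """Collect only the bias vectors -- the sole trainable parameters -- as the payload
    a client would send to the server in a bias-only federated scheme."""
    return {name: p.detach().clone()
            for name, p in model.named_parameters() if name.endswith("bias")}

def apply_bias_update(model: nn.Module, update: dict[str, torch.Tensor]) -> None:
    """Load received bias vectors into a model whose frozen random weights are
    shared by all participants (e.g., regenerated from a common seed)."""
    model.load_state_dict(update, strict=False)
```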

Theoretical Innovations: Proving Expressivity
The strength of this paper lies in its rigorous mathematical treatment. The authors introduce several key technical contributions:
- Bias Function Spaces: They define function classes that can be realized by adjusting only biases, given fixed random weights. These are shown to be dense in C(K), the space of continuous functions on compact domains.
- Activation Function Conditions: Expressivity results are proven for a wide class of activation functions, including ReLU, sigmoid, and others—offering general applicability.
- Bias-based Depth Separation: They construct explicit examples where deep networks with trainable biases outperform shallow ones, reinforcing the power of layered abstraction.
- Approximation Error Bounds: The study quantifies how many neurons are needed to reach a certain approximation error, showing that performance scales favorably with network width and depth.
These theoretical results are backed by careful analysis of the geometry of activation regions and how biases can translate and reshape these regions to approximate target functions.
Key Theoretical Conclusion
The researchers prove that neural networks with fixed random weights and learned biases can universally approximate:
- Functions: For feedforward networks (FNNs), with a wide enough hidden layer and a suitable activation function, the network can approximate any continuous function on compact subsets of ℝ^d to arbitrary precision (see the paraphrased statement below).
- Dynamical Systems: For recurrent networks (RNNs), the same holds for approximating finite-time trajectories of smooth dynamical systems.
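In theorem form, the feedforward statement can be paraphrased roughly as follows (our notation and wording; the guarantee holds with high probability over the random draw of the weights):

```latex
\forall\, f \in C(K),\ K \subset \mathbb{R}^d \text{ compact},\ \forall\, \varepsilon > 0:\quad
\exists\, n,\ \exists\, b \ \text{ such that } \
\sup_{x \in K} \bigl| f(x) - \Phi_{W,b}(x) \bigr| < \varepsilon ,
```

where Φ_{W,b} denotes the network with frozen random weights W, hidden width n, and trainable biases b.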
Critical Conditions
- Activation Functions: The activation must be "bias-learning", meaning individual units can be effectively "masked" (driven to zero output) by tuning their bias alone (made concrete in the note after this list).
- Width: The hidden layer must be sufficiently wide (existence guaranteed, though bounds are loose).
- Random Weights: Weights are frozen and sampled from a uniform distribution, but biases are optimized.
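To see why ReLU satisfies the masking condition (our illustration): because the inputs range over a compact set, a sufficiently negative bias pushes a unit's pre-activation below zero everywhere, silencing that unit without touching its weights.

```latex
b \;\le\; -\max_{x \in K} w^\top x
\quad\Longrightarrow\quad
\operatorname{ReLU}\bigl(w^\top x + b\bigr) = 0 \ \text{ for all } x \in K .
```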
Limitations and Open Questions
While the expressivity of bias-trained networks is impressive, there are important caveats. The authors acknowledge that certain geometric tasks, such as those requiring translation or rotation invariance (common in image processing), may not be well-suited to random weights. In such cases, convolutional architectures or structured weight initialization may be required.
Additionally, optimization remains a challenge. Training only biases can lead to ill-conditioned gradients in deep networks. The authors suggest that further research into adaptive learning rates or second-order methods could mitigate this issue.
Another limitation is that expressivity does not imply generalization. While the network can represent complex functions, whether it can generalize from finite data remains an open question. Empirical studies are needed to understand performance across a wider range of real-world tasks.
Broader Impact: Redefining Network Design
This work challenges the default assumption that weight learning is essential. By showing that neural networks can achieve universal approximation through biases alone, it invites a reevaluation of network design.
Future extensions might include:
- Hybrid architectures, where only parts of the weights are learned while others remain fixed.
- Hardware-efficient models, optimized for neuromorphic or analog devices where weight updates are costly or impractical.
- Bias-only federated learning, enabling lightweight model updates with minimal communication.
As deep learning matures, we may see a shift toward modular, sparsely trained systems, where carefully chosen parameters are updated while others are left static. This study lays a compelling foundation for that vision.
Conclusion
The idea that neural networks can learn rich functions without adjusting their weights is both surprising and revolutionary. Williams et al. show that with fixed random weights and trainable biases, both shallow and deep networks retain much of the expressivity that has made neural networks so powerful. Their findings challenge conventional design philosophies and point toward more efficient, scalable, and robust approaches to learning.
Beyond the theory, the results echo the practical success of bias-only tuning in architectures such as transformers. Drawing inspiration from neuroscience, the work also suggests that biological systems may rely on analogous bias-like mechanisms, such as neuromodulation, for rapid adaptation without requiring changes to synaptic weights. Most significantly, the study demonstrates that neural networks with random weights and learned biases can serve as universal approximators, given sufficiently wide hidden layers and suitable activation functions. This finding extends classical universal approximation theorems to a drastically reduced parameter regime in which only the biases are trained.
As we push the boundaries of AI into new frontiers—on devices, at scale, and under constraints—this minimalist approach could prove not just elegant, but essential.
Reference
Williams, Ezekiel, Maria-Florina Balcan, Daniel Reichman, and Colin White. "Expressivity of Neural Networks with Random Weights and Learned Biases." arXiv preprint arXiv:2407.00957v3 (2024). https://arxiv.org/abs/2407.00957