A speculative research agenda argues that training vastly over‑parameterized models on cyclical schedules with extremely high learning rates could push them into a “catapult” regime in which they abandon memorization and discover truly generalizable algorithms, potentially closing many gaps between current AI and biological intelligence.
The Core Idea
The proposal suggests a radical shift in how we scale deep learning. Instead of the usual low‑learning‑rate, data‑hungry regime that drives models to memorize huge corpora, researchers would train massively over‑parameterized networks (10 to 100 times the size of today’s largest LLMs) on tiny, highly curated datasets, using cyclical schedules with very high learning rates. The hypothesis is that such training will “catapult” the model out of the memorization basin and into a region of the loss landscape that encodes human‑like abstractions.
Why Current Models Fall Short
| Phenomenon | Current DL behavior | Human behavior |
|---|---|---|
| Sample efficiency | Requires trillions of tokens to reach competence | Learns from a few orders of magnitude less data |
| Adversarial robustness | Still vulnerable after years of scaling | Largely immune to tiny perturbations |
| Developmental stages | Smooth loss curves, no clear spurts | Distinct developmental phases with sudden skill jumps |
| Forgetting vs. memorization | Retains massive amounts of training text | Systematically forgets irrelevant details |
These mismatches point to a bias‑variance trade‑off: modern LLMs drive down bias (they fit the data closely), while human brains appear to favor low variance (they prefer simpler, more generalizable functions). The conjecture is that the low‑variance regime can be reached by combining over‑parameterization with aggressive regularization.
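For reference, the trade‑off mentioned here is usually stated through the textbook decomposition of expected squared error. The notation is the standard one and is an assumption on our part, not taken from the proposal: targets are $y = f(x) + \varepsilon$ with zero‑mean noise of variance $\sigma^2$, $\hat f$ is the learned predictor, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first term is the squared bias, the second is the variance of the predictor across training sets, and the third is noise that no model can remove.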
The Catapult Mechanism
- Extreme over‑parameterization – models with >10 trillion, possibly >100 trillion parameters, far beyond the Chinchilla optimum.
- High, cyclic learning rates – periods of very large learning rates that push the optimizer out of local minima, followed by low‑rate phases that consolidate the new basin.
- Heavy regularization – weight decay alone may be insufficient; additional noise injection, dropout, or explicit norm constraints are needed to keep the model from simply memorizing.
- Tiny, diverse data – a few hundred million high‑quality tokens, aggressively deduplicated, ensuring the model cannot rely on rote memorization.
During the high‑LR spikes the network “explores” the loss landscape, akin to a child’s trial‑and‑error learning phase. When the LR drops, the network refines the discovered structure. Over many cycles the model may transition from a memorization‑dominated basin to a generalization‑dominated basin – the “catapult”.
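The spike‑then‑consolidate cycle described above can be sketched as a simple schedule function. This is a minimal illustration, not the proposal's actual recipe: the function name `cyclic_lr`, the flat spike followed by cosine decay, and every default value (`cycle_len`, `lr_max`, `lr_min`, `spike_frac`) are assumptions chosen for readability.

```python
import math


def cyclic_lr(step, cycle_len=1000, lr_max=1.0, lr_min=1e-4, spike_frac=0.2):
    """Return the learning rate at `step` for a repeating spike/decay cycle.

    All defaults are hypothetical placeholders, not values from the proposal.
    """
    if not 0.0 <= spike_frac < 1.0:
        raise ValueError("spike_frac must be in [0, 1)")
    pos = step % cycle_len                    # position inside the current cycle
    spike_steps = int(cycle_len * spike_frac)
    if pos < spike_steps:
        return lr_max                         # high-LR "exploration" phase
    # Low-LR "consolidation" phase: cosine decay from lr_max toward lr_min.
    progress = (pos - spike_steps) / (cycle_len - spike_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points across two cycles.
samples = [(s, cyclic_lr(s)) for s in (0, 199, 200, 600, 999, 1000)]
```

In a real training loop the returned value would be assigned to the optimizer's learning rate before each step; how the spike amplitude should interact with weight decay is one of the open questions the proposal lists.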
Testbeds and Early Evidence
- Grokking (small nets on simple arithmetic) shows that after a long memorization phase, a model can suddenly discover an algorithm that generalizes perfectly. This is a micro‑scale analogue of the proposed catapult.
- Cyclical LR schedules (e.g., cosine annealing with warm restarts, as in SGDR) have already yielded large jumps in performance on image tasks, suggesting that the dynamics are not limited to toy problems.
- Isoperimetry studies argue that adversarial fragility stems from linear decision‑boundary “dimples”. Over‑parameterized models that learn smoother manifolds should be inherently more robust.
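To make the grokking testbed concrete, here is a minimal sketch of the kind of dataset such experiments use. It assumes modular addition as the arithmetic task, which the text above does not specify; the function name `modular_addition_split`, the modulus `p=97`, and the 30% training fraction are hypothetical choices.

```python
import random


def modular_addition_split(p=97, train_frac=0.3, seed=0):
    """Build every (a, b) -> (a + b) mod p example and split it into train/test.

    The task and all defaults are assumptions made for illustration.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    rng = random.Random(seed)                 # fixed seed for a reproducible split
    rng.shuffle(pairs)
    n_train = int(len(pairs) * train_frac)

    def to_example(pair):
        a, b = pair
        return (a, b), (a + b) % p

    train = [to_example(pair) for pair in pairs[:n_train]]
    test = [to_example(pair) for pair in pairs[n_train:]]
    return train, test


train_set, test_set = modular_addition_split()
```

Because the held‑out pairs never appear in training, test accuracy can only rise if the network has learned the addition rule rather than a lookup table, which is what makes the delayed jump in test accuracy informative.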
Proposed Experiments
- Arithmetic catapult – train a 1‑billion‑parameter transformer on a filtered set of hard arithmetic problems with a high‑LR cycle. Track error on a held‑out “hard” subset; look for a cross‑over where the catapulted model overtakes a standard baseline.
- Robust image classification – use CIFAR‑100 and a stress‑test set like ImageNet‑A. Apply the same schedule to a 10‑billion‑parameter CNN/MLP and measure adversarial accuracy.
- Scaling law sweep – vary model size (1B‑100B) and number of LR cycles while keeping data constant, to see whether the exponent on the hard‑task error curve improves.
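For the scaling‑law sweep, the exponent on the hard‑task error curve can be estimated with an ordinary least‑squares fit in log‑log space. The sketch below assumes the error follows a pure power law, error ≈ c · size^(−alpha), which is a modeling assumption rather than something the proposal states; `fit_power_law` is a hypothetical helper, and the demo numbers are synthetic values generated from a known exponent, not experimental results.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error ~= c * size**(-alpha) by least squares on log-log data.

    Returns (alpha, c). Assumes all sizes and errors are positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic check: data generated with alpha = 0.1 and c = 2.0 (made-up values).
demo_sizes = [1e9, 1e10, 1e11]
demo_errors = [2.0 * s ** -0.1 for s in demo_sizes]
alpha, c = fit_power_law(demo_sizes, demo_errors)
```

Fitting one exponent per number of LR cycles and comparing them would show whether additional cycles steepen the curve, which is the question the sweep is meant to answer.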
Architectural Considerations
- Dense Transformers are the natural first choice because they already support large parameter counts and have well‑studied scaling laws.
- MLPs may benefit even more: they are parameter‑efficient, but traditionally overfit. The catapult schedule could suppress their pathological memorization and let them learn convolution‑like features.
- Mixture‑of‑Experts is risky; sparsity reduces effective parameter count, potentially moving the model back into the memorization regime.
Hardware Implications
Catapult training is serial: each LR cycle depends on the previous one, and the model must see many epochs over the same small dataset. This diminishes the advantage of massive GPU farms and raises the value of low‑latency, high‑throughput hardware such as Cerebras Wafer‑Scale Engines or future photonic accelerators. Faster per‑step compute would directly reduce wall‑clock time for the many cycles required.
Economic and Strategic Impact
If a catapulted model achieves human‑level robustness and generalization, its economic profile diverges from today’s LLMs:
- Higher upfront compute cost (training a 100‑trillion‑parameter model) but lower downstream inference cost because the model can be aggressively pruned after the catapult phase.
- Competitive moat – the specialized training recipe would be hard to replicate without substantial trial‑and‑error compute, giving first‑movers a durable advantage.
- Alignment benefits – a model that learns algorithms rather than memorized heuristics may be easier to interpret and verify, providing a cleaner substrate for safety research.
Alignment and Interpretability
A catapulted network’s core algorithmic component should be far smaller than its over‑parameterized shell. After training, one could apply pruning, distillation, or mechanistic interpretability to extract the generalizable sub‑network. This could:
- Reduce the risk of hidden deceptive behaviors that arise from memorized shortcuts.
- Offer a clearer target for formal verification of safety properties.
- Enable more transparent fine‑tuning on moral or policy‑critical data.
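As one illustration of the extraction step, the sketch below applies global magnitude pruning to a flat list of weights. Magnitude pruning is only one of many pruning criteria and is our choice for the example, not a method the proposal names; `magnitude_prune` and `keep_frac` are hypothetical names.

```python
def magnitude_prune(weights, keep_frac):
    """Zero out all but the largest-magnitude `keep_frac` fraction of weights.

    A deliberately simple stand-in for the pruning step; real pipelines
    operate on tensors and usually fine-tune after pruning.
    """
    if not 0.0 <= keep_frac <= 1.0:
        raise ValueError("keep_frac must be in [0, 1]")
    k = int(len(weights) * keep_frac)         # number of weights to keep
    # Rank indices by absolute value so ties are resolved deterministically.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


pruned = magnitude_prune([0.1, -2.0, 0.5, 3.0], keep_frac=0.5)
# pruned == [0.0, -2.0, 0.0, 3.0]
```

If the hypothesis holds, accuracy on held‑out hard examples should survive a much smaller `keep_frac` in a catapulted model than in a conventionally trained one; that comparison is a prediction to test, not an established result.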
Open Questions
- Parameter ceiling – at what size does the sample‑efficiency benefit saturate? Is there a practical upper bound before diminishing returns set in?
- Schedule design – how to balance LR amplitude, cycle length, and decay of weight norms for maximal basin‑jumping?
- Metric selection – standard perplexity masks the catapult effect; we need benchmarks that focus on hard examples, adversarial robustness, and low‑memorization scores.
- Biological plausibility – does the high‑LR “exploration” correspond to any known neurodevelopmental process (e.g., bursts of plasticity during critical periods)?
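On metric selection, one crude way to quantify a "low‑memorization score" is the share of n‑grams in a model's output that also occur verbatim in the training data. This proxy is an assumption for illustration and is not defined in the proposal; `ngram_set`, `memorization_score`, and the default `n=8` are hypothetical.

```python
def ngram_set(tokens, n):
    """Return the set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def memorization_score(generated, training_docs, n=8):
    """Fraction of distinct n-grams in `generated` found verbatim in training.

    Returns 0.0 when `generated` is shorter than n tokens.
    """
    generated_ngrams = ngram_set(generated, n)
    if not generated_ngrams:
        return 0.0
    training_ngrams = set()
    for doc in training_docs:
        training_ngrams |= ngram_set(doc, n)
    return len(generated_ngrams & training_ngrams) / len(generated_ngrams)


score = memorization_score("a b c d e".split(), ["a b c x y".split()], n=3)
# Only ("a", "b", "c") is shared, so one of three n-grams matches.
```

A score near zero alongside strong hard‑example accuracy would be consistent with the catapult story, while verbatim overlap alone cannot rule out paraphrased memorization.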
Take‑away
The “catapult” hypothesis reframes over‑parameterization from a wasteful excess into a necessary scaffold that lets a network traverse the loss landscape far enough to discover genuinely abstract, human‑like reasoning. By pairing this scaffold with aggressive cyclic learning‑rate schedules, heavy regularization, and tiny, high‑quality data, researchers could test whether the same principles that cause small nets to grok arithmetic can be scaled to the trillion‑parameter regime. If successful, the resulting models would be more sample‑efficient, robust to adversarial attacks, and potentially far easier to align and interpret, making this an attractive direction for both AI capability and safety research.
For further reading on grokking, cyclic learning rates, and isoperimetry, see the original papers linked throughout the discussion.
