TL;DR: Some people keep saying AI scaling is dead, but they've been wrong three times now. The pattern is simple: you scale massively first to discover what's possible (costs $100M+), then you optimize to make it cheap and put it on device. Only a handful of companies can afford the discovery phase; everyone else is deploying what those few built.
The scaling skeptics have been wrong three times now.
They'll be wrong again.
2021: Experts everywhere declaring neural scaling laws would plateau. GPT-3's 175 billion parameters represented peak sensible size, right? Then GPT-4 showed up with capabilities that smaller models couldn't even approach. The training cost jumped from $4.6 million to over $100 million[^1].
Scale won.
2023: The narrative shifted hard. "Efficient models" became the rallying cry across AI Twitter and research labs. Microsoft's Phi models, Mistral, specialized small models proving you didn't need massive parameter counts. Era of brute-force scaling? Over.
Except here's what nobody mentioned: those efficient models only existed because someone already spent hundreds of millions training massive teacher models first. The small models learned through knowledge distillation, inheriting representations from scaled-up predecessors.
Scale won again, just in disguise.
2024: OpenAI's o3 spent roughly $1.1 million in compute to score 87.5% on the ARC-AGI benchmark[^2]. The high-compute configuration used 172x more compute than the low-compute run for roughly 12 more percentage points of accuracy. Critics immediately called it unsustainable. Test-time compute opened a new dimension, though: not bigger models, longer thinking.
Scale won a third time, different direction.
The pattern couldn't be clearer. Yet every year, smart people bet against it.
What Sutton Actually Said
Back in March 2019, AI researcher Rich Sutton wrote something that made a lot of people uncomfortable. His essay "The Bitter Lesson"[^3] argued that general methods leveraging computation always beat approaches encoding human knowledge. Always.
The pattern showed up everywhere. Chess? Brute-force search beat decades of chess expertise. Go? AlphaGo's compute demolished hand-crafted heuristics. Speech recognition? Statistical methods crushed phonetic rules. Computer vision? Deep learning made feature engineering obsolete.
Sutton was blunt about it:
"The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning."[^3]
Six years later? The lesson's only gotten more bitter for those who ignored it.
The Two-Stage Pattern Nobody's Talking About
Here's what the whole scaling debate misses.
Scaling and efficiency aren't opposing forces. They're sequential stages.
Stage One: Scale to discover what's possible. Spend $100 million training GPT-4. Run AlphaFold on massive compute clusters. This feels wasteful. Giant overparameterized models searching solution spaces that smaller architectures literally cannot reach. But it's not waste — it's exploration. The only way to find capabilities you didn't know existed.
Stage Two: Optimize for deployment. Take what you discovered and make it practical. Distill large models into variants that fit on phones. Compress, quantize, prune. Move inference on-device where it makes sense.
You cannot skip Stage One.
Apple learned this the hard way. Their research with 30 model compression experts found that to deploy efficient models on-device, you first need to "drastically compress" large models trained at scale[^4]. A 3 billion parameter model trained from scratch finds different local minima than a 3B model distilled from a 70B teacher. The large model discovers representations the small model can't reach on its own.
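To make that dependency concrete, here's a minimal knowledge-distillation sketch in PyTorch. The layer sizes, temperature, and loss weighting are illustrative assumptions, not Apple's or anyone else's actual recipe; the point is only that the student's training objective is defined in terms of a teacher that already had to exist.

```python
# Minimal knowledge-distillation sketch (illustrative; not any lab's actual recipe).
# A small "student" learns from a frozen, larger "teacher" by matching its
# temperature-softened output distribution in addition to the ground-truth labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
T, alpha = 2.0, 0.5  # softening temperature and loss mix (assumed values)

def distillation_step(x, labels):
    with torch.no_grad():                      # teacher is frozen: Stage One already paid for
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between softened distributions transfers the teacher's "dark knowledge"
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data standing in for a real dataset:
x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
print(distillation_step(x, labels))
```

The teacher's softened output distribution carries information that hard labels alone don't (how wrong each wrong answer is), which is why a distilled student can land in minima a from-scratch model of the same size doesn't find.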
Look at Meta with Llama 3.2. They released 1B and 3B variants and everyone cheered about democratizing AI. But Meta already spent hundreds of millions on Stage One. Open-sourcing Stage Two actually amplifies the value of their training investment — builds an ecosystem, makes their compute spend more valuable, not less.
The training compute is the moat. The deployment is marketing.
Six Years of Vindication
Language Models
Training compute for AI models grew 4x per year[^5]. GPT-4 burned through 2.1 × 10²⁵ FLOPs during training[^1]. The cost? Over $100 million, up from GPT-3's $4.6 million[^1].
What did that money buy? Emergent capabilities that appeared out of nowhere at scale. Three-digit arithmetic went from impossible to trivial when models hit a specific size threshold. Models that couldn't count reliably could suddenly write production code and discuss academic papers coherently.
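For a sense of how fast that 4x-per-year trend compounds, here's a quick back-of-the-envelope extrapolation. It's my own arithmetic on the figures quoted above, not a projection from the cited sources.

```python
# Back-of-the-envelope extrapolation of the ~4x/year training-compute trend.
# Starting point and growth rate come from the text; the projection is illustrative only.
gpt4_flops = 2.1e25   # reported GPT-4 training compute
growth_per_year = 4.0

for years_out in range(1, 6):
    projected = gpt4_flops * growth_per_year ** years_out
    print(f"{years_out} year(s) later: ~{projected:.1e} FLOPs")
# Five years of 4x/year growth is roughly 1000x the GPT-4 training run.
```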
Protein Folding
AlphaFold crushed a 50-year-old grand challenge. Not by encoding biological principles — by throwing scaled deep learning at the structure space[^6]. By 2024, AlphaFold 3 predicted structures of complexes with DNA, RNA, ligands, showing at least 50% improvement over existing methods[^6].
Funny thing: the protein folding problem technically remains unsolved in terms of mechanism. AlphaFold predicts structures without revealing underlying rules[^6]. But the practical problem? Demolished by scale.
Reasoning Models
o3 represents test-time compute scaling. Scored 75.7% on ARC-AGI, with high-compute configs hitting 87.5%[^2]. Cost per evaluation at highest performance? Roughly $1.1 million[^2]. That's not training — that's inference. The model runs what looks like Monte-Carlo tree search over chains of thought. AlphaZero-style exploration, except at test time.
Scale again. Just moved from training to inference.
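OpenAI hasn't published how o3 actually searches, so treat the following as a toy illustration of the general test-time-scaling trade rather than a description of o3: sample more candidate reasoning chains per problem, verify them, and accuracy climbs with the compute budget.

```python
# Toy illustration of test-time compute scaling via best-of-N sampling.
# This is NOT how o3 works internally (that's unpublished); it only shows
# the general trade: spend more inference compute per problem, get more accuracy.
import random

def attempt_problem(correct_answer: int, p_correct: float = 0.3) -> int:
    """One sampled 'reasoning chain': right with probability p_correct, else a wrong guess."""
    if random.random() < p_correct:
        return correct_answer
    return correct_answer + random.randint(1, 10)

def solve_with_budget(correct_answer: int, n_samples: int) -> bool:
    """Sample n candidate answers and accept if a verifier (here: exact check) passes any."""
    candidates = [attempt_problem(correct_answer) for _ in range(n_samples)]
    return any(c == correct_answer for c in candidates)

random.seed(0)
trials = 2000
for budget in (1, 4, 16, 64):
    solved = sum(solve_with_budget(42, budget) for _ in range(trials))
    print(f"{budget:>3} samples/problem -> {solved / trials:.1%} solved")
# Accuracy climbs toward 100% as the per-problem compute budget grows,
# the same shape the o3 low- vs high-compute configurations trace out.
```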
The "Counter-Examples" That Prove the Rule
Every efficient breakthrough validates the pattern. Knowledge distillation needs large teacher models: you need the $100M training run before you can create the 3B student. LoRA makes fine-tuning cheaper; it does nothing about the cost of the original pretraining run. Small models like Phi-3 still trained on massive compute, just with better data.
These are Stage Two optimizations pretending to be Stage One alternatives.
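LoRA is a clean illustration of how lopsided the two stages are: you freeze the pretrained weights someone else paid for and train only a tiny low-rank update. A minimal sketch, with illustrative dimensions and rank:

```python
# Minimal LoRA-style adapter sketch (illustrative dimensions; not a production recipe).
# The pretrained weight W is frozen; only the low-rank factors A and B are trained,
# so fine-tuning touches a tiny fraction of the parameters someone else paid to pretrain.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)       # Stage One weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,} ({trainable / total:.2%})")
# For a 4096x4096 layer, rank-8 LoRA trains well under 1% of the weights.
```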
Stage Two Moves to Your Phone
The global on-device AI market was $8.6 billion in 2024. Projected to hit $36.64 billion by 2030 — that's 27.8% annual growth[^7]. In the U.S. alone, it's growing at 29.2%[^8].
This isn't evidence that scaling failed. It's evidence that scaled models matured enough to compress for deployment.
Research shows running AI inference locally on devices can actually be more resource-efficient than running it in the cloud[^9]. Apple's M3 Ultra chip (announced March 2025) delivers 2.6x the performance of the M1 Ultra[^8]. Smartphones now run 3B parameter models that needed data centers five years ago.
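The arithmetic behind that last claim is simple: weight memory scales linearly with parameter count and bits per weight, which is exactly what quantization exploits. The numbers below are generic, not tied to any specific phone or model.

```python
# Why a 3B-parameter model fits on a phone: rough weight-memory footprint at different precisions.
# Pure arithmetic; ignores activations, KV cache, and runtime overhead.
params = 3e9
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB of weights")
# fp32: ~12 GB (data-center territory); int4: ~1.5 GB (fits alongside a phone OS).
```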
But here's what matters: those on-device models only exist because someone paid for Stage One first.
Apple Intelligence? Distilled from massive teacher models trained with billions in compute. Amazon used over 1 million hours of unlabeled speech data to generate soft targets for Alexa's acoustic model[^10]. That $36 billion on-device market in 2030 exists downstream of hundreds of billions spent on training.
This pattern intensifies. Privacy-sensitive workloads (biometric auth, health monitoring, personal assistants) move on-device. Real-time applications (AR, autonomous vehicles) demand local processing. But the capabilities deployed locally were discovered at scale first.
Stage Two migrates to edge devices. Stage One gets more expensive.
Who Can Afford Stage One?
The two-stage pattern creates brutal economics.
Only a handful of orgs can afford frontier Stage One: OpenAI (Microsoft-backed), Google/DeepMind, Meta, Anthropic (Amazon and Google money), Apple. That's basically it. Some well-capitalized Chinese labs catching up. Everyone else? Optimizing deployments.
If you don't have $20M minimum (realistically $100M+) for training, you get four options:
- Rent via APIs — OpenAI, Anthropic, Google. You're paying for access to their Stage One investment.
- Fine-tune open source — Llama, Mistral. Someone else paid for Stage One, you optimize for your domain.
- Build applications — Most AI startups. You're not creating capabilities, you're deploying them creatively.
- Wait for commoditization — Each year, last year's frontier becomes cheap. You're always one generation behind.
None of these are bad strategies. But understand the constraint: you're fundamentally dependent on someone else's Stage One spend.
This isn't temporary. Next-gen models might cost $500M to train. Training compute for GPT-5 or real multimodal reasoning systems will dwarf current spending. The moat around Stage One widens.
Meanwhile, Stage Two gets radically more efficient. Better distillation, improved quantization, specialized inference chips. We'll run today's GPT-4 equivalent on phones within three years. The capability gap between frontier and commodity shrinks.
But the capability frontier keeps moving. Always requiring Stage One scale to discover.
What This Means for Everyday Stuff
On-device AI is getting practical for routine tasks. Your phone handles more each month:
- Email drafts and summaries (already here)
- Photo editing (shipping now)
- Real-time translation (improving fast)
- Health tracking insights (early days)
- Personal scheduling (coming soon)
The shift to on-device for everyday tasks makes economic sense. Cloud inference costs money on every query. On-device inference costs essentially nothing per query once the model ships. For high-frequency, low-stakes tasks, local wins.
But complex reasoning? Specialized knowledge work? That stays in the cloud. Writing code with cutting-edge models, analyzing complex documents, generating high-quality images, solving novel problems — these need frontier models that won't fit on devices for years.
The equilibrium: routine automation moves on-device (Stage Two), capability expansion stays in massive data centers (Stage One). Hybrid is reality. Your phone runs distilled models for common stuff. Cloud models handle complex requests. The orchestration happens invisibly.
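A rough sketch of what that invisible orchestration could look like is below. The task categories, thresholds, and handler names are hypothetical; real assistants use their own undisclosed routing logic.

```python
# Hypothetical hybrid router: cheap distilled model on-device, frontier model in the cloud.
# Task categories, thresholds, and handlers are illustrative assumptions,
# not any vendor's actual routing logic.
from dataclasses import dataclass

ON_DEVICE_TASKS = {"summarize_email", "translate", "edit_photo", "draft_reply"}

@dataclass
class Request:
    task: str
    prompt: str
    needs_fresh_knowledge: bool = False

def run_on_device(req: Request) -> str:
    return f"[3B distilled model] handled '{req.task}' locally, $0 marginal cost"

def run_in_cloud(req: Request) -> str:
    return f"[frontier model] handled '{req.task}' remotely, billed per token"

def route(req: Request) -> str:
    # Routine, self-contained tasks stay local (Stage Two output); anything needing
    # frontier reasoning or up-to-date knowledge goes to the data center (Stage One territory).
    if req.task in ON_DEVICE_TASKS and not req.needs_fresh_knowledge and len(req.prompt) < 4000:
        return run_on_device(req)
    return run_in_cloud(req)

print(route(Request("summarize_email", "Inbox thread about Q3 planning...")))
print(route(Request("write_code", "Implement a B-tree with range queries...")))
```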
The Pattern Won't Break
This mirrors every industry where exploration costs big money and deployment scales cheap.
Pharma? Spend $2 billion discovering a drug, then generics optimize production. Semiconductors? Billions designing new architectures, then manufacturing scales. Software platforms? Billions building Windows or iOS, then millions of cheap apps layer on top.
Expensive exploration creates capability. Optimization democratizes it. AI follows the same pattern, just faster.
Sutton's lesson was never just about what works technically. It's about the fundamental economics of intelligence. Human knowledge approaches create local maxima that feel satisfying but cap out fast. Compute-based approaches feel wasteful and unsatisfying but scale to capabilities nobody imagined.
Chess to Go to vision to language to reasoning. The pattern couldn't be clearer.
Betting against compute scaling has a perfect losing record over 70 years.
The next frontier won't escape scaling. It'll find new dimensions: training compute (still growing), test-time compute (emerging), recursive self-improvement (early research), maybe post-deployment learning at scale. The principle stays the same — more compute, more search, more learning wins.
Researchers who internalize this build breakthroughs. Those fighting it hit walls repeatedly.
The bitter lesson gets more bitter each year. Not because it's wrong. Because it's relentlessly, predictably, annoyingly right.
[^1]: What is the Cost of Training Large Language Models
[^2]: Scaling Up: How Increasing Inputs Has Made AI More Capable
[^3]: The Bitter Lesson by Rich Sutton
[^4]: Model Compression in Practice - Apple Machine Learning Research
[^5]: Can AI Scaling Continue Through 2030?
[^6]: Highly accurate protein structure prediction with AlphaFold
[^7]: On-Device AI Market Report - Global
[^8]: US On-Device AI Market Report
[^9]: Shifting AI Inference From Cloud to Phone Can Reduce Costs
[^10]: How Knowledge Distillation Works and When to Use It