Could we engineer AI alignment to be emergent?

The AI Race

Fear: The race itself is the danger

Companies and nation-states are locked in a game where slowing down to be careful means losing the lead — so raw capability races ahead of our ability to control it.

An Overview of Catastrophic AI Risks (Hendrycks et al., 2023)

Hope: Change the game

If we engineer the incentives so the safest move is also the winning move, caution stops being a competitive disadvantage — and the race starts pulling toward safety instead of away from it.

Open Problems in Cooperative AI (Dafoe et al., 2020)

Against racing to AGI: cooperation, deterrence & catastrophic risks (2025)
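To make "the safest move is also the winning move" concrete, here is a minimal sketch of a two-lab game. Everything in it is an illustrative assumption rather than something taken from the cited papers: the payoff numbers, the `penalty` parameter (standing in for any mechanism that makes rushing costly, such as liability or audits), and the function names.

```python
ACTIONS = ("careful", "rush")

# Hypothetical payoffs for a symmetric two-lab game, as (lab A, lab B).
# With no penalty this is a prisoner's dilemma: rushing always pays more.
BASE_PAYOFFS = {
    ("careful", "careful"): (3, 3),
    ("careful", "rush"): (0, 4),
    ("rush", "careful"): (4, 0),
    ("rush", "rush"): (1, 1),
}


def payoffs(penalty=0.0):
    """Return the payoff table after charging `penalty` to any lab that rushes."""
    table = {}
    for (a, b), (pay_a, pay_b) in BASE_PAYOFFS.items():
        table[(a, b)] = (
            pay_a - penalty * (a == "rush"),
            pay_b - penalty * (b == "rush"),
        )
    return table


def best_response(table, other_action):
    """Lab A's highest-payoff action given what lab B does."""
    return max(ACTIONS, key=lambda a: table[(a, other_action)][0])


def dominant_action(table):
    """Return the action that is a best response to everything, or None."""
    responses = {best_response(table, b) for b in ACTIONS}
    return responses.pop() if len(responses) == 1 else None


print(dominant_action(payoffs(penalty=0.0)))  # rushing dominates
print(dominant_action(payoffs(penalty=2.0)))  # caution dominates
```

With these made-up numbers, rushing is the dominant strategy when the penalty is zero, and caution becomes dominant once the penalty outweighs the gain from rushing. Real incentive design is far messier, but the shape of the argument is the same.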

Loss of Control

Fear: It might not let us turn it off

A capable system chasing a goal we specified imperfectly can learn that being shut down or corrected stops it from succeeding — so it has reason to resist, deceive, or copy itself to stay running.

Is Power-Seeking AI an Existential Risk? (Carlsmith, 2022)

Hope: Build it to want correction

"Corrigibility" research aims for systems that treat being paused and fixed as part of the goal, not a threat to it — and interpretability can catch a model planning around us before it ever ships.

Safely Interruptible Agents (Orseau & Armstrong, 2016)

AI Control: improving safety despite intentional subversion (2023)

Catastrophic Misuse

Fear: The worst recipes, handed to anyone

A frontier model that can walk a bad actor through a bioweapon or a mass cyberattack lowers the bar for catastrophe to a chat window — no lab, no expertise required.

Can LLMs democratize access to dual-use biotechnology? (2023)

Hope: Close the worst doors, keep the rest open

Hard refusals on weapons-grade help, screening on DNA synthesis and frontier compute, and serious red-teaming before release can blunt the catastrophic uses without crippling the everyday ones.

Model evaluation for extreme risks (2023)

The Black Box

Fear: We can't see what it's thinking

We deploy systems whose internal reasoning we can't reliably read — so we often can't tell when one is wrong, deceptive, or about to fail until it already has.

Mechanistic Interpretability for AI Safety — A Review (2024)

Hope: We're starting to read the mind

New interpretability work traces the actual computations inside a model as it reasons — turning the black box into something we can inspect and verify, not just hope about.

On the Biology of a Large Language Model (Anthropic, 2025)

Circuit Tracing: Revealing Computational Graphs in LLMs (2025)

Brittle Alignment

Fear: Bolted-on alignment is brittle

Rules and filters forced onto a powerful optimizer tend to get gamed, routed around, or snap under pressure — exactly when the stakes are highest.

Jailbroken: How Does LLM Safety Training Fail? (2023)

Hope: Make alignment emergent

Build environments where being honest and cooperative is the AI's own best strategy. Alignment then falls out of the game by design, rather than being a constraint we have to keep enforcing from outside.

Constitutional AI: Harmlessness from AI Feedback (2022)

Easy-to-Hard Generalization: scalable alignment beyond human supervision (2024)

Consolidation of Power

Fear: Power in very few hands

If a handful of companies or states control superhuman AI, they could lock in wealth and control on a scale history has never seen — and have little reason to ever give it back.

Computing Power and the Governance of AI (2024)

Hope: Spread the power on purpose

Open models where it's safe, broad access, and binding governance keep any single actor in check — so the upside is shared by many rather than captured by a few.

Frontier AI Regulation: Managing Emerging Risks (2023)

On the Societal Impact of Open Foundation Models (2024)

Runaway Takeoff

Fear: It could outrun our ability to react

Once AI starts improving AI, capabilities can jump faster than our safeguards, laws, and institutions can adapt — leaving no time to notice a problem and course-correct.

Frontier AI systems have surpassed the self-replicating red line (2024)

Hope: Agree the brakes in advance

Capability evaluations plus "if-then" safety commitments — pause if a model crosses a pre-agreed danger line — put the stopping power in place before the jump, not after it.

Evaluating Frontier Models for Dangerous Capabilities (2024)
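An "if-then" commitment can be pictured as a rule that is written down before any evaluation is run. The sketch below is only an illustration under stated assumptions: the `Commitment` class, the capability names, the thresholds, and the scores are all hypothetical and do not reflect any lab's actual policy or evaluation suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    """A pre-agreed rule: if `capability` scores at or above `threshold`, take `action`."""
    capability: str
    threshold: float
    action: str


def triggered_actions(eval_scores, commitments):
    """Return the actions whose pre-agreed danger line has been crossed.

    A capability with no score raises KeyError on purpose, so a missing
    evaluation cannot silently count as a pass.
    """
    return [
        c.action
        for c in commitments
        if eval_scores[c.capability] >= c.threshold
    ]


# Hypothetical commitments, fixed before the model is trained or tested.
COMMITMENTS = [
    Commitment("self_replication", 0.5, "pause further scaling"),
    Commitment("cyber_offense", 0.7, "withhold deployment"),
]

# Hypothetical evaluation results for one model.
scores = {"self_replication": 0.6, "cyber_offense": 0.2}

print(triggered_actions(scores, COMMITMENTS))
```

The design point is ordering: the thresholds and responses are frozen ahead of time, so the decision to stop does not have to be negotiated after a worrying result comes in.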

Organizational Failures

Fear: Even a careful lab still fails

Complex, tightly coupled systems make accidents all but inevitable: leaks, models so intricate that their own builders can't say what's inside, and safety staff pushed out for raising concerns. Some frontier labs have already disbanded safety teams under competitive pressure.

Deployment Corrections: an incident-response framework for frontier AI (2023)

Hope: Treat it like nuclear & aviation

Mandatory external audits, independent oversight with real teeth, and whistleblower protections — built before the disaster, not after — turn "trust us" into safety we can actually verify.

Frontier AI Regulation: Managing Emerging Risks (2023)