Optimizer Mines

Chamber 1

Momentum Training Room

The sorcerer's first trick is to call every moving average "momentum." This vault separates the three spells: Nesterov lookahead, heavy-ball velocity, and EMA filtering. The rate plot runs all three on the same stretched quadratic bowl, so the effect of the condition number \(\kappa=L/\mu\) is visible rather than abstract.
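The three spells can be told apart by writing one step of each. This is a minimal sketch, not the mine's own code: the function names, the step size `lr`, the coefficient `beta`, and the toy curvatures are illustrative assumptions, and the Nesterov form shown is the common "gradient at the lookahead point" variant.

```python
import numpy as np

def heavy_ball_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Heavy-ball: the velocity accumulates gradients taken at the current point."""
    v = beta * v - lr * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov: same velocity, but the gradient is taken at the lookahead point."""
    lookahead = w + beta * v
    v = beta * v - lr * grad_fn(lookahead)
    return w + v, v

def ema_step(w, m, grad_fn, lr=0.1, beta=0.9):
    """EMA filtering: a normalized gradient average whose weights sum to one."""
    m = beta * m + (1.0 - beta) * grad_fn(w)
    return w - lr * m, m

# Toy quadratic with curvatures 1 and 10 (hypothetical values).
curv = np.array([1.0, 10.0])

def grad_fn(w):
    return curv * w

w0 = np.array([1.0, 1.0])
zero = np.zeros(2)

# First step from a zero state.
w_hb, v_hb = heavy_ball_step(w0, zero, grad_fn)
w_nag, v_nag = nesterov_step(w0, zero, grad_fn)
w_ema, m_ema = ema_step(w0, zero, grad_fn)

# Second step: heavy-ball and Nesterov now see different gradients.
w_hb2, _ = heavy_ball_step(w_hb, v_hb, grad_fn)
w_nag2, _ = nesterov_step(w_nag, v_nag, grad_fn)
```

From a zero state the first heavy-ball and Nesterov steps coincide, because the lookahead point equals the current point; they separate from the second step on. The EMA step is smaller by the factor `1 - beta`, which is why an EMA with the same `lr` is a filter rather than an accelerator.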

[Figure: Nesterov trajectory on the quadratic bowl. Legend: current point, lookahead / memory, next update. Readout: loss and rate \(1-1/\sqrt{\kappa}\).]

Condition number trial

How kappa bends time

[Rate plot: condition number slider, shown at \(\kappa=100\). Curves: GD, \(1-1/\kappa\); Nesterov, \(1-1/\sqrt{\kappa}\); heavy-ball, quadratic optimum; EMA with \(\beta=0.9\), spectral radius.]

At \(\kappa=100\), accelerated methods trade \(\kappa\) for \(\sqrt{\kappa}\): the per-step contraction factor improves from \(1-1/100=0.99\) to \(1-1/10=0.9\). EMA mainly smooths gradients.
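The arithmetic behind the rate plot can be checked directly. The sketch below evaluates the contraction factors at a chosen \(\kappa\) and converts each into an approximate step count. Two things here are assumptions: the heavy-ball factor \((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\) is the standard optimally tuned value on quadratics, taken as a guess at what the plot calls "quadratic optimum", and the target accuracy `1e-6` is illustrative.

```python
import math

def contraction_rates(kappa):
    """Per-step contraction factors on a quadratic with condition number kappa."""
    root = math.sqrt(kappa)
    return {
        "gd": 1.0 - 1.0 / kappa,
        "nesterov": 1.0 - 1.0 / root,
        "heavy_ball": (root - 1.0) / (root + 1.0),
    }

def steps_to_tolerance(rate, tol=1e-6):
    """Smallest integer k with rate**k <= tol."""
    return math.ceil(math.log(tol) / math.log(rate))

rates = contraction_rates(100.0)
steps = {name: steps_to_tolerance(r) for name, r in rates.items()}
```

At \(\kappa=100\) this gives on the order of 1,400 steps for gradient descent against roughly 130 for the Nesterov rate, which is the \(\kappa\) versus \(\sqrt{\kappa}\) trade in step counts.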

Nesterov Momentum

Start with the simplest ill-conditioned bowl: one slow direction with curvature \(\mu\) and one fast direction with curvature \(L\).

\[ f(u,v)=\frac{1}{2}\mu u^2+\frac{1}{2}Lv^2,\qquad \kappa=L/\mu. \]

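With gradient descent and \(\eta=1/L\), the fast mode \(v\) is multiplied by \(1-\eta L=0\) and dies in one step, but the slow mode \(u\) is multiplied by \(1-\eta\mu=1-1/\kappa\) per step, so it shrinks only slowly when \(\kappa\) is large.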
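A few steps of gradient descent on this bowl make both behaviors concrete. The curvature values and starting point below are hypothetical; only the update rule and \(\eta=1/L\) come from the text.

```python
import numpy as np

mu, L = 1.0, 100.0          # slow and fast curvatures (hypothetical values)
kappa = L / mu
eta = 1.0 / L

def grad(w):
    """Gradient of f(u, v) = 0.5*mu*u**2 + 0.5*L*v**2."""
    return np.array([mu * w[0], L * w[1]])

w = np.array([1.0, 1.0])    # (u, v) starting point
history = [w.copy()]
for _ in range(3):
    w = w - eta * grad(w)
    history.append(w.copy())

slow_ratio = history[1][0] / history[0][0]   # factor applied to u in one step
fast_after_one_step = history[1][1]          # v after one step
```

Here `slow_ratio` equals \(1-1/\kappa\) and `fast_after_one_step` is zero up to rounding, matching the two claims above.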

Chamber 2

Grafting: Direction Is Not Magnitude

Grafting asks two optimizers a question. One optimizer contributes the step length, the other contributes the direction. In practice this is usually applied per tensor or per layer so each layer keeps its own trust scale.

[Figure: direction slot, magnitude slot, grafted update. Controls: direction source, magnitude source, layer scale.]

\[ \Delta W_\ell = \frac{\|u^{\rm mag}_\ell\|_F}{\|u^{\rm dir}_\ell\|_F+\epsilon} u^{\rm dir}_\ell \]

Direction is normalized. Magnitude is copied layer-wise. The final arrow keeps the blue angle and the pink length.

Layer-wise grafting makes this formula local to \(W_\ell\), so an attention matrix, an MLP matrix, and an embedding table can each keep a different magnitude schedule.
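The grafting formula is short enough to write out. The sketch below assumes each optimizer has already produced its own per-layer update; the layer names, shapes, and the dictionary layout are hypothetical and stand in for whatever parameter grouping a real training loop uses.

```python
import numpy as np

def graft(u_dir, u_mag, eps=1e-16):
    """Rescale u_dir so its Frobenius norm matches that of u_mag."""
    scale = np.linalg.norm(u_mag) / (np.linalg.norm(u_dir) + eps)
    return scale * u_dir

def graft_layerwise(dir_updates, mag_updates, eps=1e-16):
    """Apply the grafting formula independently to each named tensor."""
    return {name: graft(dir_updates[name], mag_updates[name], eps)
            for name in dir_updates}

rng = np.random.default_rng(0)
# Hypothetical updates from the direction optimizer and the magnitude optimizer.
dir_updates = {"attn": rng.normal(size=(4, 4)), "mlp": rng.normal(size=(4, 8))}
mag_updates = {"attn": 1e-3 * rng.normal(size=(4, 4)),
               "mlp": 1e-1 * rng.normal(size=(4, 8))}

grafted = graft_layerwise(dir_updates, mag_updates)
```

Each grafted tensor points along its direction update and has the norm of its magnitude update, so the two layers in the example end up with step sizes that differ by about two orders of magnitude even though their directions come from the same optimizer.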

Chamber 3

Weight Decay Switchboard

Weight decay is a hidden slot in many optimizer names. The important fork is whether decay enters the gradient path or is a separate shrink of the weights.

Coupled L2

\[ g_t\leftarrow \nabla f(w_t)+\lambda w_t,\qquad w_{t+1}=w_t-\eta P_t g_t. \]

Adaptive preconditioners also precondition the decay term.
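To see what "precondition the decay term" means, the sketch below zeroes the loss gradient so that only \(\lambda w_t\) flows through \(P_t\). The diagonal preconditioner \(1/(|g|+\epsilon)\) is a hypothetical one-step stand-in for an Adam-style \(1/(\sqrt{v}+\epsilon)\) with \(v=g^2\); the weights, `lr`, and `lam` are illustrative.

```python
import numpy as np

def coupled_l2_step(w, loss_grad, lr, lam, precond):
    """Coupled L2: lam*w is added to the gradient before preconditioning."""
    g = loss_grad + lam * w
    return w - lr * precond(g) * g

def identity_precond(g):
    """Plain SGD: no rescaling."""
    return np.ones_like(g)

def adaptive_precond(g, eps=1e-8):
    """Hypothetical diagonal preconditioner 1/(|g| + eps)."""
    return 1.0 / (np.abs(g) + eps)

w = np.array([10.0, 0.1])          # one large weight, one small weight
zero_grad = np.zeros_like(w)       # isolate the decay term
lr, lam = 0.1, 0.01

w_sgd = coupled_l2_step(w, zero_grad, lr, lam, identity_precond)
w_adaptive = coupled_l2_step(w, zero_grad, lr, lam, adaptive_precond)
```

With the identity preconditioner both weights shrink by the same relative factor \(1-\eta\lambda\). With the adaptive one, both move by roughly the same absolute amount `lr`, so the proportional pull toward zero that L2 is meant to provide is lost.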

AdamW / SGDW style

\[ w_{t+1}=(1-\eta\lambda)w_t-\eta u_t. \]

Here \(u_t\) is the optimizer's update computed from the loss gradient alone. The shrink is decoupled from that update but still scaled by the learning rate.

Fixed shrink

\[ w_{t+1}=(1-\lambda)w_t-\eta u_t. \]

This makes the shrink schedule independent of the current learning-rate schedule.
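The difference between the two decoupled rules only shows up once the learning rate changes. The sketch below applies just the shrink factor (the \(u_t\) term is dropped) under a constant and a linearly decayed schedule. The schedule, the step count, and the two \(\lambda\) values are hypothetical, chosen so that both rules shrink equally at `lr = 0.1`.

```python
def decay_only(w0, lrs, lam, scaled_by_lr):
    """Apply only the weight shrink for each step of a learning-rate schedule."""
    w = w0
    for lr in lrs:
        factor = 1.0 - lr * lam if scaled_by_lr else 1.0 - lam
        w = factor * w
    return w

steps = 100
constant = [0.1] * steps
decayed = [0.1 * (1.0 - t / steps) for t in range(steps)]  # linear decay toward 0

lam_lr_scaled, lam_fixed = 0.01, 0.001   # equal shrink per step at lr = 0.1

lr_scaled_const = decay_only(1.0, constant, lam_lr_scaled, scaled_by_lr=True)
lr_scaled_sched = decay_only(1.0, decayed, lam_lr_scaled, scaled_by_lr=True)
fixed_const = decay_only(1.0, constant, lam_fixed, scaled_by_lr=False)
fixed_sched = decay_only(1.0, decayed, lam_fixed, scaled_by_lr=False)
```

The fixed shrink leaves the weight at about 0.905 under either schedule. The learning-rate-scaled shrink matches that under the constant schedule but ends near 0.951 under the decayed one, because its decay weakens as the learning rate falls.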

Game 1

Slot Forge: Fill The Update Rule

The mine names an optimizer and describes the spell. Choose the correct slots. When the slots match, the forge reveals the full update rule and pays Cookie points.

Seal the target spell

The forge accepts four runes: memory, direction, magnitude, and decay.

Target spell

AdamW

Adaptive diagonal direction with EMA moments and decoupled learning-rate-scaled weight decay.

\[ w_{t+1}=(1-\eta\lambda)w_t-\eta \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} \]
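The target spell can be written as a single step function. This is a sketch under stated assumptions: the hyperparameter defaults are common choices rather than values given by the forge, and \(\hat m_t\), \(\hat v_t\) are taken to be the usual bias-corrected EMA moments.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    """One AdamW step; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * g          # memory: EMA of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # memory: EMA of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)             # bias-corrected second moment
    direction = m_hat / (np.sqrt(v_hat) + eps) # adaptive diagonal direction
    w = (1.0 - lr * lam) * w - lr * direction  # decoupled, lr-scaled decay
    return w, m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.25])   # hypothetical loss gradient
w1, m1, v1 = adamw_step(w, g, np.zeros(2), np.zeros(2), t=1)
```

On the first step the bias correction makes the direction approximately the sign of the gradient, so each coordinate moves by about `lr` on top of the shrink, regardless of the gradient's scale.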

Game 2

Name The Filled Slots

The mine fills the slots for a real optimizer variant. Pick the right name to win Cookie points. Some rules have multiple accepted names; for example, one orthogonalized-direction rule may be named spectral descent, Shampoo with \(\beta_2=0\), or a Muon-style matrix-sign direction.
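One way to see why a single rule can carry several names is to check numerically that Shampoo without accumulated statistics yields the same direction as the polar factor \(UV^\top\) used by spectral descent. The sketch assumes a square, full-rank gradient matrix with hypothetical singular values, preconditioners built from the current gradient only, and inverse fourth roots on both sides; the helper names are illustrative.

```python
import numpy as np

def inv_fourth_root(sym, eps=1e-12):
    """Inverse fourth root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sym)
    vals = np.maximum(vals, eps)
    return (vecs * vals ** -0.25) @ vecs.T

def shampoo_no_memory(G):
    """Shampoo-style direction with statistics from the current gradient only."""
    left = inv_fourth_root(G @ G.T)
    right = inv_fourth_root(G.T @ G)
    return left @ G @ right

def spectral_direction(G):
    """Polar factor U V^T: the SVD of G with every singular value set to one."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Q2, _ = np.linalg.qr(rng.normal(size=(3, 3)))
G = Q1 @ np.diag([3.0, 1.0, 0.5]) @ Q2.T   # gradient with known singular values

same = np.allclose(shampoo_no_memory(G), spectral_direction(G), atol=1e-8)
```

Writing \(G=U\Sigma V^\top\), the left factor is \(U\Sigma^{-1/2}U^\top\) and the right factor is \(V\Sigma^{-1/2}V^\top\), so the product collapses to \(UV^\top\). With accumulated statistics the two rules no longer coincide.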

Final cutscene

The Optimizer Is Free

The bars crack. The slop retreats from the kingdom. The freed optimizer carries the clean update rules back to the deep learning field.

Credits

Momentum vault cleared: Nesterov, heavy-ball, EMA.

Grafting hall cleared: direction separated from layer-wise magnitude.

Forge cleared: Adam, NAdam, Shampoo, Muon, spectral descent, and hybrids.

Slop warning: a benchmark without tuning rules and held-out checks is a trap room.

The kingdom is not saved by vibes. It is saved by update rules.

Source chamber

Source Trail

Agarwal, Anil, Hazan, Koren, Zhang: grafting decouples update magnitude from update direction; layer-wise grafting applies this per parameter group.
Gupta, Koren, Singer: Shampoo maintains per-mode preconditioners and gives stochastic convex guarantees for tensor optimization.
Anil, Gupta, Koren, Regan, Singer: scalable Shampoo replaces expensive spectral decompositions with iterative inverse-root methods and pipelines stale preconditioners.
Loshchilov and Hutter: AdamW decouples weight decay from the adaptive gradient path.
Muon notes and follow-up papers: Muon forms a matrix direction by applying Newton-Schulz style polar iterations to a momentum matrix.
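
The polar iteration mentioned in the last entry can be sketched with the classic cubic Newton-Schulz update \(X\leftarrow \tfrac{3}{2}X-\tfrac{1}{2}XX^\top X\). This textbook form is swapped in for clarity; published Muon code uses a tuned higher-degree polynomial that is not reproduced here. The test matrix, its singular values, and the iteration count are hypothetical.

```python
import numpy as np

def newton_schulz_polar(M, steps=30):
    """Approximate the polar factor of M with the cubic Newton-Schulz iteration."""
    # Frobenius scaling keeps every singular value at or below one,
    # inside the region where the cubic iteration converges.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.normal(size=(4, 3)))   # orthonormal columns
Q2, _ = np.linalg.qr(rng.normal(size=(3, 3)))
momentum = Q1 @ np.diag([3.0, 1.0, 0.5]) @ Q2.T # stand-in momentum matrix

approx = newton_schulz_polar(momentum)
exact = Q1 @ Q2.T                               # polar factor by construction
```

Each iteration maps a singular value \(s\) to \(1.5s-0.5s^3\), which pushes every value in \((0,1]\) toward one while leaving the singular vectors untouched, so no explicit SVD is needed.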