optimizer field note
Muon vs Shampoo at \(\beta_2=0\)
Same momentum input, same sign direction, different graft magnitude.
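\(Q_t=\operatorname{sign}(U_t)\), the matrix sign (polar factor) of the momentum \(U_t\): for a compact SVD \(U_t=A\Sigma B^\top\), \(\operatorname{sign}(U_t)=AB^\top\). Newton-Schulz and Shampoo at \(\beta_2=0\) target this same polar factor.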
The square-root correction is the per-layer scale \(a_\ell\) in the update \(\Delta W_\ell=-a_\ell Q_{t,\ell}\).
same direction, faster curve
Green and red share \(Q_t\); grafting sets scale.
preconditioning Newton-Schulz / Muon
Matrix multiplications alone approximate the polar factor; the inner loop needs no inverse root.
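A minimal sketch of the multiplication-only idea, assuming the classic cubic Newton-Schulz map \(X\leftarrow 1.5X-0.5XX^\top X\) rather than Muon's tuned polynomial (which this note does not specify). The function name, step count, and test matrix are illustrative choices, not part of the note.

```python
import numpy as np

def newton_schulz_polar(U, steps=20):
    """Approximate the polar factor A B^T of U using only matrix products.

    Assumption: the textbook cubic iteration X <- 1.5 X - 0.5 X X^T X.
    Muon's own polynomial coefficients are different and not shown here.
    """
    # Frobenius scaling puts every singular value in (0, 1], inside the
    # convergence region (0, sqrt(3)) of the cubic map.
    X = U / (np.linalg.norm(U) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Test matrix with known factors: U = A diag(3, 1, 0.5) B^T.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((4, 3)))  # 4x3, orthonormal columns
B, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # 3x3, orthogonal
U = A @ np.diag([3.0, 1.0, 0.5]) @ B.T

Q_ns = newton_schulz_polar(U)
polar_error = np.abs(Q_ns - A @ B.T).max()  # distance to sign(U) = A B^T
```

Each step pushes every singular value toward 1 while leaving the singular vectors untouched, so the iterate approaches \(AB^\top\) without any eigendecomposition or inverse root.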
preconditioning Shampoo, \(\beta_2=0\)
On the nonzero singular subspace, matrix Shampoo also returns \(AB^\top\).
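The claim can be checked numerically. The sketch below assumes the standard matrix-Shampoo form \(L^{-1/4}\,U\,R^{-1/4}\) with unaccumulated statistics \(L=UU^\top\), \(R=U^\top U\), and takes the inverse roots only on the nonzero eigen-subspace. With \(U=A\Sigma B^\top\) this gives \(A\Sigma^{-1/2}\Sigma\,\Sigma^{-1/2}B^\top=AB^\top\). Helper names and the tolerance are hypothetical.

```python
import numpy as np

def pinv_fourth_root(M, rel_tol=1e-10):
    """M^{-1/4} of a PSD matrix, restricted to its nonzero eigen-subspace."""
    w, V = np.linalg.eigh(M)
    keep = w > rel_tol * w.max()          # drop (numerically) zero eigenvalues
    inv_root = np.zeros_like(w)
    inv_root[keep] = w[keep] ** -0.25
    return (V * inv_root) @ V.T           # V diag(inv_root) V^T

def shampoo_beta2_zero(U):
    """Shampoo direction when the statistics use only the current input."""
    L = U @ U.T   # left statistic, no accumulation (beta_2 = 0)
    R = U.T @ U   # right statistic
    return pinv_fourth_root(L) @ U @ pinv_fourth_root(R)

# Small full-column-rank example, so L (4x4) has one zero eigenvalue.
U = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 1.0]])
A, _, Bt = np.linalg.svd(U, full_matrices=False)

Q_shampoo = shampoo_beta2_zero(U)
shampoo_error = np.abs(Q_shampoo - A @ Bt).max()  # distance to A B^T
```

Unlike the Newton-Schulz route, this path needs an eigendecomposition (or another inverse-root routine) for every statistic.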
direction Equal component
\[ Q_t^{\rm NS}\approx Q_t^{\rm Shampoo}(\beta_2=0)=\operatorname{sign}(U_t)=AB^\top. \]
grafting Different magnitude
Shampoo: \(a_\ell=\eta\sqrt{N_\ell}\)
Muon: \(a_\ell=\eta\sqrt{\max(1,r_\ell/c_\ell)}\)
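As a rough illustration of the two graft scales, the sketch below evaluates both formulas for one layer. It assumes \(r_\ell\) and \(c_\ell\) are the row and column counts of the weight matrix; \(N_\ell\) is not defined in this note, so it is passed through as a caller-supplied number. Function names and the example values are hypothetical.

```python
import math

def muon_scale(lr, rows, cols):
    """a = lr * sqrt(max(1, r / c)); assumes r = rows, c = columns."""
    return lr * math.sqrt(max(1.0, rows / cols))

def shampoo_scale(lr, n):
    """a = lr * sqrt(N); the meaning of N is left to the caller."""
    return lr * math.sqrt(n)

lr = 0.02
tall = muon_scale(lr, rows=4096, cols=1024)   # r/c = 4, so the scale doubles
wide = muon_scale(lr, rows=1024, cols=4096)   # r/c < 1, so the scale stays lr
graft = shampoo_scale(lr, n=16)               # illustrative N only
```

Both rules multiply the same direction \(Q_{t,\ell}\); only the scalar in front differs, which is why the two methods can share a direction yet follow different training curves.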