Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch
- Parallax is the new Local Linear Attention that keeps Softmax but adds a learned covariance correction. It replaces LLA’s per-query solver with a projector, doubling arithmetic intensity on Hopper GPUs without extra I/O. It beats FlashAttention 2/3 in decode and improves perplexity at 0.6B/1.7B scales. But here’s the kicker: it only works with Muon optimizer. Under AdamW, the gains vanish because the model suppresses the correction branch. So, if you’re still using AdamW, this is just expensive math for nothing. The moonboys will hype it anyway, but until Muon becomes standard, Parallax is just a pretty trick for researchers who actually read their loss curves.