Looped Transformers seem interesting. I have a few questions:
- Would BitNet/Ternary LM be compatible with this design?
- Can this be accelerated with linear attention / DeltaNet / SSM?
- Would embedding scaling like in LongCat/Gemma yield better results?
- Are compressed attention mechanisms like the ones in DeepSeek-V4 applicable?
- Is it feasible to start caring about interleaved thinking or RLMs?
- Would HPO or switching to an optimizer like Muon make training faster?