Provide Q&A/Wiki for this project

Looped Transformers seem interesting. I have a few questions:

- Would a BitNet/ternary LM be compatible with this design?
- Can this be accelerated with linear attention / DeltaNet / SSM?
- Would embedding scaling like LongCat/Gemma yield better results?
- Are compressed attention mechanisms like the ones in DeepSeek-V4 applicable?
- Is it feasible to start caring about interleaved thinking or RLMs?
- Would HPO or switching to an optimizer like Muon make training faster?

Provide Q&A/Wiki for this project #68
