Provide Q&A/Wiki for this project #68

Description

@TomLucidor

Looped Transformers seem interesting. I have a few questions:

  • Would BitNet/ternary LMs be compatible with this design?
  • Can this be accelerated with linear attention / DeltaNet / SSM?
  • Would embedding scaling like LongCat/Gemma yield better results?
  • Is compressed attention, like the one in DeepSeek-V4, applicable?
  • Is it feasible to start caring about interleaved thinking or RLMs?
  • Would HPO or switching to an optimizer like Muon make training faster?
