Hi, thanks for your great work on vision model interpretability! I also really enjoyed the findings in the latest "Into the Rabbit Hull" paper!
I'm wondering whether it's doable to process embeddings from feed-forward 3D backbones (e.g. the MASt3R family, VGGT, etc.), which are also attention-based models; it would be super interesting to reveal what 3D concepts they have learned. Is there anything to pay attention to when switching to 3D vision models?
Meanwhile, I'm going to try training an RA-SAE on VGGT's embeddings (the attention-layer output), and I'd be glad to share it if anyone finds it interesting.
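
In case it helps the discussion, here is a rough sketch of how I'm thinking of collecting those embeddings, just a generic PyTorch forward-hook setup; the model-loading call and the `blocks[-1].attn` layer path are placeholders, not the actual VGGT API:

```python
import torch

def collect_embeddings(model, images, layer):
    """Run a frozen backbone and return the chosen layer's output tokens."""
    feats = []

    def hook(module, inputs, output):
        # Assumes the hooked layer returns a (batch, num_tokens, dim) tensor;
        # some attention modules return tuples, in which case index into it.
        feats.append(output.detach().cpu())

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(images)  # forward pass only for the side effect of the hook
    handle.remove()
    # Flatten to (batch * num_tokens, dim) so tokens can be fed to the SAE
    return torch.cat(feats, dim=0).flatten(0, 1)

# Usage (placeholder names, not the real VGGT loading code):
# model = load_vggt_checkpoint(...)
# tokens = collect_embeddings(model, image_batch, model.blocks[-1].attn)
# ...then fit the RA-SAE on `tokens`, the same way as with 2D ViT features.
```

Does this kind of per-token extraction match what you did for the 2D backbones, or did you handle the tokens differently before training the SAE?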