📚 Daily Paper Review - 2026-05-23
Found 10 relevant papers today. Please review and approve/reject.
1. MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
Score: 5.4/10 | arXiv: 2605.22818v1
Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: ICML (10/10)
- 💻 Code: ✅ Available
AI Summary:
Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation p...
Key Contributions:
- Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete.
- Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences.
- To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
2. AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Score: 5.2/10 | arXiv: 2605.22816v1
Authors: Wenxuan Guo, Xiuwei Xu, Yichen Liu...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ✅ Available
AI Summary:
Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. ...
Key Contributions:
- Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment.
- While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene.
- Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
3. Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Score: 4.9/10 | arXiv: 2605.22809v1
Authors: Jiahao Wang, Bo Sun, Yijing Bai...
Relevance:
- 🎯 Field Match: 0.85/10 - Matches: gaussian splatting
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ❌ Not mentioned
AI Summary:
Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturi...
Key Contributions:
- Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets.
- Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage.
- In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
4. SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
Score: 4.9/10 | arXiv: 2605.22658v1
Authors: Zhenyu Lu, Liupeng Li, Jinpeng Wang...
Relevance:
- 🎯 Field Match: 0.51/10 - Matches: segmentation
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ❌ Not mentioned
AI Summary:
While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc ...
Key Contributions:
- While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception.
- Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes".
- Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
5. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
Score: 4.6/10 | arXiv: 2605.22645v1
Authors: Hanjun Luo, Zhimu Huang, Sylvia Chung...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned
AI Summary:
Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prom...
Key Contributions:
- Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts.
- Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured.
- We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
6. Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
Score: 4.5/10 | arXiv: 2605.22767v1
Authors: Ganlin Feng, Yuxi Long, Erin Lou...
Relevance:
- 🎯 Field Match: 0.42/10 - Matches: computer vision
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ❌ Not mentioned
AI Summary:
Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. ...
Key Contributions:
- Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings.
- These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling.
- While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
7. From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
Score: 4.4/10 | arXiv: 2605.22671v1
Authors: Bing Hu, Zaijing Li, Rui Shao...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned
AI Summary:
Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignm...
Key Contributions:
- Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments.
- While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios.
- To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation through the learning of a temporally coherent behavioral representation.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
8. Improving Viewpoint-Invariance and Temporal Consistency for Action Detection
Score: 4.3/10 | arXiv: 2605.22695v1
Authors: Yannick Porto, Renato Martins, Thomas Chalumeau...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: None (5.0/10)
- 💻 Code: ✅ Available
AI Summary:
Viewpoint change invariance and action temporal consistency are critical aspects for the effective deployment of human action detection of untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion w...
Key Contributions:
- Viewpoint change invariance and action temporal consistency are critical aspects for the effective deployment of human action detection of untrimmed videos.
- Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows.
- This paper introduces a novel two-stage action detection approach designed to improve both view-invariance and global temporal coherence properties.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
9. Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions
Score: 4.2/10 | arXiv: 2605.22697v1
Authors: Yannick Porto, Renato Martins, Thomas Chalumeau...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: None (5.0/10)
- 💻 Code: ✅ Available
AI Summary:
Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can present important domain shifts or even unseen actions from training. In this context, improving the recognition capabilities of Zero-Shot Action Recognition models (ZSAR), without requiring strong annotation efforts, remain...
Key Contributions:
- Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can present important domain shifts or even unseen actions from training.
- In this context, improving the recognition capabilities of Zero-Shot Action Recognition models (ZSAR), without requiring strong annotation efforts, remains a central challenge.
- Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
10. Claw AI Lab: An Autonomous Multi-Agent Research Team
Score: 4.2/10 | arXiv: 2605.22662v1
Authors: Fan Wu, Cheng Chen, Zhenshan Tan...
Relevance:
- 🎯 Field Match: 0.0/10 - Matches:
- 🏆 Venue: None (5.0/10)
- 💻 Code: ✅ Available
AI Summary:
We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, arti...
Key Contributions:
- We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory.
- Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control through a unified dashboard.
- The platform also supports distinct research modes for exploration, multi-agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory-like in practice.
Links: 📄 Paper | 📥 PDF
Actions:
- ✅ Approve: Add label approved and comment "approve"
- ❌ Reject: Add label rejected and comment "reject"
- ⭐ Important: Add label starred
How to Review
- Read the summaries above
- Check paper links for more details
- Add labels to indicate your decision:
  - approved - Add to collection
  - rejected - Skip this paper
  - starred - Mark as particularly important
- Comment "approve" or "reject" to trigger automation
Note: Papers with the approved label will be automatically added to the collection.
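The review rules above can be sketched in a few lines of Python. This is only an illustration of the stated mapping, not the actual automation behind this issue: the function names, the dictionary, and the choice to ignore case and surrounding whitespace in comments are all assumptions.

```python
from typing import Iterable, Optional

# Hypothetical mapping from trigger comments to the labels named above.
COMMENT_TO_LABEL = {"approve": "approved", "reject": "rejected"}


def label_for_comment(comment: str) -> Optional[str]:
    """Return the label a review comment implies, or None if it is not a trigger.

    Assumption: matching ignores case and surrounding whitespace.
    """
    return COMMENT_TO_LABEL.get(comment.strip().lower())


def should_add_to_collection(labels: Iterable[str]) -> bool:
    """Papers carrying the approved label are added to the collection."""
    return "approved" in set(labels)
```

Under these assumptions, a comment of "approve" maps to the approved label, and any paper whose labels include approved (with or without starred) would be added to the collection.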