📚 Paper Review - 2026-05-22 #240

# 📚 Daily Paper Review - 2026-05-22

Found **10** relevant papers today. Please review and approve/reject.

---

## 1. iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

**Score:** `5.6/10` | **arXiv:** [2605.21431v1](http://arxiv.org/abs/2605.21431v1)

**Authors:** Jun Zheng, Zhengze Xu, Mengting Chen...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interac...

**Key Contributions:**
- Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one.
- While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments.
- This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21431v1) | [📥 PDF](https://arxiv.org/pdf/2605.21431v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 2. RoadTones: Tone Controllable Text Generation from Road Event Videos

**Score:** `5.1/10` | **arXiv:** [2605.21411v1](http://arxiv.org/abs/2605.21411v1)

**Authors:** Chirag Parikh, Siddhi Pravin Lipare, Ravi Kiran Sarvadevabhatla

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation...

**Key Contributions:**
- Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style.
- This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy.
- To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21411v1) | [📥 PDF](https://arxiv.org/pdf/2605.21411v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 3. Deformba: Vision State Space Model with Adaptive State Fusion

**Score:** `4.8/10` | **arXiv:** [2605.21308v1](http://arxiv.org/abs/2605.21308v1)

**Authors:** Hongyu Ke, Jack Morris, Yongkang Liu...

**Relevance:**
- 🎯 Field Match: 0.51/10 - Matches: segmentation
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined ge...

**Key Contributions:**
- State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities.
- However, their application to vision tasks remains challenging.
- First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity.
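
To make "fixed scanning" concrete, the sketch below flattens a grid of image patches into a sequence in two hand-picked orders. It is only an illustration: the row-major and column-major orders and both function names are assumptions, not necessarily the scans used by the models this paper discusses.

```python
def row_major_scan(height, width):
    """Visit patch coordinates left to right, then top to bottom."""
    return [(r, c) for r in range(height) for c in range(width)]


def column_major_scan(height, width):
    """Visit patch coordinates top to bottom, then left to right."""
    return [(r, c) for c in range(width) for r in range(height)]


# A 2x3 grid of patches, flattened both ways.
rows_first = row_major_scan(2, 3)
cols_first = column_major_scan(2, 3)

# Patches (0, 0) and (1, 0) touch in the image, but the row-major
# sequence places them `width` positions apart.
gap = rows_first.index((1, 0)) - rows_first.index((0, 0))  # 3
```

Either order hard-codes which spatial neighbours end up adjacent in the sequence, which is the kind of predefined geometric structure the bullet above refers to.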

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21308v1) | [📥 PDF](https://arxiv.org/pdf/2605.21308v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 4. Divide and Contrast: Learning Robust Temporal Features without Augmentation

**Score:** `4.8/10` | **arXiv:** [2605.21241v1](http://arxiv.org/abs/2605.21241v1)

**Authors:** Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor

**Relevance:**
- 🎯 Field Match: 0.68/10 - Matches: self-supervised
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Self-supervised learning for time-series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics. In this work, we introduce Divide and Contrast (Di-COT), an unsupervised framework that avoids data augmentation and mult...

**Key Contributions:**
- Self-supervised learning for time-series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics.
- In this work, we introduce Divide and Contrast (Di-COT), an unsupervised framework that avoids data augmentation and multiple encoder passes by contrasting informative substructures within a window rather than individual timesteps.
- Di-COT stochastically partitions each window into a small number of overlapping sub-blocks per iteration, enabling efficient and meaningful contrast while mitigating false positives during temporal transitions.
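
As a rough illustration of the partitioning idea in the last bullet, the sketch below splits a window of timesteps into a few overlapping sub-blocks at random. It is not the authors' implementation: the function name `random_overlapping_blocks`, the `overlap` parameter, and the way cut points are drawn are all assumptions made for illustration.

```python
import random


def random_overlapping_blocks(window_len, num_blocks, overlap, rng=None):
    """Split range(window_len) into `num_blocks` overlapping (start, end) ranges.

    Hypothetical helper: cut points are drawn uniformly at random, then each
    block is widened by `overlap` steps on both sides and clipped to the window.
    """
    rng = rng or random.Random()
    if not 1 <= num_blocks <= window_len:
        raise ValueError("num_blocks must be between 1 and window_len")
    # Draw num_blocks - 1 distinct interior cut points, sorted left to right.
    cuts = sorted(rng.sample(range(1, window_len), num_blocks - 1))
    bounds = [0] + cuts + [window_len]
    blocks = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Widen each block into its neighbours so adjacent blocks share steps.
        blocks.append((max(0, start - overlap), min(window_len, end + overlap)))
    return blocks


# Calling this once per training iteration gives a fresh partition each time.
blocks = random_overlapping_blocks(window_len=32, num_blocks=4, overlap=2,
                                   rng=random.Random(0))
```

Each `(start, end)` pair is a half-open index range into the window; how the sub-blocks are then encoded and contrasted is left out here because the summary does not describe it.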

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21241v1) | [📥 PDF](https://arxiv.org/pdf/2605.21241v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 5. Is Fixing Schema Graphs Necessary? Full-Resolution Graph Structure Learning for Relational Deep Learning

**Score:** `4.6/10` | **arXiv:** [2605.21475v1](http://arxiv.org/abs/2605.21475v1)

**Authors:** Yi Huang, Qingyun Sun, Jia Li...

**Relevance:**
- 🎯 Field Match: 0.42/10 - Matches: deep learning
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Relational prediction tasks are fundamental in many real-world applications, where data are naturally stored in relational databases (RDBs). Relational Deep Learning (RDL) addresses this problem by modeling RDBs as graphs and applying graph neural networks (GNNs) for end-to-end learning. However, the full-resolution property is commonly adopted as a design principle in graph construction for RDBs ...

**Key Contributions:**
- Relational prediction tasks are fundamental in many real-world applications, where data are naturally stored in relational databases (RDBs).
- Relational Deep Learning (RDL) addresses this problem by modeling RDBs as graphs and applying graph neural networks (GNNs) for end-to-end learning.
- However, the full-resolution property is commonly adopted as a design principle in graph construction for RDBs to preserve relational semantics, which leads most existing methods to rely on fixed graph structures.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21475v1) | [📥 PDF](https://arxiv.org/pdf/2605.21475v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 6. RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

**Score:** `4.6/10` | **arXiv:** [2605.21237v1](http://arxiv.org/abs/2605.21237v1)

**Authors:** Xuan Yang, Xiaohan Yuan, Hao Li...

**Relevance:**
- 🎯 Field Match: 0.76/10 - Matches: cardiac
- 🏆 Venue: MICCAI (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth th...

**Key Contributions:**
- Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases.
- Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence.
- Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21237v1) | [📥 PDF](https://arxiv.org/pdf/2605.21237v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 7. OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

**Score:** `4.4/10` | **arXiv:** [2605.21343v1](http://arxiv.org/abs/2605.21343v1)

**Authors:** Ziye Li, Henghui Ding

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often ...

**Key Contributions:**
- Recent layout-to-image models have achieved remarkable progress in spatial controllability.
- However, they still struggle with inter-object occlusion.
- When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21343v1) | [📥 PDF](https://arxiv.org/pdf/2605.21343v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 8. Let EEG Models Learn EEG

**Score:** `4.4/10` | **arXiv:** [2605.21280v1](http://arxiv.org/abs/2605.21280v1)

**Authors:** Yifan Wang, Yijia Ma, Wen Li...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often strug...

**Key Contributions:**
- High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling.
- Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity.
- As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21280v1) | [📥 PDF](https://arxiv.org/pdf/2605.21280v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 9. Data-Efficient Neural Operator Training via Physics-Based Active Learning

**Score:** `4.2/10` | **arXiv:** [2605.21348v1](http://arxiv.org/abs/2605.21348v1)

**Authors:** Alicja Polanska, Lorenzo Zanisi, Vignesh Gopakumar...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICLR (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements. Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner. We introduce physics-based acquisition - a novel physics-informed active learning algorithm that l...

**Key Contributions:**
- Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements.
- Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner.
- We introduce physics-based acquisition - a novel physics-informed active learning algorithm that leverages the partial differential equation residual to guide data selection.
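
The acquisition idea in the last bullet can be pictured as ranking unlabeled candidate inputs by how badly the current surrogate violates the governing equation, then sending the worst offenders to the expensive solver. The sketch below is a toy under stated assumptions, not the paper's algorithm: `select_by_residual`, the candidate list, and the stand-in residual function are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_by_residual(candidates: Sequence[T],
                       residual: Callable[[T], float],
                       budget: int) -> List[T]:
    """Return the `budget` candidates with the largest absolute residual."""
    ranked = sorted(candidates, key=lambda c: abs(residual(c)), reverse=True)
    return ranked[:budget]


def toy_residual(x: float) -> float:
    # Stand-in for a PDE residual (assumption): zero at x = 1, growing
    # linearly as x moves away from it.
    return 2.0 * x - 2.0


pool = [0.0, 0.5, 1.0, 1.5, 3.0]
chosen = select_by_residual(pool, toy_residual, budget=2)  # [3.0, 0.0]
```

In a real loop the chosen inputs would be labeled by the numerical solver, added to the training set, and the residuals recomputed with the retrained operator before the next round.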

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21348v1) | [📥 PDF](https://arxiv.org/pdf/2605.21348v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 10. Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

**Score:** `4.2/10` | **arXiv:** [2605.21470v1](http://arxiv.org/abs/2605.21470v1)

**Authors:** Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We...

**Key Contributions:**
- Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser.
- Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use.
- We present agent just-in-time (JIT) compilation, an alternative that compiles task descriptions directly into executable code that is free to include LLM calls, tool calls, and parallelization.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.21470v1) | [📥 PDF](https://arxiv.org/pdf/2605.21470v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---


## How to Review

1. Read the summaries above
2. Check paper links for more details
3. Add labels to indicate your decision:
   - `approved` - Add to collection
   - `rejected` - Skip this paper
   - `starred` - Mark as particularly important
4. Comment "approve" or "reject" to trigger automation

**Note:** Papers with the `approved` label are added to the collection automatically.

