📚 Paper Review - 2026-06-01

# 📚 Daily Paper Review - 2026-06-01

Found **10** relevant papers today. Please review and approve/reject.

---

## 1. City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

**Score:** `5.8/10` | **arXiv:** [2605.30310v1](http://arxiv.org/abs/2605.30310v1)

**Authors:** Sayan Paul, Sourav Ghosh, Siddharth Katageri...

**Relevance:**
- 🎯 Field Match: 2.03/10 - Matches: gaussian splatting, 3d reconstruction, nerf
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, etc. often fail to recover 3D meshes ready for simulation due to incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-s...

**Key Contributions:**
- City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes.
- Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, etc. often fail to recover 3D meshes ready for simulation due to incomplete or missing geometry and irregular, noisy surfaces.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30310v1) | [📥 PDF](https://arxiv.org/pdf/2605.30310v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 2. Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

**Score:** `5.7/10` | **arXiv:** [2605.30342v1](http://arxiv.org/abs/2605.30342v1)

**Authors:** Shangjie Xue, Jesse Dill, Dhruv Ahuja...

**Relevance:**
- 🎯 Field Match: 1.69/10 - Matches: 3d gaussian, gaussian splatting
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility ...

**Key Contributions:**
- We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS.
- Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS.
- To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics.
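
To make the idea of a direction-dependent visibility value stored as spherical-harmonic coefficients concrete, here is a minimal sketch in plain Python. It is not the GAVIS implementation: the degree-1 truncation, the sign convention (no Condon-Shortley phase), the clamping to [0, 1], and the names `sh_basis_deg1` and `visibility` are all assumptions made for illustration.

```python
import math

# Normalization constants of the real spherical harmonics up to degree 1.
C0 = 0.5 * math.sqrt(1.0 / math.pi)     # Y_0^0
C1 = math.sqrt(3.0 / (4.0 * math.pi))   # shared factor of the three degree-1 terms


def sh_basis_deg1(unit_dir):
    """Real SH basis (degree <= 1) for a unit direction (x, y, z).

    Assumed ordering: [Y_0^0, Y_1^-1 ~ y, Y_1^0 ~ z, Y_1^1 ~ x].
    """
    x, y, z = unit_dir
    return [C0, C1 * y, C1 * z, C1 * x]


def visibility(coeffs, direction):
    """Evaluate a hypothetical per-particle visibility along `direction`.

    `coeffs` holds four SH coefficients; the result is clamped to [0, 1].
    """
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit_dir = tuple(v / norm for v in direction)
    value = sum(c * b for c, b in zip(coeffs, sh_basis_deg1(unit_dir)))
    return min(1.0, max(0.0, value))


# Hypothetical particle: constant term 0.5 plus a lobe toward +z,
# i.e. it was mostly observed from the +z side.
coeffs = [0.5 / C0, 0.0, 0.5 / C1, 0.0]
print(visibility(coeffs, (0.0, 0.0, 1.0)))    # observed side
print(visibility(coeffs, (0.0, 0.0, -1.0)))   # unobserved side
print(visibility(coeffs, (1.0, 0.0, 0.0)))    # grazing direction
```

A low value along a candidate viewing direction would flag that side of the particle as poorly covered by the training views, which is the kind of signal an active-mapping planner could use.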

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30342v1) | [📥 PDF](https://arxiv.org/pdf/2605.30342v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 3. Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

**Score:** `5.3/10` | **arXiv:** [2605.30231v1](http://arxiv.org/abs/2605.30231v1)

**Authors:** Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometr...

**Key Contributions:**
- Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning.
- Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome.
- In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30231v1) | [📥 PDF](https://arxiv.org/pdf/2605.30231v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 4. Archon: A Unified Multimodal Model for Holistic Digital Human Generation

**Score:** `4.9/10` | **arXiv:** [2605.30311v1](http://arxiv.org/abs/2605.30311v1)

**Authors:** Chong Bao, Shichen Liu, Lijun Yu...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autore...

**Key Contributions:**
- Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge.
- In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation.
- Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30311v1) | [📥 PDF](https://arxiv.org/pdf/2605.30311v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 5. Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

**Score:** `4.8/10` | **arXiv:** [2605.30353v1](http://arxiv.org/abs/2605.30353v1)

**Authors:** Nhat-Minh Nguyen

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ✅ Available

**AI Summary:**
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously b...

**Key Contributions:**
- Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX.
- We documented and classified 15 supervision events by intervention level.
- The agent resolved ten autonomously by iterating against oracle tests.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30353v1) | [📥 PDF](https://arxiv.org/pdf/2605.30353v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 6. Benchmarking Single-Factor Physical Video-to-Audio Generation

**Score:** `4.5/10` | **arXiv:** [2605.30339v1](http://arxiv.org/abs/2605.30339v1)

**Authors:** Tingle Li, Siddharth Gururani, Kevin J. Shih...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled coun...

**Key Contributions:**
- Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes.
- Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions.
- In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30339v1) | [📥 PDF](https://arxiv.org/pdf/2605.30339v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 7. Grounded 3D-Aware Spatial Vision-Language Modeling

**Score:** `4.4/10` | **arXiv:** [2605.30307v1](http://arxiv.org/abs/2605.30307v1)

**Authors:** An-Chieh Cheng, Yang Fu, Yatai Ji...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: CVPR (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to refere...

**Key Contributions:**
- We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework.
- GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses.
- In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30307v1) | [📥 PDF](https://arxiv.org/pdf/2605.30307v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 8. iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

**Score:** `4.4/10` | **arXiv:** [2605.30179v1](http://arxiv.org/abs/2605.30179v1)

**Authors:** Yang Song, Yixuan Zhang, Lingfa Meng...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-cond...

**Key Contributions:**
- Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels.
- We introduce iLoRA.
- To our knowledge, it is the first Bayesian graph-conditioned LoRA framework.
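
For reviewers unfamiliar with the "static low-rank update" that the summary contrasts iLoRA against, here is a minimal sketch of the standard LoRA weight update, W = W0 + (alpha / r) · B · A. It covers only the baseline, not iLoRA's graph-conditioned variant, whose details are not given in this digest. The helper names `matmul` and `lora_weight` are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the adapter rank.

    w0: d x k frozen weight, b: d x r, a: r x k.
    The update is the same for every input, which is what "static" means here.
    """
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


# Toy example: 2 x 2 frozen weight with a rank-1 adapter.
w0 = [[1, 0], [0, 1]]
b = [[1], [2]]      # 2 x 1
a = [[3, 4]]        # 1 x 2
print(lora_weight(w0, b, a, alpha=1))
```

Only `b` and `a` would be trained; `w0` stays frozen, so the number of trainable parameters scales with the rank rather than with the full weight matrix.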

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30179v1) | [📥 PDF](https://arxiv.org/pdf/2605.30179v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 9. MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

**Score:** `4.4/10` | **arXiv:** [2605.30295v1](http://arxiv.org/abs/2605.30295v1)

**Authors:** Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic H...

**Key Contributions:**
- Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited.
- Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems.
- We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems.
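
As a rough picture of what an HL7 FHIR R4 bundle looks like, the sketch below assembles a tiny `Bundle` with one `Patient` and one `Condition` resource as plain JSON. It is not the paper's pipeline: the function name `make_bundle`, the identifiers, and the choice of the `collection` bundle type are assumptions for illustration, and a real bundle would carry coded terminology rather than free text alone.

```python
import json


def make_bundle(patient_id, condition_text):
    """Build a minimal FHIR-style Bundle dict (illustrative, not validated)."""
    patient = {"resourceType": "Patient", "id": patient_id}
    condition = {
        "resourceType": "Condition",
        "id": f"{patient_id}-cond-1",
        # The Condition points back to the patient it describes.
        "subject": {"reference": f"Patient/{patient_id}"},
        # Free-text diagnosis; coded entries (e.g. SNOMED CT) are omitted here.
        "code": {"text": condition_text},
    }
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": patient}, {"resource": condition}],
    }


bundle = make_bundle("example-1", "Community-acquired pneumonia")
print(json.dumps(bundle, indent=2))
```

The structured references between resources are what distinguish this input format from the unstructured case text that existing benchmarks typically use.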

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30295v1) | [📥 PDF](https://arxiv.org/pdf/2605.30295v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---

## 10. Do Language Models Track Entities Across State Changes?

**Score:** `4.3/10` | **arXiv:** [2605.30233v1](http://arxiv.org/abs/2605.30233v1)

**Authors:** Zilu Tang, Qiao Zhao, Gabriel Franco...

**Relevance:**
- 🎯 Field Match: 0.0/10 - Matches: none
- 🏆 Venue: ICML (10/10)
- 💻 Code: ❌ Not mentioned

**AI Summary:**
Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding *without* state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, ...

**Key Contributions:**
- Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning.
- An increasing amount of work investigates how transformer language models (LMs) solve entity binding *without* state changes.
- However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language.

**Links:** [📄 Paper](http://arxiv.org/abs/2605.30233v1) | [📥 PDF](https://arxiv.org/pdf/2605.30233v1)

**Actions:**
- ✅ Approve: Add label `approved` and comment "approve"
- ❌ Reject: Add label `rejected` and comment "reject"
- ⭐ Important: Add label `starred`

---


## How to Review

1. Read the summaries above
2. Check paper links for more details
3. Add labels to indicate your decision:
   - `approved` - Add to collection
   - `rejected` - Skip this paper
   - `starred` - Mark as particularly important
4. Comment "approve" or "reject" to trigger automation

**Note:** Papers with `approved` label will be automatically added to the collection.
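
The mapping from review comments to labels described above can be sketched as follows. The actual automation behind this issue is not shown here, so the function name `label_for_comment` and the matching rules (case-insensitive, surrounding whitespace ignored) are assumptions; only the comment keywords and label names come from the instructions above.

```python
# Comment keywords and the labels they correspond to, per the steps above.
COMMENT_TO_LABEL = {"approve": "approved", "reject": "rejected"}


def label_for_comment(comment):
    """Return the label a review comment maps to, or None if it is not a decision.

    Assumption: matching ignores case and surrounding whitespace.
    """
    return COMMENT_TO_LABEL.get(comment.strip().lower())


def should_add_to_collection(labels):
    """Papers carrying the `approved` label are added to the collection."""
    return "approved" in labels


print(label_for_comment("Approve"))
print(label_for_comment("looks interesting"))
print(should_add_to_collection(["approved", "starred"]))
```

The `starred` label has no comment keyword in the instructions, so it is treated here as a label-only marker.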


📚 Paper Review - 2026-06-01 #250