ICCV 2025 Schedule

SUN 19 OCT

7 a.m.

Break:

Breakfast

(ends 9:00 AM)

Registration / Badge Pickup

(ends 5:00 PM)

8 a.m.

Workshop:

Camera Calibration and Pose Estimation

(ends 12:30 PM)

Workshop:

Learning to See: Advancing Spatial Understanding for Embodied Intelligence

(ends 5:00 PM)

Workshop:

Systematic Trust in AI Models: Ensuring Fairness, Reliability, Explainability, and Accountability in Machine Learning Frameworks

(ends 12:00 PM)

Workshop:

Computer Vision for Fashion, Art, and Design: Bridging Creativity and Responsible AI

(ends 12:30 PM)

Workshop:

Workshop on Benchmarking Multi-Target Tracking: Towards Spatiotemporal Action Grounding in Videos

(ends 12:00 PM)

Workshop:

Vision-based AI for Digital Health: From Pixels to Practice

(ends 5:00 PM)

Workshop:

2nd Workshop on the Challenge Of Out-of-Label Hazard Detection in Autonomous Driving

(ends 5:00 PM)

Workshop:

Affective & Behavior Analysis in-the-wild

(ends 12:30 PM)

Workshop:

The 1st International Workshop and Challenge on Disentangled Representation Learning for Real-world Applications

(ends 12:00 PM)

Workshop:

2nd Workshop on Explainable Computer Vision: Quo Vadis?

(ends 5:00 PM)

Workshop:

Foundation Models for V2X-Based Cooperative Autonomous Driving

(ends 6:00 PM)

Workshop:

Interactive Human-centric Foundation Models

(ends 5:00 PM)

Workshop:

The 4th DataCV Workshop and Challenge

(ends 12:00 PM)

Workshop:

Authenticity & Provenance in the age of Generative AI

(ends 5:00 PM)

Tutorial:

Towards Comprehensive Reasoning in Vision-Language Models

(ends 12:00 PM)

Workshop:

3rd Workshop on Computer Vision for Automated Medical Diagnosis

(ends 6:00 PM)

8:15 a.m.

Workshop:

Workshop on Computer Vision with Single-Photon Cameras

(ends 12:00 PM)

8:30 a.m.

Workshop:

Memory and Vision

(ends 12:30 PM)

Workshop:

3rd Workshop on Vision-based InduStrial InspectiON

(ends 6:00 PM)

Workshop:

The Eighth International Workshop on Computer Vision for Physiological Measurement (CVPM)

(ends 12:00 PM)

Workshop:

Geometry-Free Novel View Synthesis and Controllable Video Models

(ends 6:00 PM)

Workshop:

Instance-Level Recognition and Generation Workshop

(ends 12:30 PM)

8:45 a.m.

Tutorial:

Beyond Self-Driving: Exploring Three Levels of Driving Automation

(ends 12:15 PM)

Workshop:

RetailVision6 - Revolutionizing the World of Retail

(ends 6:00 PM)

8:50 a.m.

Workshop:

Story-Level Movie Understanding and Audio Description

(ends 12:30 PM)

Workshop:

The Challenge of Detecting Synthetic Manipulations in ID Documents

(ends 12:10 PM)

Workshop:

Structural Priors for Vision

(ends 5:30 PM)

8:55 a.m.

Workshop:

Generative AI for Audio-Visual Content Creation

(ends 12:00 PM)

9 a.m.

Workshop:

Workshop on Biomedical Image and Signal Computing for Unbiasedness, Interpretability, and Trustworthiness

(ends 12:30 PM)

Workshop:

Multi-modal Localization and Mapping

(ends 5:00 PM)

Workshop:

Human-Interactive Generation and Editing

(ends 6:00 PM)

Workshop:

Embedded Vision Workshop

(ends 5:30 PM)

Workshop:

The 2nd AI for Visual Arts Workshop and Challenges

(ends 12:00 PM)

Workshop:

2nd Beyond Euclidean Workshop: Hyperbolic and Hyperspherical Learning for Computer Vision

(ends 6:00 PM)

Workshop:

The Third Perception Test Challenge

(ends 5:00 PM)

Tutorial:

A Tour Through AI-powered Photography and Imaging

(ends 12:30 PM)

Workshop:

Workshop on Safe and Trustworthy Multimodal AI Systems

(ends 5:00 PM)

Workshop:

Multispectral Imaging for Robotics and Automation

(ends 5:00 PM)

Workshop:

Foundation Data for Industrial Tech Transfer

(ends 12:30 PM)

Tutorial:

3D Human Motion Generation and Simulation

(ends 5:00 PM)

Workshop:

Joint Workshop on Marine Vision

(ends 6:00 PM)

Workshop:

Driving Simulation from Real-World Data: How Well Can We Render and Drive?

(ends 12:00 PM)

Tutorial:

Foundation Models in Visual Anomaly Detection: Advances, Challenges, and Applications

(ends 12:00 PM)

Workshop:

Binocular Egocentric-360 Multi-modal Scene Understanding in the Wild

(ends 12:30 PM)

Workshop:

Computer Vision for Developing Countries

(ends 5:00 PM)

Tutorial:

Learning Deep Low-Dimensional Models from High-Dimensional Data: From Theory to Practice

(ends 5:00 PM)

9:15 a.m.

Workshop:

Artificial Social Intelligence Workshop

(ends 5:30 PM)

10 a.m.

Break:

Coffee Break & Posters

(ends 11:00 AM)

noon

Break:

Lunch

(ends 1:45 PM)

1 p.m.

Workshop:

Workshop on Knowledge-Intensive Multimodal Reasoning

(ends 5:00 PM)

Workshop:

10th International Workshop on Recovering 6D Object Pose

(ends 5:00 PM)

Workshop:

Visual Quality Assessment Competition

(ends 5:00 PM)

Workshop:

Women in Computer Vision

(ends 5:00 PM)

Workshop:

Representation Learning with Very Limited Resources: When Data, Modalities, Labels, and Computing Resources are Scarce

(ends 6:00 PM)

Workshop:

The 2nd Workshop on Efficient Computing under Limited Resources: Visual Computing

(ends 5:00 PM)

Workshop:

The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection

(ends 5:00 PM)

Tutorial:

Responsible Vision-Language Generative Models

(ends 5:00 PM)

Workshop:

Comic Intelligence Quotient: Advances and Challenges in AI-driven Comic Analysis

(ends 5:00 PM)

Workshop:

Multi-Modal Foundation Models for Cancer Detection and Prevention

(ends 5:00 PM)

Workshop:

Short-Form Video Understanding: The Next Frontier in Video Intelligence

(ends 5:00 PM)

Workshop:

Workshop on Graphic Design Understanding and Generation

(ends 5:00 PM)

Workshop:

SEA: 1st workshop on Sustainability with Earth observation and AI

(ends 5:00 PM)

Workshop:

2nd Workshop and Challenge on Unlearning and Model Editing

(ends 6:00 PM)

Workshop:

End-to-End 3D Learning

(ends 5:00 PM)

Workshop:

Multimodal Continual Learning

(ends 5:00 PM)

Tutorial:

Benchmarking Egocentric Visual-Inertial SLAM at City Scale

(ends 5:00 PM)

Workshop:

Transparent & Reflective objects In the wild Challenges

(ends 5:00 PM)

Tutorial:

Foundation Models for 3D Asset Synthesis

(ends 5:00 PM)

Workshop:

Neural SLAM Workshop

(ends 5:00 PM)

1:30 p.m.

Workshop:

13th International Workshop on Assistive Computer Vision and Robotics

(ends 5:30 PM)

Workshop:

5th Workshop and Challenge on Open-World 3D Scene Understanding

(ends 6:00 PM)

2 p.m.

Workshop:

First Workshop on Skilled Activity Understanding, Assessment and Feedback Generation

(ends 6:00 PM)

3 p.m.

Break:

Coffee Break & Posters

(ends 4:00 PM)

MON 20 OCT

7 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

8 a.m.

Workshop:

Workshop on Neuromorphic Vision (NeVi): Advantages and Applications of Event Cameras

(ends 12:00 PM)

Workshop:

The 12th IEEE International Workshop on Analysis and Modeling of Faces and Gestures

(ends 5:00 PM)

Workshop:

Multi-Modal Reasoning for Agentic Intelligence

(ends 5:00 PM)

Workshop:

Closing the Loop Between Vision and Language (Decade Mark)

(ends 5:00 PM)

Workshop:

Computer Vision for Materials Science

(ends 12:30 PM)

Workshop:

Embodied Spatial Reasoning

(ends 12:00 PM)

Workshop:

Computer Vision in Advertising and Marketing

(ends 5:00 PM)

Workshop:

Workshop on Advanced Perception for Autonomous Healthcare

(ends 12:30 PM)

Workshop:

What is Next in Multimodal Foundation Models?

(ends 12:00 PM)

Workshop:

Multimodal reasoning and slow thinking in large model era: towards System 2 and beyond

(ends 5:00 PM)

Workshop:

9th AI City Challenge

(ends 5:30 PM)

Workshop:

Workshop on Curated Data for Efficient Learning

(ends 5:00 PM)

Workshop:

Foundation & Generative Models in Biometrics

(ends 12:30 PM)

Workshop:

6th Workshop on Continual Learning in Computer Vision

(ends 5:00 PM)

Workshop:

Large-scale Video Object Segmentation

(ends 12:00 PM)

Workshop:

Robust and Interactable World Models in Computer Vision

(ends 5:00 PM)

Workshop:

The Third Workshop on AI for 3D Content Creation

(ends 12:30 PM)

Workshop:

Workshop on Distillation of Foundation Models for Autonomous Driving

(ends 12:00 PM)

Workshop:

Advances in Image Manipulation Workshop and Challenges

(ends 6:00 PM)

Workshop:

Computer Vision for Biometrics, Identity & Behavior Science

(ends 5:00 PM)

Tutorial:

Fourth Hands-on Egocentric Research Tutorial with Project Aria, from Meta

(ends 12:00 PM)

Workshop:

Vision-Language Modeling in 3D Medical Imaging

(ends 5:00 PM)

Workshop:

Generative AI for Biomedical Image Analysis: Opportunities, Challenges and Futures

(ends 12:00 PM)

Workshop:

Generative AI for Storytelling

(ends 12:00 PM)

Workshop:

Ego-Exo Sensing for Smart Mobility

(ends 5:00 PM)

Workshop:

Wild3D: 3D Modeling, Reconstruction, and Generation in the Wild

(ends 5:00 PM)

Workshop:

Generating Digital Twins from Images and Videos

(ends 5:00 PM)

Workshop:

Personalization in Generative AI Workshop

(ends 5:00 PM)

Tutorial:

Foundations of Interpretable AI

(ends 12:00 PM)

Workshop:

Workshop on Cultural Continuity of Artists: Leveraging Artistic Legacies for AI-Driven Cultural Heritage

(ends 12:00 PM)

8:10 a.m.

Workshop:

Mobile Intelligent Photography and Imaging

(ends 12:10 PM)

8:15 a.m.

Workshop:

The 3rd workshop on Binary and Extreme Quantization for Computer Vision

(ends 12:00 PM)

8:25 a.m.

Workshop:

Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities

(ends 6:00 PM)

Workshop:

Anomaly Detection with Foundation Models

(ends 12:15 PM)

8:30 a.m.

Workshop:

From street to space: 3D Vision AcrosS altiTudes

(ends 12:00 PM)

Workshop:

Workshop & Competition on Computationally Optimal Gaussian Splatting

(ends 12:00 PM)

Workshop:

Fairness and Ethics in AI: facing the ChalLEnge through Model Debiasing

(ends 12:30 PM)

Workshop:

The Second Workshop on Multimodal Representation and Retrieval

(ends 12:30 PM)

Workshop:

Human-inspired Computer Vision

(ends 12:30 PM)

Workshop:

1st Workshop on Multimodal Sign Language Recognition

(ends 6:00 PM)

Workshop:

Generative Scene Completion for Immersive Worlds

(ends 12:30 PM)

8:45 a.m.

Workshop:

Findings of the ICCV

(ends 5:00 PM)

9 a.m.

Workshop:

1st Workshop and Challenge on Category-Level Object Pose Estimation for Robotic Manipulation

(ends 5:00 PM)

Tutorial:

CANCELED: From Segment Anything to Generalized Visual Grounding

(ends 12:30 PM)

Workshop:

BioImage Computing

(ends 5:00 PM)

Workshop:

2nd Workshop on Computer Vision for Ecology

(ends 5:00 PM)

Workshop:

2nd AI for Content Generation, Quality Enhancement and Streaming

(ends 5:30 PM)

10 a.m.

Break:

Coffee Break & Posters

(ends 11:00 AM)

noon

Break:

Lunch

(ends 1:45 PM)

1 p.m.

Workshop:

1st Workshop on Long Multi-Scene Video Foundations: Generation, Understanding and Evaluation

(ends 5:00 PM)

Tutorial:

RANSAC in 2025

(ends 5:00 PM)

Workshop:

Workshop on Computer Vision Systems for Document Analysis and Recognition

(ends 5:00 PM)

Workshop:

Critical Evaluation of Generative Models and their Impact on Society

(ends 5:00 PM)

Workshop:

2nd Workshop on Audio-Visual Generation and Learning

(ends 5:00 PM)

Workshop:

Multimodal Spatial Intelligence

(ends 5:00 PM)

Workshop:

PHAROS - Adaptation, Fairness, Explainability in AI Medical Imaging

(ends 6:00 PM)

Workshop:

Biometrics for Art

(ends 5:00 PM)

Workshop:

Egocentric Body Motion Tracking, Synthesis and Action Recognition

(ends 5:00 PM)

Tutorial:

Towards Safe Multi-Modal Learning: Unique Challenges and Future Directions

(ends 5:00 PM)

Workshop:

Responsible Imaging

(ends 5:00 PM)

Workshop:

TrustFM: Workshop on Trustworthy Foundation Models

(ends 6:00 PM)

Tutorial:

Foundation Models Meet Embodied Agents

(ends 5:00 PM)

Workshop:

Computer Vision in Plant Phenotyping and Agriculture

(ends 5:00 PM)

Workshop:

UniLight: Unifying Evaluation Metrics for Image-based Lighting, Relighting and Compositing

(ends 5:00 PM)

Workshop:

Large Scale Cross Device Localization

(ends 5:00 PM)

1:30 p.m.

Workshop:

Workshop on Scene Graphs and Graph Representation Learning

(ends 5:30 PM)

Workshop:

Human-Robot-Scene Interaction and Collaboration

(ends 6:00 PM)

Workshop:

Visual object tracking and segmentation challenge workshop

(ends 5:40 PM)

2 p.m.

Workshop:

International Workshop on Observing and Understanding Hands in Action

(ends 6:00 PM)

Workshop:

2nd Workshop on Scalable 3D Generation and 3D Geometric Scene Understanding

(ends 6:00 PM)

3 p.m.

Break:

Coffee Break & Posters

(ends 4:00 PM)

TUE 21 OCT

7 a.m.

Break:

Breakfast

(ends 9:00 AM)

Registration / Badge Pickup

(ends 5:00 PM)

8 a.m.

Reception:

Welcome & Awards

(ends 8:45 AM)

8:45 a.m.

Oral 1A: Multi-modal learning [8:45-10:15]

Orals 9:00-10:15

[9:00] GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space

[9:15] Scaling Laws for Native Multimodal Models

[9:30] FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

[9:45] Differentiable Room Acoustic Rendering with Multi-View Vision Priors

[10:00] Token Activation Map to Visually Explain Multimodal LLMs

(ends 10:15 AM)

Oral 1B: Structure and Motion [8:45-10:15]

Orals 9:00-10:15

[9:00] Multi-View 3D Point Tracking

[9:15] Uncalibrated Structure from Motion on a Sphere

[9:30] Removing Cost Volumes from Optical Flow Estimators

[9:45] Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

[10:00] TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

(ends 10:15 AM)

10 a.m.

Break:

Coffee Break

(ends 11:00 AM)

10:15 a.m.

Keynote:

Taking pictures and making movies of black holes

Sheperd Doeleman

(ends 11:15 AM)

11:30 a.m.

Demonstration:

Demos 1

(ends 1:30 PM)

Break:

Lunch

(ends 1:30 PM)

11:45 a.m.

Poster Session 1 & Exhibit Hall [11:45-1:45]

Posters 11:45-1:45

Secure On-Device Video OOD Detection Without Backpropagation

Learning Counterfactually Decoupled Attention for Open-World Model Attribution

Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning

Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark

SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning

One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate

X-Fusion: Introducing New Modality to Frozen Large Language Models

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

TITAN: Query-Token based Domain Adaptive Adversarial Learning

StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction

LMM-Det: Make Large Multimodal Models Excel in Object Detection

Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

CIARD: Cyclic Iterative Adversarial Robustness Distillation

Moderating the Generalization of Score-based Generative Model

Scaling Language-Free Visual Representation Learning

LLM-assisted Entropy-based Adaptive Distillation for Unsupervised Fine-grained Visual Representation Learning

InfoBridge: Balanced Multimodal Integration through Conditional Dependency Modeling

A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention

Meta-Learning Dynamic Center Distance: Hard Sample Mining for Learning with Noisy Labels

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

Multimodal Large Language Model-Guided ISP Hyperparameter Optimization with Dynamic Preference Learning

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention

Boundary Probing for Input Privacy Protection When Using LMM Services

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers

ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Dataset Distillation as Data Compression: A Rate-Utility Perspective

Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Open-set Cross Modal Generalization via Multimodal Unified Representation

Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization

Adversarial Robust Memory-Based Continual Learner

NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning

A Unified Framework to BRIDGE Complete and Incomplete Deep Multi-View Clustering under Non-IID Missing Patterns

HumorDB: Can AI understand graphical humor?

GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability

Confound from All Sides, Distill with Resilience: Multi-Objective Adversarial Paths to Zero-Shot Robustness

Mitigating Object Hallucinations via Sentence-Level Early Intervention

Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning

Unknown Text Learning for CLIP-based Few-Shot Open-set Recognition

One-Shot Knowledge Transfer for Scalable Person Re-Identification

ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning

PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection

Open-Unfairness Adversarial Mitigation for Generalized Deepfake Detection

MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

Spatial Preference Rewarding for MLLMs Spatial Understanding

EA-KD: Entropy-based Adaptive Knowledge Distillation

Structured Policy Optimization: Enhance Large Vision-Language Model via Self-referenced Dialogue

Seal Your Backdoor with Variational Defense

Semi-ViM: Bidirectional State Space Model for Mitigating Label Imbalance in Semi-Supervised Learning

CODE-CL: Conceptor-Based Gradient Projection for Deep Continual Learning

SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation

Beyond the Limits: Overcoming Negative Correlation of Activation-Based Training-Free NAS

Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning

I Am Big, You Are Little; I Am Right, You Are Wrong

Semi-supervised Deep Transfer for Regression without Domain Alignment

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

CVPT: Cross Visual Prompt Tuning

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Learning Separable Fine-Grained Representation via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition

Diffusion Guided Adaptive Augmentation for Generalization in Visual Reinforcement Learning

Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

Multi-View 3D Point Tracking

Removing Cost Volumes from Optical Flow Estimators

CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training

Fast Globally Optimal and Geometrically Consistent 3D Shape Matching

A Framework for Double-Blind Federated Adaptation of Foundation Models

Customizing Domain Adapters for Domain Generalization

Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement

Underwater Visual SLAM with Depth Uncertainty and Medium Modeling

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

AIM: Amending Inherent Interpretability via Self-Supervised Masking

Reinforcement Learning-Guided Data Selection via Redundancy Assessment

Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance

VGGSounder: Audio-Visual Evaluations for Foundation Models

EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Web Artifact Attacks Disrupt Vision Language Models

Tensor-aggregated LoRA in Federated Fine-tuning

Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark

Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

MMOne: Representing Multiple Modalities in One Scene

MM-IFEngine: Towards Multimodal Instruction Following

Knowledge Distillation with Refined Logits

CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning

Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

VisionMath: Vision-Form Mathematical Problem-Solving

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Contrastive Flow Matching

SemiVisBooster: Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance

Dataset Distillation via the Wasserstein Metric

Membership Inference Attacks with False Discovery Rate Control

Acknowledging Focus Ambiguity in Visual Questions

A Good Teacher Adapts Their Knowledge for Distillation

Evading Data Provenance in Deep Neural Networks

Boosting Adversarial Transferability via Residual Perturbation Attack

MAVias: Mitigate any Visual Bias

MPBR: Multimodal Progressive Bidirectional Reasoning for Open-Set Fine-Grained Recognition

Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product

Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning

Federated Representation Angle Learning

Federated Continual Instruction Tuning

Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities

More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Generate, Transduct, Adapt: Iterative Transduction with VLMs

BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Controlling Multimodal LLMs via Reward-guided Decoding

Improving Large Vision and Language Models by Learning from a Panel of Peers

CE-FAM: Concept-Based Explanation via Fusion of Activation Maps

Leveraging Spatial Invariance to Boost Adversarial Transferability

Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

A Tiny Change, A Giant Leap: Long-Tailed Class-Incremental Learning via Geometric Prototype Alignment

PEFTDiff: Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning

One Last Attention for Your Vision-Language Model

Forgetting Through Transforming: Enabling Federated Unlearning via Class-Aware Representation Transformation

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Geometry Distributions

FastJSMA: Accelerating Jacobian-based Saliency Map Attacks through Gradient Decoupling

Boosting Class Representation via Semantically Related Instances for Robust Long-Tailed Learning with Noisy Labels

Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information

VAGUE: Visual Contexts Clarify Ambiguous Expressions

Diffusion-based Source-biased Model for Single Domain Generalized Object Detection

PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation

BATCLIP: Bimodal Online Test-Time Adaptation for CLIP

Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Verbalized Representation Learning for Interpretable Few-Shot Generalization

UIPro: Unleashing Superior Interaction Capability For GUI Agents

Loss Functions for Predictor-based Neural Architecture Search

DiMPLe - Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics

GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations

Adversarial Training for Probabilistic Robustness

RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications

Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

Backdoor Defense via Enhanced Splitting and Trap Isolation

GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

AIRA: Activation-Informed Low-Rank Adaptation for Large Models

Social Debiasing for Fair Multi-modal LLMs

Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

LIRA: Reasoning Reconstruction via Multimodal Large Language Models

Class-Wise Federated Averaging for Efficient Personalization

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Region-based Cluster Discrimination for Visual Representation Learning

Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning

EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients

HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation

Diagnosing Pretrained Models for Out-of-distribution Detection

ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Supervised Exploratory Learning for Long-Tailed Visual Recognition

Synergistic Prompting for Robust Visual Recognition with Missing Modalities

Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability

Entropy-Adaptive Diffusion Policy Optimization with Dynamic Step Alignment

Target Bias Is All You Need: Zero-Shot Debiasing of Vision-Language Models with Bias Corpus

Joint Asymmetric Loss for Learning with Noisy Labels

FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning

Language-Driven Multi-Label Zero-Shot Learning with Semantic Granularity

ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence

ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning

Visual-RFT: Visual Reinforcement Fine-Tuning

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility

Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection

Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

Semi-supervised Concept Bottleneck Models

FRET: Feature Redundancy Elimination for Test Time Adaptation

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

A Unified Interpretation of Training-Time Out-of-Distribution Detection

Coupling the Generator with Teacher for Effective Data-Free Knowledge Distillation

FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields

Visual Modality Prompt for Adapting Vision-Language Object Detectors

What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

Prototype Guided Backdoor Defense via Activation Space Manipulation

Analyzing Finetuning Representation Shift for Multimodal LLMs Steering

Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers

Understanding Museum Exhibits using Vision-Language Reasoning

Looking in the Mirror: A Faithful Counterfactual Explanation Method for Interpreting Deep Image Classification Models

Improving Multimodal Learning via Imbalanced Learning

NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs

Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

AVAM: a Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Adversarial Robustness of Discriminative Self-Supervised Learning in Vision

Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization

TRNAS: A Training-Free Robust Neural Architecture Search

Staining and Locking Computer Vision Models Without Retraining

Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations

Federated Domain Generalization with Domain-specific Soft Prompts Generation

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Boosting Adversarial Transferability via Negative Hessian Trace Regularization

The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

What to Distill? Fast Knowledge Distillation with Adaptive Sampling

Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design

Federated Continuous Category Discovery and Learning

Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective

Flexi-FSCIL: Adaptive Knowledge Retention for Breaking the Stability-Plasticity Dilemma in Few-Shot Class-Incremental Learning

FDPT: Federated Discrete Prompt Tuning for Black-Box Visual-Language Models

PROL: Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning

Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing

Scaling Laws for Native Multimodal Models

Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Dynamic Multimodal Prototype Learning in Vision-Language Models

Visual Intention Grounding for Egocentric Assistants

Unleashing Vecset Diffusion Model for Fast Shape Generation

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Auto-Regressively Generating Multi-View Consistent Images

Towards Performance Consistency in Multi-Level Model Collaboration

Debiased Teacher for Day-to-Night Domain Adaptive Object Detection

From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion

ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning

Task-Aware Prompt Gradient Projection for Parameter-Efficient Tuning Federated Class-Incremental Learning

Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

Effective Training Data Synthesis for Improving MLLM Chart Understanding

Learnable Logit Adjustment for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch

Where, What, Why: Towards Explainable Driver Attention Prediction

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Hypergraph Clustering Network with Partial Attribute Imputation

VRM: Knowledge Distillation via Virtual Relation Matching

Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths

You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data

Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment

Learning Visual Proxy for Compositional Zero-Shot Learning

Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Evidential Knowledge Distillation

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

VALLR: Visual ASR Language Model for Lip Reading

Robust Dataset Condensation using Supervised Contrastive Learning

CaliMatch: Adaptive Calibration for Improving Safe Semi-supervised Learning

Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection

Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning

DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection

PLAN: Proactive Low-Rank Allocation for Continual Learning

Think Twice: Test-Time Reasoning for Robust CLIP Zero-Shot Classification

Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Dataset Distillation via Vision-Language Category Prototype

Scaling and Taming Adversarial Training with Synthetic Data

A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning

Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

Mitigating Catastrophic Overfitting in Fast Adversarial Training via Label Information Elimination

Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment

Chimera: Improving Generalist Model with Domain-Specific Experts

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Model

Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy

STEP-DETR: Advancing DETR-based Semi-Supervised Object Detection with Super Teacher and Pseudo-Label Guided Text Queries

Adversarial Reconstruction Feedback for Robust Fine-grained Generalization

Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information

KOEnsAttack: Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles

FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift

DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic

Dataset Ownership Verification for Pre-trained Masked Models

Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

Feature Decomposition-Recomposition in Large Vision-Language Model for Few-Shot Class-Incremental Learning

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

ConstStyle: Robust Domain Generalization with Unified Style Transformation

Adversarial Attention Perturbations for Large Object Detection Transformers

DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection

FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Trust but Verify: Programmatic VLM Evaluation in the Wild

FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

Uncalibrated Structure from Motion on a Sphere

FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Progressive Distribution Bridging: Unsupervised Adaptation for Large-scale Pre-trained Models via Adaptive Auxiliary Data

AllGCD: Leveraging All Unlabeled Data for Generalized Category Discovery

RALoc: Enhancing Outdoor LiDAR Localization via Rotation Awareness

External Knowledge Injection for CLIP-Based Class-Incremental Learning

Cooperative Pseudo Labeling for Unsupervised Federated Classification

Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes

SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

Enhancing Numerical Prediction of MLLMs with Soft Labeling

Transparent Vision: A Theory of Hierarchical Invariant Representations

Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

What You Have is What You Track: Adaptive and Robust Multimodal Tracking

Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning

SFUOD: Source-Free Unknown Object Detection

Activation Subspaces for Out-of-Distribution Detection

A Recipe for Generating 3D Worlds from a Single Image

On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

Learning to Inference Adaptively for Multimodal Large Language Models

Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification

Hierarchical Divide-and-Conquer Grouping for Classification Adaptation of Pre-Trained Models

Knowledge Transfer from Interaction Learning

DOGR: Towards Versatile Visual Document Grounding and Referring

Lark: Low-Rank Updates After Knowledge Localization for Few-shot Class-Incremental Learning

Advancing Textual Prompt Learning with Anchored Attributes

Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning

SHIFT: Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency

A Conditional Probability Framework for Compositional Zero-shot Learning

Backdooring Self-Supervised Contrastive Learning by Noisy Alignment

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Enhancing Transferability of Targeted Adversarial Examples via Inverse Target Gradient Competition and Spatial Distance Stretching

FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning

LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

Diversity-Enhanced Distribution Alignment for Dataset Distillation

Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification

Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning

Sibai: A Few-Shot Meta-Classifier for Poisoning Detection in Federated Learning

Generalization-Preserved Learning: Closing the Backdoor to Catastrophic Forgetting in Continual Deepfake Detection

Dual Domain Control via Active Learning for Remote Sensing Domain Incremental Object Detection

Gradient Extrapolation for Debiased Representation Learning

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

FedAGC: Federated Continual Learning with Asymmetric Gradient Correction

BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

FREE-Merging: Fourier Transform for Efficient Model Merging

RANKCLIP: Ranking-Consistent Language-Image Pretraining

Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling

An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval

Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

FLSeg: Enhancing Privacy and Robustness in Federated Learning under Heterogeneous Data via Model Segmentation

ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models

Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

Learning Interpretable Queries for Explainable Image Classification with Information Pursuit

ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving

SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers

Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens

SPD: Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection

To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models

Inference-Time Diffusion Model Distillation

G2D: Boosting Multimodal Learning with Gradient-Guided Distillation

Personalized Federated Learning under Local Supervision

Token Activation Map to Visually Explain Multimodal LLMs

Differentiable Room Acoustic Rendering with Multi-View Vision Priors

Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery

Prototype-based Contrastive Learning with Stage-wise Progressive Augmentation for Self-Supervised Fine-Grained Learning

Radiant Foam: Real-Time Differentiable Ray Tracing

COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition

ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Predictions

Information Density Principle for MLLM Benchmarks

Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation

Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning

Neural Architecture Search Driven by Locally Guided Diffusion for Personalized Federated Learning

Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation

PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Attention to the Burstiness in Visual Prompt Tuning!

DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Zero-Shot Vision Encoder Grafting via LLM Surrogates

Long-Tailed Classification with Multi-Granularity Semantics

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

ReTracker: Exploring Image Matching for Robust Online Any Point Tracking

HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

Backdoor Attacks on Neural Networks via One-Bit Flip

A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

Exploring the Visual Feature Space for Multimodal Neural Decoding

Joint Diffusion Models in Continual Learning

TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

SMP-Attack: Boosting the Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout

Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method

Backdoor Mitigation by Distance-Driven Detoxification

Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

Gradient Decomposition and Alignment for Incremental Object Detection

Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

CAVIS: Context-Aware Video Instance Segmentation

Unlearning the Noisy Correspondence Makes CLIP More Robust

FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection

Local Dense Logit Relations for Enhanced Knowledge Distillation

Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation

Differentially Private Fine-Tuning of Diffusion Models

FedXDS: Leveraging Model Attribution Methods to counteract Data Heterogeneity in Federated Learning

Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning

Consensus-Driven Active Model Selection

Adversarial Purification via Super-Resolution and Diffusion

FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

CMAD: Correlation-Aware and Modalities-Aware Distillation for Multimodal Sentiment Analysis with Missing Modalities

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Revelio: Interpreting and leveraging semantic information in diffusion models

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training

VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders

SplatTalk: 3D VQA with Gaussian Splatting

CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

MambaML: Exploring State Space Models for Multi-Label Image Classification

MUNBa: Machine Unlearning via Nash Bargaining

SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking

Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection

Enhancing Transformers Through Conditioned Embedded Tokens

Improved Noise Schedule for Diffusion Training

How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

One Encoder to Rule them All: Representation Learning for Model-free Visual Reinforcement Learning using Fourier Neural Operators

LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Improving Noise Efficiency in Privacy-preserving Dataset Distillation

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning

RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction

Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection

On the Robustness Tradeoff in Fine-Tuning

Understanding Flatness in Generative Models: Its Role and Benefits

(ends 1:45 PM)

1:30 p.m.

Oral 2A: View Synthesis and Scene Reconstruction [1:30-2:45]

Orals 1:45-3:00

[1:45] RayZer: A Self-supervised Large View Synthesis Model

[2:00] EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

[2:15] Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis

[2:30] Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

[2:45] SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

(ends 2:45 PM)

Oral 2B: Efficient Learning [1:30-2:30]

Orals 1:45-3:00

[1:45] Variance-Based Pruning for Accelerating and Compressing Trained Networks

[2:00] Importance-Based Token Merging for Efficient Image and Video Generation

[2:15] Knowledge Distillation for Learned Image Compression

[2:30] Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

[2:45] Heavy Labels Out! Dataset Distillation with Label Space Lightening

(ends 2:30 PM)

3 p.m.

Poster Session 2 & Exhibit Hall with Coffee Break [3:00-5:00]

Posters 3:15-5:15

HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction

Time-Aware Auto White Balance in Mobile Photography

Detect Anything 3D in the Wild

VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation

Quanta Neural Networks: From Photons to Perception

EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

Depth Any Event Stream: Enhancing Event-based Monocular Depth Estimation via Dense-to-Sparse Distillation

PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization

WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images

AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention

Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

Selective Contrastive Learning for Weakly Supervised Affordance Grounding

OpenSubstance: A High-quality Measured Dataset of Multi-View and -Lighting Images and Shapes

MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion

PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting

AllTracker: Efficient Dense Point Tracking at High Resolution

PseudoMapTrainer: Learning Online Mapping without HD Maps

LONG3R: Long Sequence Streaming 3D Reconstruction

DAMap: Distance-aware MapNet for High Quality HD Map Construction

VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding

OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance

AnnofreeOD: Detecting All Classes at Low Frame Rates Without Human Annotations

PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View

Learning 4D Embodied World Models

DAViD: Data-efficient and Accurate Vision Models from Synthetic Data

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Multi-view Gaze Target Estimation

Bayesian-Inspired Space-Time Superpixels

Multimodal LLM Guided Exploration and Active Mapping using Fisher Information

Multispectral Demosaicing via Dual Cameras

DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

Generative Modeling of Shape-Dependent Self-Contact Human Poses

Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation

PBFG: A New Physically-Based Dataset and Removal of Lens Flares and Glares

Met2Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems

GSOT3D: Towards Generic 3D Single Object Tracking in the Wild

MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding

Knowledge-Guided Part Segmentation

Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions

NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments

Beyond RGB: Adaptive Parallel Processing for RAW Object Detection

GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene

SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images

egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks

PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing

Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration

PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description

TorchAdapt: Towards Light-Agnostic Real-Time Visual Perception

Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction

CAT: A Unified Click-and-Track Framework for Realistic Tracking

Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor

GARF: Learning Generalizable 3D Reassembly for Real-World Fractures

DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching

Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation

UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI

SAC-GNC: SAmple Consensus for adaptive Graduated Non-Convexity

Performing Defocus Deblurring by Modeling its Formation Process

Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform

AstroLoc: Robust Space to Ground Image Localizer

RayZer: A Self-supervised Large View Synthesis Model

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Real3D: Towards Scaling Large Reconstruction Models with Real Images

Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching

PanSt3R: Multi-view Consistent Panoptic Segmentation

Stochastic Interpolants for Revealing Stylistic Flows across the History of Art

Is Tracking really more challenging in First Person Egocentric Vision?

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Less is More: Empowering GUI Agent with Context-Aware Simplification

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Toward Material-Agnostic System Identification from Videos

MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips

Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition

Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction

CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation

ETA: Energy-based Test-time Adaptation for Depth Completion

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

SceneMI: Motion In-betweening for Modeling Human-Scene Interaction

HORT: Monocular Hand-held Objects Reconstruction with Transformers

Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection

Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery

Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond

ArgoTweak: Towards Self-Updating HD Maps through Structured Priors

DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image

GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement

GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones

WildSAT: Learning Satellite Image Representations from Wildlife Observations

RoMo: Robust Motion Segmentation Improves Structure from Motion

Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Learning Large Motion Estimation from Intermediate Representations with a High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion

Robust Low-light Scene Restoration via Illumination Transition

CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence

CO2-Net: A Physics-Informed Spatio-Temporal Model for Global Surface CO2 Reconstruction

Zero-shot Inexact CAD Model Alignment from a Single Image

Balanced Sharpness-Aware Minimization for Imbalanced Regression

Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation

HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

Mamba-3VL: Taming State Space Model for 3D Vision Language Learning

Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer

DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation

NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal

Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation

OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection

Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations

Scaling 3D Compositional Models for Robust Classification and Pose Estimation

RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

Optical Model-Driven Sharpness Mapping for Autofocus in Small Depth-of-Field and Severe Defocus Scenarios

X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation

VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding

UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions

Prior-aware Dynamic Temporal Modeling Framework for Sequential 3D Hand Pose Estimation

Epipolar Consistent Attention Aggregation Network for Unsupervised Light Field Disparity Estimation

Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

GloPER: Unsupervised Animal Pattern Extraction from Local Reconstruction

ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives

MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

On the Generalization of Representation Uncertainty in Earth Observation

MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction

Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos

PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling

Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

Low-Light Image Enhancement using Event-Based Illumination Estimation

Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Jigsaw++: Imagining Complete Shape Priors for Object Reassembly

Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation

SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion

UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection

IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection

Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory

Head2Body: Body Pose Generation from Multi-sensory Head-mounted Inputs

Task-Decoupled Bézier Surface Constraint for Uneven Low-Light Image Enhancement

Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection

PlaneRAS: Learning Planar Primitives for 3D Plane Recovery

O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

How To Make Your Cell Tracker Say "I dunno!"

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

Simultaneous Motion And Noise Estimation with Event Cameras

EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds

Weakly-Supervised Learning of Dense Functional Correspondences

Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection

Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs

CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

Embodied Navigation with Auxiliary Task of Action Description Prediction

Open-Vocabulary Octree-Graph for 3D Scene Understanding

Learning an Implicit Physics Model for Image-based Fluid Simulation

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth

PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation

4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads

Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks

Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Spatial-Temporal Aware Visuomotor Diffusion Policy Learning

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras

3D Mesh Editing using Masked LRMs

HccePose (BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation

Information-Bottleneck Driven Binary Neural Network for Change Detection

GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting

VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition

Event-based Tiny Object Detection: A Benchmark Dataset and Baselines

PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Neural Solver of Dichromatic Reflection Model for Specular Highlight Removal

Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion

Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

Diffusion-based 3D Hand Motion Recovery with Intuitive Physics

Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Measuring the Impact of Rotation Equivariance on Aerial Object Detection

Exploring View Consistency for Scene-Adaptive Low-Light Light Field Image Enhancement

VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions

Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm

TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention

TerraMind: Large-Scale Generative Multimodality for Earth Observation

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

Heavy Labels Out! Dataset Distillation with Label Space Lightening

3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring

VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking

Unsupervised Identification of Protein Compositions and Conformations via Implicit Content-Transformation Disentanglement

Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features

Future-Aware Interaction Network For Motion Forecasting

EventUPS: Uncalibrated Photometric Stereo Using an Event Camera

PHD: Personalized 3D Human Body Fitting with Point Diffusion

Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

Language Driven Occupancy Prediction

Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function

C4D: 4D Made from 3D through Dual Correspondences

Instance-Level Video Depth in Groups Beyond Occlusions

ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

Active Perception Meets Rule-Guided RL: A Two-Phase Approach for Precise Object Navigation in Complex Environments

MonoSOWA: Scalable monocular 3D Object detector Without human Annotations

Estimating 2D Camera Motion with Hybrid Motion Basis

AgroBench: Vision-Language Model Benchmark in Agriculture

Princeton365: A Diverse Dataset with Accurate Camera Pose

H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction

From Abyssal Darkness to Blinding Glare: A Benchmark on Extreme Exposure Correction in Real World

Learning to See in the Extremely Dark

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Voyaging into Perpetual Dynamic Scenes from a Single View

PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency

ARMO: Autoregressive Rigging for Multi-Category Objects

DCHM: Depth-Consistent Human Modeling for Multiview Detection

Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence

GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge

InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

Find Any Part in 3D

Learning 3D Scene Analogies with Neural Contextual Scene Maps

GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects

Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation

M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision

Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method

Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

SpikeDiff: Zero-shot High-Quality Video Reconstruction from Chromatic Spike Camera and Sub-millisecond Spike Streams

VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting

AJAHR: Amputated Joint Aware 3D Human Mesh Recovery

Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime

EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks

A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba

Learning Normal Flow Directly From Events

Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras

OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Trial-Oriented Visual Rearrangement

6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration

Background Invariance Testing According to Semantic Proximity

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation on High-Resolution Images

Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision

RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration

PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Training-Free Generation of Temporally Consistent Rewards from VLMs

LGA-Net: Learning Local and Global Affinities for Sparse Scribble based Image Colorization

CObL: Toward Zero-Shot Ordinal Layering without User Prompting

Hierarchical Material Recognition from Local Appearance

Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization

MVGBench: a Comprehensive Benchmark for Multi-view Generation Models

Harnessing Input-Adaptive Inference for Efficient VLN

From One to More: Contextual Part Latents for 3D Generation

TopicGeo: An Efficient Unified Framework for Geolocation

Importance-Based Token Merging for Efficient Image and Video Generation

Knowledge Distillation for Learned Image Compression

MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

Revisiting Image Fusion for Multi-Illuminant White-Balance Correction

Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization

SAS: Segment Any 3D Scene with Integrated 2D Priors

Medical World Model

NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking

ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting

DuCos: Duality Constrained Depth Super-Resolution via Foundation Model

MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild

OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images

Passing the Driving Knowledge Test

Uncertainty-Aware Gradient Stabilization for Small Object Detection

Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

Towards Annotation-Free Evaluation: KPAScore for Human Keypoint Detection

4D Visual Pre-training for Robot Learning

CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy

Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes

HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery

Generative Zoo

St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

DialNav: Multi-turn Dialog Navigation with a Remote Guide

Event-guided Unified Framework for Low-light Video Enhancement, Frame Interpolation, and Deblurring

Aether: Geometric-Aware Unified World Modeling

PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image

PLMP - Point-Line Minimal Problems for Projective SfM

PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement

TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation

Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Aligning Constraint Generation with Design Intent in Parametric CAD

Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement

Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion

Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints

Environment-Agnostic Pose: Generating Environment-independent Object Representations for 6D Pose Estimation

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning

Online Dense Point Tracking with Streaming Memory

TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

Long-Context State-Space Video World Models

MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting

When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training

GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts

CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector

CWNet: Causal Wavelet Network for Low-Light Image Enhancement

MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos

Test-Time Retrieval-Augmented Adaptation for Vision-Language Models

TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

RnGCam: High-speed video from rolling & global shutter measurements

SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting

FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling

Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures

SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM

High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

Learning on the Go: A Meta-learning Object Navigation Model

PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation

Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users

HFD-Teacher: High-Frequency Depth Distillation from Depth Foundation Models for Enhanced Depth Completion

Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

Benchmarking Multimodal Large Language Models Against Image Corruptions

MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation

INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction

SITE: towards Spatial Intelligence Thorough Evaluation

PASD: A Pixel-Adaptive Swarm Dynamics Approach for Unsupervised Low-Light Image Enhancement

EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

Variance-Based Pruning for Accelerating and Compressing Trained Networks

WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering

RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation

Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models

CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image

SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing

ReCoT: Reflective Self-Correction Training for Mitigating Confirmation Bias in Large Vision-Language Models

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

GenHaze: Pioneering Controllable One-Step Realistic Haze Generation for Real-World Dehazing

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

After the Party: Navigating the Mapping From Color to Ambient Lighting

Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration

BlinkTrack: Feature Tracking over 80 FPS via Events and Images

GECO: Geometrically Consistent Embedding with Lightspeed Inference

Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians

GEOPARD: Geometric Pretraining for Articulation Prediction in 3D Shapes

EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation

Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images

FusionPhys: A Flexible Framework for Fusing Complementary Sensing Modalities in Remote Physiological Measurement

BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation

Fine-grained Spatiotemporal Grounding on Egocentric Videos

Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images

From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras

CMB-ML: A Cosmic Microwave Background Dataset for the Oldest Possible Computer Vision Task

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Test-Time Prompt Tuning for Zero-Shot Depth Completion

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry

monoVLN: Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation

Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting

Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation

BokehDiff: Neural Lens Blur with One-Step Diffusion

LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

Closed-Loop Transfer for Weakly-supervised Affordance Grounding

Combinative Matching for Geometric Shape Assembly

CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

DyGS-SLAM: Real-Time Accurate Localization and Gaussian Reconstruction for Dynamic Scenes

Teaching VLMs to Localize Specific Objects from In-context Examples

Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection

Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection

MeasureXpert: Automatic Anthropometric Measurement Extraction from Two Unregistered, Partial, Posed, and Dressed Body Scans

SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image

Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics

Shape of Motion: 4D Reconstruction from a Single Video

Amodal Depth Anything: Amodal Depth Estimation in the Wild

Training-Free Personalization via Retrieval and Reasoning on Fingerprints

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

PartField: Learning 3D Feature Fields for Part Segmentation and Beyond

RetinexMCNet: A Memory Controller Dominated Network for Low-Light Video Enhancement Based on Retinex

ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification

CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance

Removing Out-of-Focus Reflective Flares via Color Alignment

Proactive Scene Decomposition and Reconstruction

Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision

A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition

IRASim: A Fine-Grained World Model for Robot Manipulation

WalkVLM: Aid Visually Impaired People Walking by Vision Language Model

CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

LDPose: Towards Inclusive Human Pose Estimation for Limb-Deficient Individuals in the Wild

(ends 5:00 PM)

Demonstration:

Demos 2

(ends 5:00 PM)

WED 22 OCT

7:30 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

8 a.m.

Oral 3A: Foundation models and representation learning [8:00-9:15]

Orals 8:00-9:30

[8:00] RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

[8:15] Towards a Unified Copernicus Foundation Model for Earth Vision

[8:30] Learning Streaming Video Representation via Multitask Training

[8:45] LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

[9:00] Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

[9:15] GMMamba: Group Masking Mamba for Whole Slide Image Classification

(ends 9:15 AM)

Oral 3B: Human Modeling [8:00-9:15]

Orals 8:00-9:30

[8:00] NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

[8:15] MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

[8:30] HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars

[8:45] Understanding Co-speech Gestures in-the-wild

[9:00] DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

[9:15] Teeth Reconstruction and Performance Capture Using a Phone Camera

(ends 9:15 AM)

9:15 a.m.

Break:

Coffee Break

(ends 10:15 AM)

9:30 a.m.

Keynote:

On Perseverance: Virtually Unwrapping the Herculaneum Scrolls

Brent Seales

(ends 10:30 AM)

10:45 a.m.

Demonstration:

Demos 3

(ends 12:45 PM)

Poster Session 3 & Exhibit Hall [10:45-12:45]

Posters 10:45-1:15

AnimalClue: Recognizing Animals by their Traces

Error Recognition in Procedural Videos using Generalized Task Graph

MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps

LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

VIGFace: Virtual Identity Generation for Privacy-Free Face Recognition Dataset

COVTrack: Continuous Open-Vocabulary Tracking via Adaptive Multi-Cue Fusion

CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text

GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation

RoboPearls: Editable Video Simulation for Robot Manipulation

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset

Exploiting Diffusion Prior for Task-driven Image Restoration

Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition

Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling

Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

PVMamba: Parallelizing Vision Mamba via Dynamic State Aggregation

FlowStyler: Artistic Video Stylization via Transformation Fields Transports

Controllable and Expressive One-Shot Video Head Swapping

Multi-modal Multi-platform Person Re-Identification: Benchmark and Method

HERO: Human Reaction Generation from Videos

Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection

What If: Understanding Motion Through Sparse Interactions

PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models

DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration

FaceShield: Defending Facial Image against Deepfake Threats

Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

Expressive Talking Human from Single-Image with Imperfect Priors

InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians

Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition

DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance

Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models

Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin

VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack

Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising

Reverse Convolution and Its Applications to Image Restoration

MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence

Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation

DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

FreeDance: Towards Harmonic Free-Number Group Dance Generation via a Unified Framework

SILO: Solving Inverse Problems with Latent Operators

Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

X-Dancer: Expressive Music to Human Dance Video Generation

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures

Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes

AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation

Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars

AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm

PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups

TeRA: Rethinking Text-guided Realistic 3D Avatar Generation

A Unified Framework for Motion Reasoning and Generation in Human Interaction

Open-World Skill Discovery from Unsegmented Demonstration Videos

CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Deep Adaptive Unfolded Network via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening

EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes

Reference-based Super-Resolution via Image-based Retrieval-Augmented Generation Diffusion

Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation

Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

Multi-modal Identity Extraction

NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Joint Self-Supervised Video Alignment and Action Segmentation

VSSD: Vision Mamba with Non-Causal State Space Duality

EgoM2P: Egocentric Multimodal Multitask Pretraining

Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions

E-NeMF: Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes

LayerAnimate: Layer-level Control for Animation

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

HUMOTO: A 4D Dataset of Mocap Human Object Interactions

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Enhanced Pansharpening via Quaternion Spatial-Spectral Interactions

CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

UDC-VIT: A Real-World Video Dataset for Under-Display Cameras

Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

Blind Video Super-Resolution based on Implicit Kernels

F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer

Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Blind Noisy Image Deblurring Using Residual Guidance Strategy

Drawing Developmental Trajectory from Cortical Surface Reconstruction

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Less is More: Improving Motion Diffusion Models with Sparse Keyframes

DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads

VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification

Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking

Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition

Learning Hierarchical Line Buffer for Image Processing

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

TrackVerse: A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning

Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Human-Object Interaction from Human-Level Instructions

KinMo: Kinematic-aware Human Motion Understanding and Generation

Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection

Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior

MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition

Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion

EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation

PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

OneGT: One-Shot Geometry-Texture Neural Rendering for Head Avatars

MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Disentangled Clothed Avatar Generation with Layered Representation

Augmented Mass-Spring Model for Real-Time Dense Hair Simulation

Punching Bag vs. Punching Person: Motion Transferability in Videos

OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics

FaceXFormer: A Unified Transformer for Facial Analysis

ContextFace: Generating Facial Expressions from Emotional Contexts

Laboring on less labors: RPCA Paradigm for Pan-sharpening

ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer

What we need is explicit controllability: Training 3D gaze estimator using only facial images

Riemannian-Geometric Fingerprints of Generative Models

Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching

Diffusion-Based Imaginative Coordination for Bimanual Manipulation

WarpHE4D: Dense 4D Head Map toward Full Head Reconstruction

PrimHOI: Compositional Human-Object Interaction via Reusable Primitives

Continuous-Time Human Motion Field from Event Cameras

Efficient Track Anything

HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID

Multi-Object Sketch Animation by Scene Decomposition and Motion Planning

ISP2HRNet: Learning to Reconstruct High Resolution Image from Irregularly Sampled Pixels via Hierarchical Gradient Learning

LDIP: Long Distance Information Propagation for Video Super-Resolution

MBTI: Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation

Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

GameFactory: Creating New Games with Generative Interactive Videos

MOSCATO: Predicting Multiple Object State Change Through Actions

FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion

MOVE: Motion-Guided Few-Shot Video Object Segmentation

V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

EAMamba: Efficient All-Around Vision State Space Model for Image Restoration

Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Fast Image Super-Resolution via Consistency Rectified Flow

GENMO: A GENeralist Model for Human MOtion

VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching

Event-guided HDR Reconstruction with Diffusion Priors

Learning Efficient and Generalizable Human Representation with Human Gaussian Model

SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models

AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance

Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition

Enpowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need

MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Robust Adverse Weather Removal via Spectral-based Spatial Grouping

CarGait: Cross-Attention based Re-ranking for Gait recognition

Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables

WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Unsupervised Visible-Infrared Person Re-identification under Unpaired Settings

LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning

Multi-identity Human Image Animation with Structural Video Diffusion

Embodied Representation Alignment with Mirror Neurons

EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos

RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image

You Think, You ACT: The New Task of Arbitrary Text to Motion Generation

EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba

PersonaCraft: Personalized and Controllable Full-Body Multi-Human Scene Generation Using Occlusion-Aware 3D-Conditioned Diffusion

DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing

Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation

Monocular Facial Appearance Capture in the Wild

Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Skeleton Motion Words for Unsupervised Skeleton-based Temporal Action Segmentation

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation

Synthetic Video Enhances Physical Fidelity in Video Synthesis

TimeBooth: Disentangled Facial Invariant Representation for Diverse and Personalized Face Aging

DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

Identity Preserving 3D Head Stylization with Multiview Score Distillation

IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising

GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

Towards Efficient General Feature Prediction in Masked Skeleton Modeling

Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation

How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes

VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

MotionCtrl: A Real-time Controllable Vision-Language-Motion Model

Occlusion-robust Stylization for Drawing-based 3D Animation

Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors

Video Individual Counting for Moving Drones

What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

HADES: Human Avatar with Dynamic Explicit Hair Strands

FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems

ZFusion: Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior

Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars

Learning Streaming Video Representation via Multitask Training

DreamRelation: Relation-Centric Video Customization

ModSkill: Physical Character Skill Modularization

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association

Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation

Learning A Unified Template for Gait Recognition

Sliced Wasserstein Bridge for Open-Vocabulary Video Instance Segmentation

Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Synchronization of Multiple Videos

DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation

VertexRegen: Mesh Generation with Continuous Level of Detail

Multimodal Prompt Alignment for Facial Expression Recognition

Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion

Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections

Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning

PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover

Precise Action-to-Video Generation Through Visual Action Prompts

PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Consistency Trajectory Matching for One-Step Generative Super-Resolution

Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training

Neuromanifold-Regularized KANs for Shape-fair Feature Representations

Learning to Generalize without Bias for Open-Vocabulary Action Recognition

GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar

MotionFollower: Editing Video Motion via Score-Guided Diffusion

InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild

MoFRR: Mixture of Diffusion Models for Face Retouching Restoration

D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration

GAS: Generative Avatar Synthesis from a Single Image

Less Static, More Private: Towards Transferable Privacy-Preserving Action Recognition by Generative Decoupled Learning

Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

MR-FIQA: Face Image Quality Assessment with Multi-Reference Representations from Synthetic Data Generation

Text-to-Any-Skeleton Motion Generation Without Retargeting

Blind2Sound: Self-Supervised Image Denoising without Residual Noise

STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints

ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

Towards a Universal Image Degradation Model via Content-Degradation Disentanglement

Unified Multimodal Understanding via Byte-Pair Visual Encoding

IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A

AdaDCP: Learning an Adapter with Discrete Cosine Prior for Clear-to-Adverse Domain Generalization

MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration

MorphoGen: Efficient Unconditional Generation of Long-Range Projection Neuronal Morphology via a Global-to-Local Framework

Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding

Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos

Stylized-Face: A Million-level Stylized Face Dataset for Face Recognition

GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars

A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Capturing head avatar with hand contacts from a monocular video

Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images

BlueNeg: A 35mm Negative Film Dataset for Restoring Channel-Heterogeneous Deterioration

GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control

MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

Privacy-centric Deep Motion Retargeting for Anonymization of Skeleton-Based Motion Visualization

ChartCap: Mitigating Hallucination of Dense Chart Captioning

GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning

Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

GMMamba: Group Masking Mamba for Whole Slide Image Classification

Understanding Co-speech Gestures in-the-wild

Motion Synthesis with Sparse and Flexible Keyjoint Control

UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

MMAD: Multi-label Micro-Action Detection in Videos

UniRes: Universal Image Restoration for Complex Degradations

SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Gait-X: Exploring X modality for Generalized Gait Recognition

MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost

A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting

MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion

I2V3D: Controllable Image-to-video Generation with 3D Guidance

Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing

FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

General Compression Framework for Efficient Transformer Object Tracking

DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors

StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

Unlocking the Potential of Diffusion Priors in Blind Face Restoration

Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

A₀: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

DisenQ: Disentangling Q-Former for Activity-Biometrics

The Source Image is the Best Attention for Infrared and Visible Image Fusion

HUST: High-Fidelity Unbiased Skin Tone Estimation via Texture Quantization

AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Auto-Regressive Transformation for Image Alignment

Controllable Weather Synthesis and Removal with Video Diffusion Models

Sequential Gaussian Avatars with Hierarchical Motion Context

TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

T2Bs: Text-to-Character Blendshapes via Video Generation

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Unfolding-Associative Encoder-Decoder Network with Progressive Alignment for Pansharpening

AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation

MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

LOMM: Latest Object Memory Management for Temporally Consistent Video Instance Segmentation

Video Motion Graphs

Online Generic Event Boundary Detection

SD2Actor: Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation

SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation

VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction

EVDM: Event-based Real-world Video Deblurring with Mamba

From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning

Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer

Intra-modal and Cross-modal Synchronization for Audio-visual Deepfake Detection and Temporal Localization

FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Context-Aware Academic Emotion Dataset and Benchmark

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion

iManip: Skill-Incremental Learning for Robotic Manipulation

Q-Norm: Robust Representation Learning via Quality-Adaptive Normalization

Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction

MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

π-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?

Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions

SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions

RobAVA: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding

IDFace: Face Template Protection for Efficient and Secure Identification

Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

Dual-level Prototype Learning for Composite Degraded Image Restoration

BVINet: Unlocking Blind Video Inpainting with Zero Annotations

HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

I2VControl: Disentangled and Unified Video Motion Synthesis Control

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

On-Device Diffusion Transformer Policy for Efficient Robot Manipulation

Generic Event Boundary Detection via Denoising Diffusion

Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement

Multi-Modal Few-Shot Temporal Action Segmentation

SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

Unified Adversarial Augmentation for Improving Palmprint Recognition

Not All Degradations Are Equal: A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution

SHeaP: Self-supervised Head Geometry Predictor Learned via 2D Gaussians

MultiModal Action Conditioned Video Simulation

LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds

Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model

GUAVA: Generalizable Upper Body 3D Gaussian Avatar

Recognizing Actions from Robotic View for Natural Human-Robot Interaction

TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments

DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars

AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

Democratizing High-Fidelity Co-Speech Gesture Video Generation

Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration

Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generations

IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution

TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration

Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

DIMO: Diverse 3D Motion Generation for Arbitrary Objects

OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition

Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling

Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation

NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling

MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

Dense Policy: Bidirectional Autoregressive Learning of Actions

PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks

Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration

Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection

MistSense: Versatile Online Detection of Procedural and Execution Mistakes

SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting

Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions

LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables

LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion

Morph: A Motion-free Physics Optimization Framework for Human Motion Generation

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding

Temporal Rate Reduction Clustering for Human Motion Segmentation

DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing

Efficient Concertormer for Image Deblurring and Beyond

ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy

InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Face Retouching with Diffusion Data Generation and Spectral Restorement

Separation for Better Integration: Disentangling Edge and Motion in Event-based Deblurring

2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

Flow Stochastic Segmentation Networks

IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization

Towards a Unified Copernicus Foundation Model for Earth Vision

Teeth Reconstruction and Performance Capture Using a Phone Camera

(ends 12:45 PM)

11 a.m.

Break:

Lunch

(ends 1:00 PM)

Doctoral Consortium [11:00-1:00]

(ends 1:00 PM)

1 p.m.

Oral 4A: Vision + graphics [1:00-2:15]

Orals 1:15-2:30

[1:15] Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

[1:30] Generating Physically Stable and Buildable Brick Structures from Text

[1:45] WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction

[2:00] SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

[2:15] ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

(ends 2:15 PM)

Oral 4B: 3D Pose Understanding [1:00-2:15]

Orals 1:15-2:30

[1:15] Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

[1:30] Certifiably Optimal Anisotropic Rotation Averaging

[1:45] Deterministic Object Pose Confidence Region Estimation

[2:00] RePoseD: Efficient Relative Pose Estimation With Known Depth Information

[2:15] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

(ends 2:15 PM)

2:30 p.m.

Demonstration:

Demos 4

(ends 4:30 PM)

Poster Session 4 & Exhibit Hall with Coffee Break [2:30-4:00]

Posters 2:45-4:45

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models

Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal

OminiControl: Minimal and Universal Control for Diffusion Transformer

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Penalizing Boundary Activation for Object Completeness in Diffusion Models

MatchDiffusion: Training-free Generation of Match-Cuts

Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Straighten Viscous Rectified Flow via Noise Optimization

Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting

Scalable Dual Fingerprinting for Hierarchical Attribution of Text-to-Image Models

QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

CRAM: Large Scale Video Continual Learning with Bootstrapped Compression

Learning Robust Image Watermarking with Lossless Cover Recovery

Cross-Subject Mind Decoding from Inaccurate Representations

Tree-NeRV: Efficient Non-Uniform Sampling for Neural Video Representation via Tree-Structured Feature Grids

HPSv3: Towards Wide-Spectrum Human Preference Score

SummDiff: Generative Modeling of Video Summarization with Diffusion

Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

LEGO-Maker: A Semantic-Driven Algorithm for Text-to-3D Generation

ForCenNet: Foreground-Centric Network for Document Image Rectification

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Textured 3D Regenerative Morphing with 3D Diffusion Prior

Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling

Efficient Input-level Backdoor Defense on Text-to-Image Synthesis via Neuron Activation Variation

CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation

MagicColor: Multi-instance Sketch Colorization

EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

SDMatte: Grafting Diffusion Models for Interactive Matting

Adaptive Caching for Faster Video Generation with Diffusion Transformers

CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting

Edicho: Consistent Image Editing in the Wild

Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer

LUSD: Localized Update Score Distillation for Text-Guided Image Editing

FlowChef: Steering of Rectified Flow Models for Controlled Generations

DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests

Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models

Grouped Speculative Decoding for Autoregressive Image Generation

Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization

Progressive Artwork Outpainting via Latent Diffusion Models

SynTag: Enhancing the Geometric Robustness of Inversion-based Generative Image Watermarking

Text Embedding Knows How to Quantize Text-Guided Diffusion Models

Scene Graph Guided Generation: Enable Accurate Relations Generation in Text-to-Image Models via Textural Rectification

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

NeuralSVG: An Implicit Representation for Text-to-Vector Generation

IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models

DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

Anti-Tamper Protection for Unauthorized Individual Image Generation

Continual Personalization for Diffusion Models

FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning

LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

An Information-Theoretic Regularizer for Lossy Neural Image Compression

DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Timestep-Aware Diffusion Model for Extreme Image Rescaling

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Split-and-Combine: Enhancing Style Augmentation for Single Domain Generalization

Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

TryOn-Refiner: Conditional Rectified-flow-based TryOn Refiner for More Accurate Detail Reconstruction

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

Cassic: Towards Content-Adaptive State-Space Models for Learned Image Compression

FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention

Zero-Shot Depth Aware Image Editing with Diffusion Models

StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance

Global and Local Entailment Learning for Natural World Imagery

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Multi-turn Consistent Image Editing

Domain Generalizable Portrait Style Transfer

TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

Who Controls the Authorization? Invertible Networks for Copyright Protection in Text-to-Image Synthesis

SpectralAR: Spectral Autoregressive Visual Generation

From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation

The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion

Rethink Sparse Signals for Pose-guided Text-to-image Generation

Steering Guidance for Personalized Text-to-Image Diffusion Models

CAFA: a Controllable Automatic Foley Artist

M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization

ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

EditCLIP: Representation Learning for Image Editing

Magic Insert: Style-Aware Drag-and-Drop

Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

DIVE: Taming DINO for Subject-Driven Video Editing

FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process

PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask

Scalable Image Tokenization with Index Backpropagation Quantization

Enhancing Image Restoration Transformer via Adaptive Translation Equivariance

CharaConsist: Fine-Grained Consistent Character Generation

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting

Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation

TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control

Guiding Diffusion Models with Adaptive Negative Sampling Without External Resources

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

Beyond Perspective: Neural 360-Degree Video Compression

MCID: Multi-aspect Copyright Infringement Detection for Generated Images

Text2Outfit: Controllable Outfit Generation with Multimodal Language Models

Outlier-Aware Post-Training Quantization for Image Super-Resolution

SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

SEAL: Semantic Aware Image Watermarking

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

STIV: Scalable Text and Image Conditioned Video Generation

TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

D3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models

One-Step Specular Highlight Removal with Adapted Diffusion Models

ART: Adaptive Relation Tuning for Generalized Relation Prediction

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting

Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

Make Me Happier: Evoking Emotions Through Image Diffusion Models

MV-Adapter: Multi-View Consistent Image Generation Made Easy

On Large Multimodal Models as Open-World Image Classifiers

DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

DLFR-Gen: Diffusion-based Video Generation with Dynamic Latent Frame Rate

Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts

DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

Addressing Text Embedding Leakage in Diffusion-based Image Editing

JailbreakDiffBench: A Comprehensive Benchmark for Jailbreaking Diffusion Models

GReg: Geometry-Aware Region Refinement for Sign Language Video Generation

RePoseD: Efficient Relative Pose Estimation With Known Depth Information

Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Cross-Granularity Online Optimization with Masked Compensated Information for Learned Image Compression

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Deeply Supervised Flow-Based Generative Models

Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation

What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Your Text Encoder Can Be An Object-Level Watermarking Controller

ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Stable Score Distillation

KV-Edit: Training-Free Image Editing for Precise Background Preservation

Edit360: 2D Image Edits to 3D Assets from Any Angle

FlowTok: Flowing Seamlessly Across Text and Image Tokens

"Principal Components" Enable A New Language of Images

TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

FiVE-Bench: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

OmniVTON: Training-Free Universal Virtual Try-On

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Vision-Language Interactive Relation Mining for Open-Vocabulary Scene Graph Generation

YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

UnZipLoRA: Separating Content and Style from a Single Image

Generative Adversarial Diffusion

Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow

Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

Co-Painter: Fine-Grained Controllable Image Stylization via Implicit Decoupling and Adaptive Injection

AIComposer: Any Style and Content Image Composition via Feature Integration

PLA: Prompt Learning Attack against Text-to-Image Generative Models

Rethinking Layered Graphic Design Generation with a Top-Down Approach

Memory-Efficient Generative Models via Product Quantization

Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Holistic Tokenizer for Autoregressive Image Generation

Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings

Toward Better Out-painting: Improving the Image Composition with Initialization Policy Model

From Image to Video: An Empirical Study of Diffusion Representations

An Inversion-based Measure of Memorization for Diffusion Models

CAP: Evaluation of Persuasive and Creative Image Generation

Versatile Transition Generation with Image-to-Video Diffusion

Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

DiffIP: Representation Fingerprints for Robust IP Protection of Diffusion Models

FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

Spatial-Temporal Forgery Trace based Forgery Image Identification

AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild

Aligning Global Semantics and Local Textures in Generative Video Enhancement

Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection

HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Preacher: Paper-to-Video Agentic System

Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis

Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models

Spectral Image Tokenizer

VACE: All-in-One Video Creation and Editing

RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models

LOTA: Bit-Planes Guided AI-Generated Image Detection

Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Trade-offs in Image Generation: How Do Different Dimensions Interact?

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Long Context Tuning for Video Generation

DreamFuse: Adaptive Image Fusion with Diffusion Transformer

AnyI2V: Animating Any Conditional Image with Motion Control

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

PlugMark: A Plug-in Zero-Watermarking Framework for Diffusion Models

OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Balanced Image Stylization with Style Matching Score

GIViC: Generative Implicit Video Compression

Streamlining Image Editing with Layered Diffusion Brushes

StableCodec: Taming One-Step Diffusion for Extreme Image Compression

Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis

Can We Achieve Efficient Diffusion Without Self-Attention? Distilling Self-Attention into Convolutions

DADet: Safeguarding Image Conditional Diffusion Models against Adversarial and Backdoor Attacks via Diffusion Anomaly Detection

Latent Diffusion Models with Masked AutoEncoders

Imbalance in Balance: Online Concept Balancing in Generation Models

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Multi-scenario Overlapping Text Segmentation with Depth Awareness

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation

Wasserstein Style Distribution Analysis and Transform for Stylized Image Generation

Instruction-based Image Editing with Planning, Reasoning, and Generation

Discovering Divergent Representations between Text-to-Image Models

DDB: Diffusion Driven Balancing to Address Spurious Correlations

HDR Image Generation via Gain Map Decomposed Diffusion

ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning

AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention

Training-Free Text-Guided Image Editing with Visual Autoregressive Model

QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations

Accelerating Diffusion Transformer via Gradient-Optimized Cache

The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation

Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation

ArtEditor: Learning Customized Instructional Image Editor from Few-Shot Examples

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Golden Noise for Diffusion Models: A Learning Framework

Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy

Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints

Learning Few-Step Diffusion Models by Trajectory Distribution Matching

End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

A3GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting

HouseTour: A Virtual Real Estate A(I)gent

Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts

LayerD: Decomposing Raster Graphic Designs into Layers

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Beyond Blur: A Fluid Perspective on Generative Diffusion Models

Conditional Visual Autoregressive Modeling for Pathological Image Restoration

Subjective Camera 1.0: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

GlassWizard: Harvesting Diffusion Priors for Glass Surface Detection

SKALD: Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation

FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models

LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models

HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding

DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images

Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Free2Guide: Training-Free Text-to-Video Alignment using Image LVLM

Hybrid Layout Control for Diffusion Transformer: Fewer Annotations, Superior Aesthetics

InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis

VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Dual-Process Image Generation

SpecGuard: Spectral Projection-based Advanced Invisible Watermarking

DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models

Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts

GFPack++: Attention-Driven Gradient Fields for Optimizing 2D Irregular Packing

Denoising Token Prediction in Masked Autoregressive Models

DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

LACONIC: A 3D Layout Adapter for Controllable Image Creation

Preserve Anything: Controllable Image Synthesis with Object Preservation

Rethinking DPO-style Diffusion Aligning Frameworks

Certifiably Optimal Anisotropic Rotation Averaging

Generating Physically Stable and Buildable Brick Structures from Text

Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering

Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation

LLM Thought Divergence and Convergence for Dialogue-Based Image Generation Control

FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model

FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

TurboVSR: Fantastic Video Upscalers and Where to Find Them

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Improving Rectified Flow with Boundary Conditions

Decoding Correlation-Induced Misalignment in the Stable Diffusion Workflow for Text-to-Image Generation

TokensGen: Harnessing Condensed Tokens for Long Video Generation

Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models

IGD: Instructional Graphic Design with Multimodal Layer Generation

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding

CompleteMe: Reference-based Human Image Completion

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

EEGMirror: Leveraging EEG data in the wild via Montage-Agnostic Self-Supervision for EEG to Video Decoding

Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer

Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

SUV: Suppressing Undesired Video Content via Semantic Modulation Based on Text Embeddings

Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction

TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models

Semantic Discrepancy-aware Detector for Image Forgery Identification

Scalable Ranked Preference Optimization for Text-to-Image Generation

FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions

ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement

Randomized Autoregressive Visual Generation

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

FonTS: Text Rendering With Typography and Style Controls

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching

Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction

HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

TCFG: Truncated Classifier-Free Guidance for Efficient and Scalable Text-to-Image Acceleration

Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Automated Red Teaming for Text-to-Image Models through Feedback-Guided Prompt Iteration with Vision-Language Models

PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Di[M]O: Distilling Masked Diffusion Models into One-step Generator

Gain-MLP: Improving HDR Gain Map Encoding via a Lightweight MLP

TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images

Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition

Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

Video-T1: Test-time Scaling for Video Generation

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

StyleSRN: Scene Text Image Super-Resolution with Text Style Embedding

Sparse Fine-Tuning of Transformers for Generative Tasks

FlexGen: Flexible Multi-View Generation from Text and Image Inputs

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Learning Implicit Features with Flow-Infused Transformations for Realistic Virtual Try-On

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction

Region-Level Data Attribution for Text-to-Image Generative Models

Learned Image Compression with Hierarchical Progressive Context Modeling

Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Teleportraits: Training-Free People Insertion into Any Scene

DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing

Context Guided Transformer Entropy Modeling for Video Compression

Deterministic Object Pose Confidence Region Estimation

WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction

UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint

CODA: Repurposing Continuous VAEs for Discrete Tokenization

DiffDoctor: Diagnosing Image Diffusion Models Before Treating

TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

LEGION: Learning to Ground and Explain for Synthetic Image Detection

DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Bi-Level Optimization for Self-Supervised AI-Generated Face Detection

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

From Imitation to Innovation: The Emergence of AI's Unique Artistic Styles and the Challenge of Copyright Protection

AnyPortal: Zero-Shot Consistent Video Background Replacement

Neighboring Autoregressive Modeling for Efficient Visual Generation

FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment

Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting

QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing

Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation

DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models

BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks

FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation

An Empirical Study of Autoregressive Pre-training from Videos

Blended Point Cloud Diffusion for Localized Text-guided Shape Editing

Training-free Geometric Image Editing on Diffusion Models

Video Color Grading via Look-Up Table Generation

VSC: Visual Search Compositional Text-to-Image Diffusion Model

VideoAuteur: Towards Long Narrative Video Generation

Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent

Frequency-Guided Diffusion for Training-Free Text-Driven Image Translation

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Pretrained Reversible Generation as Unsupervised Visual Representation Learning

DLF: Extreme Image Compression with Dual-generative Latent Fusion

REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents

Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI

PixTalk: Controlling Photorealistic Image Processing and Editing with Language

ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement

Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification

A Unified Framework for Industrial Cel-Animation Colorization with Temporal-Structural Awareness

VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation

SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior

RAGD: Regional-Aware Diffusion Model for Text-to-Image Generation

Engage for All: Making Ordinary Image Descriptions Appealing Again!

Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

Generative Video Bi-flow

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers

T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

CopyrightShield: Enhancing Diffusion Model Security Against Copyright Infringement Attacks

Unified Video Generation via Next-Set Prediction in Continuous Domain

When and Where do Data Poisons Attack Textual Inversion?

Mobile Video Diffusion

LayerLock: Non-collapsing Representation Learning with Progressive Freezing

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

JPEG Processing Neural Operator for Backward-Compatible Coding

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

All Parts Matter: A Unified Mask-Free Virtual Try-On Framework

Function-centric Bayesian Network for Zero-Shot Object Goal Navigation

Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!

Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models

Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation

ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization

SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models

LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions

DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

MH-LVC: Multi-Hypothesis Temporal Prediction for Learned Conditional Residual Video Coding

Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation

(ends 4:00 PM)

4:45 p.m.

Meeting:

PAMI TC Meeting

(ends 5:45 PM)

6:30 p.m.

Reception:

Reception

(ends 8:00 PM)

THU 23 OCT

7:30 a.m.

Break:

Breakfast

(ends 9:00 AM)

Registration / Badge Pickup

(ends 2:00 PM)

8 a.m.

Oral 5A: Content Generation [8:00-9:15]

Orals 8:00-9:30

[8:00] LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

[8:15] MikuDance: Animating Character Art with Mixed Motion Dynamics

[8:30] Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution

[8:45] LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing

[9:00] FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

[9:15] LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

(ends 9:15 AM)

Oral 5B: Applications and evaluation [8:00-9:15]

Orals 8:00-9:30

[8:00] ROAR: Reducing Inversion Error in Generative Image Watermarking

[8:15] Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

[8:30] Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

[8:45] Counting Stacked Objects

[9:00] MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration

[9:15] Soft Local Completeness: Rethinking Completeness in XAI

(ends 9:15 AM)

9:30 a.m.

Keynote:

The efficiency of learner generated experiences

Linda B Smith

(ends 10:30 AM)

10:45 a.m.

Demonstration:

Demos 5

(ends 12:45 PM)

Poster Session 5 & Exhibit Hall [10:45-12:45]

Posters 11:15-1:15

On the Provable Importance of Gradients for Autonomous Language-Assisted Image Clustering

Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation

Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking

HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos

CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

Multi-modal Segment Anything Model for Camouflaged Scene Segmentation

OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

MRGen: Segmentation Data Engine For Underrepresented MRI Modalities

An Efficient Hybrid Vision Transformer for TinyML Applications

Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction

DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception

Understanding Personal Concept in Open-Vocabulary Semantic Segmentation

CoralSRT: Revisiting Coral Reef Semantic Segmentation by Feature Rectifying via Self-supervised Guidance

CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts

Visual Test-time Scaling for GUI Agent Grounding

Multi-Schema Proximity Network for Composed Image Retrieval

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

DiSCO-3D: Discovering and Segmenting Sub-Concepts from Open-vocabulary Queries in NeRF

ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection

Growing a Twig to Accelerate Large Vision-Language Models

Test-time Adaptation for Foundation Medical Segmentation Model Without Parametric Updates

ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Task Vector Quantization for Memory-Efficient Model Merging

M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast

Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

Moment Quantization for Video Temporal Grounding

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

ProbMED: A Probabilistic Framework for Medical Multimodal Binding

VCA: Video Curious Agent for Long Video Understanding

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation

Aligning Moments in Time using Video Queries

S⁴M: Boosting Semi-Supervised Instance Segmentation with SAM

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning

Ensemble Foreground Management for Unsupervised Object Discovery

ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation

DiffPS: Leveraging Prior Knowledge of Diffusion Model for Person Search

Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation

Two Losses, One Goal: Balancing Conflict Gradients for Semi-supervised Semantic Segmentation

Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation

Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

VideoOrion: Tokenizing Object Dynamics in Videos

Leveraging Debiased Cross-modal Attention Maps and Code-based Reasoning for Zero-shot Referring Expression Comprehension

ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing

Representation Shift: Unifying Token Compression with FlashAttention

Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation

ZipVL: Accelerating Vision-Language Models through Dynamic Token Sparsity

ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts

LaCoOT: Layer Collapse through Optimal Transport

Zero-Shot Compositional Video Learning with Coding Rate Reduction

DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup

End-to-End Multi-Modal Diffusion Mamba

Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild

G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection

Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation

Fuzzy Contrastive Decoding to Alleviate Object Hallucination in Large Vision-Language Models

BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Counting Stacked Objects

LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

NETracer: A Topology-Aware Iterative Tracing Approach for Tubular Structure Extraction

Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels

Semantic versus Identity: A Divide-and-Conquer Approach towards Adjustable Medical Image De-Identification

Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing

Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement

Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection

Factorized Learning for Temporally Grounded Video-Language Models

Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

G2PDiffusion: Cross-species Genotype-to-Phenotype Prediction via Evolutionary Diffusion

Open-ended Hierarchical Streaming Video Understanding with Vision Language Models

MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Superpowering Open-Vocabulary Object Detectors for X-ray Vision

RhythmGaussian: Repurposing Generalizable Gaussian Model For Remote Physiological Measurement

Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Stereo Any Video: Temporally Consistent Stereo Matching

Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Wave-MambaAD: Wavelet-driven State Space Model for Multi-class Unsupervised Anomaly Detection

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

Visual Textualization for Image Prompted Object Detection

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching

UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale

On the Recovery of Cameras from Fundamental Matrices

SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance

The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning

CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization

CLIPSym: Delving into Symmetry Detection with CLIP

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

Robustifying Zero-Shot Vision Language Models by Subspaces Alignment

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

OVG-HQ: Online Video Grounding with Hybrid-modal Queries

Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention

SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images

Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models

Probabilistic Prototype Calibration of Vision-language Models for Generalized Few-shot Semantic Segmentation

Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

OuroMamba: A Data-Free Quantization Framework for Vision Mamba

MixA: A Mixed Attention approach with Stable Lightweight Linear Attention to enhance Efficiency of Vision Transformers at the Edge

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

SAMora: Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images

FE-CLIP: Frequency Enhanced CLIP Model for Zero-Shot Anomaly Detection and Segmentation

Referring Expression Comprehension for Small Objects

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Principles of Visual Tokens for Efficient Video Understanding

CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Text-guided Visual Prompt DINO for Generic Segmentation

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval

Cracking Instance Jigsaw Puzzles: A Superior Alternative to Multiple Instance Learning for Whole Slide Image Analysis

STDDNet: Harnessing Mamba for Video Polyp Segmentation via Spatial-aligned Temporal Modeling and Discriminative Dynamic Representation Learning

Latent Expression Generation for Referring Image Segmentation and Grounding

Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

CogCM: Cognition-Inspired Contextual Modeling for Audio-Visual Speech Enhancement

Salvaging the Overlooked: Leveraging Class-Aware Contrastive Learning for Multi-Class Anomaly Detection

Adapting In-Domain Few-Shot Segmentation to New Domains without Source Domain Retraining

OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation

COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets

MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration

MikuDance: Animating Character Art with Mixed Motion Dynamics

FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

GaussianReg: Rapid 2D/3D Registration for Emergency Surgery via Explicit 3D Modeling with Gaussian Primitives

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Rectifying Magnitude Neglect in Linear Attention

Sparse-Dense Side-Tuner for efficient Video Temporal Grounding

MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs

CountSE: Soft Exemplar Open-set Object Counting

Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning

Unified Open-World Segmentation with Multi-Modal Prompts

DecAD: Decoupling Anomalies in Latent Space for Multi-Class Unsupervised Anomaly Detection

Few-Shot Pattern Detection via Template Matching and Regression

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement

Hallucinatory Image Tokens: A Training-free EAZY Approach to Detecting and Mitigating Object Hallucinations in LVLMs

Class Token as Proxy: Optimal Transport-assisted Proxy Learning for Weakly Supervised Semantic Segmentation

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Referring to Any Person

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

RA-BUSSeg: Relation-aware Semi-supervised Breast Ultrasound Image Segmentation via Adjacent Propagation and Cross-layer Alignment

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D

CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation

Describe Anything: Detailed Localized Image and Video Captioning

ViSpeak: Visual Instruction Feedback in Streaming Videos

Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications

VideoAds for Fast-Paced Video Understanding

Auto-Controlled Image Perception in MLLMs via Visual Perception Tokens

Keep Your Friends Close, and Your Enemies Farther: Distance-aware Voxel-wise Contrastive Learning for Semi-supervised Multi-organ Segmentation

SALAD -- Semantics-Aware Logical Anomaly Detection

Refer to Any Segmentation Mask Group With Vision-Language Prompts

From Panels to Prose: Generating Literary Narratives from Comics

When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation

MSQ: Memory-Efficient Bit Sparsification Quantization

Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation

Improving SAM for Camouflaged Object Detection via Dual Stream Adapters

Triad: Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process

Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding

MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices

DisTime: Distribution-based Time Representation for Video Large Language Models

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

WeaveSeg: Iterative Contrast-weaving and Spectral Feature-refining for Nuclei Instance Segmentation

How Can Objects Help Video-Language Understanding?

Everything is a Video: Unifying Modalities through Next-Frame Prediction

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

CARIM: Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images

Modeling Saliency Dataset Bias

ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models

HyPiDecoder: Hybrid Pixel Decoder for Efficient Segmentation and Detection

Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge

Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization

Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection

MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

Advancing Visual Large Language Model for Multi-granular Versatile Perception

Controllable Latent Space Augmentation for Digital Pathology

PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction

MIEB: Massive Image Embedding Benchmark

Temperature in Cosine-based Softmax Loss

Interpretable point cloud classification using multiple instance learning

Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation

DeFSS: Image-to-Mask Denoising Learning for Few-shot Segmentation

Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ROAR: Reducing Inversion Error in Generative Image Watermarking

Efficient Fine-Tuning of Large Models via Nested Low-Rank Adaptation

Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Object-centric Video Question Answering with Visual Grounding and Referring

Music Grounding by Short Video

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

ResidualViT for Efficient Temporally Dense Video Encoding

Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts

Progressive Test Time Energy Adaptation for Medical Image Segmentation

Implicit Counterfactual Learning for Audio-Visual Segmentation

Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment

Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes

AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model

Pseudo-SD: Pseudo Controlled Stable Diffusion for Semi-Supervised and Cross-Domain Semantic Segmentation

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions

Learning Beyond Still Frames: Scaling Vision-Language Models with Video

Is CLIP ideal? No. Can we fix it? Yes!

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Dynamic Dictionary Learning for Remote Sensing Image Segmentation

Temporal-aware Query Routing for Real-time Video Instance Segmentation

WINS: Winograd Structured Pruning for Fast Winograd Convolution

Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference

Towards Fine-grained Interactive Segmentation in Images and Videos

LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection

Learnable Retrieval Enhanced Visual-Text Alignment and Fusion for Radiology Report Generation

Generalizable Object Re-Identification via Visual In-Context Prompting

TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

Anomaly Detection of Integrated Circuits Package Substrates Using the Large Vision Model SAIC: Dataset Construction, Methodology, and Application

Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Streaming VideoLLMs for Real-Time Procedural Video Understanding

Prompt-driven Transferable Adversarial Attack on Person Re-Identification with Attribute-aware Textual Inversion

Debiased Curriculum Adaptation for Safe Transfer Learning in Chest X-ray Classification

Frequency-Dynamic Attention Modulation For Dense Prediction

Memory-Efficient 4-bit Preconditioned Stochastic Optimization

Cross-Category Subjectivity Generalization for Style-Adaptive Sketch Re-ID

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation

CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval

CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation

Aligning Effective Tokens with Video Anomaly in Large Language Models

No More Sibling Rivalry: Debiasing Human-Object Interaction Detection

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Borrowing Eyes for the Blind Spot: Overcoming Data Scarcity in Malicious Video Detection via Cross-Domain Retrieval Augmentation

Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba for End-to-end Whole Slide Image Analysis

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Sim-DETR: Unlock DETR for Temporal Sentence Grounding

ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework

DIH-CLIP: Unleashing the Diversity of Multi-Head Self-Attention for Training-Free Open-Vocabulary Semantic Segmentation

SignRep: Enhancing Self-Supervised Sign Representations

Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories

Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Task-Specific Zero-shot Quantization-Aware Training for Object Detection

Boosting Multimodal Learning via Disentangled Gradient Learning

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs

HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

LVBench: An Extreme Long Video Understanding Benchmark

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning

ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches

Similarity Memory Prior is All You Need for Medical Image Segmentation

CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor

Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model

Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

CoSMIC: Continual Self-supervised Learning for Multi-Domain Medical Imaging via Conditional Mutual Information Maximization

Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching

Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning

Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning

MSA2: Multi-task Framework with Structure-aware and Style-adaptive Character Representation for Open-set Chinese Text Recognition

GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Always Skip Attention

Training-Free Class Purification for Open-Vocabulary Semantic Segmentation

SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning

SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Breaking the Encoder Barrier for Seamless Video-Language Understanding

SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

A Token-level Text Image Foundation Model for Document Understanding

ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

Cross-Architecture Distillation Made Simple with Redundancy Suppression

HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model

DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation

FIND: Few-Shot Anomaly Inspection with Normal-Only Multi-Modal Data

VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference

KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding

PVChat: Personalized Video Chat with One-Shot Learning

Unsupervised Histopathological Image Semantic Segmentation with Overlapping Patches Consistency Constraint

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents

Adapt Foundational Segmentation Models with Heterogeneous Searching Space

VIPerson: Flexibly Generating Virtual Identity for Person Re-Identification

SpikePack: Enhanced Information Flow in Spiking Neural Networks with High Hardware Compatibility

AdsQA: Towards Advertisement Video Understanding

Towards Robustness of Person Search against Corruptions

Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts

PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology

Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations

Bringing RNNs Back to Efficient Open-Ended Video Understanding

Scheduling Weight Transitions for Quantization-Aware Training

Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization

Visual Relation Diffusion for Human-Object Interaction Detection

Flow-MIL: Constructing Highly-expressive Latent Feature Space For Whole Slide Image Classification Using Normalizing Flow

HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss

CompCap: Improving Multimodal Large Language Models with Composite Captions

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Stable Diffusion Models are Secretly Good at Visual In-Context Learning

FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance

Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking

Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss

Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation

Seeing the Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation

ViLLa: Video Reasoning Segmentation with Large Language Model

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Object-level Correlation for Few-Shot Segmentation

Vision-Language Neural Graph Featurization for Extracting Retinal Lesions

SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting

RadGPT: Constructing 3D Image-Text Tumor Datasets

SPA: Efficient User-Preference Alignment against Uncertainty in Medical Image Segmentation

CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

LangBridge: Interpreting Image as a Combination of Language Embeddings

Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation

VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing

Soft Local Completeness: Rethinking Completeness in XAI

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator

Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

ZIM: Zero-Shot Image Matting for Anything

An OpenMind for 3D Medical Vision Self-supervised Learning

Make Your Training Flexible: Towards Deployment-Efficient Video Models

OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

D-Attn: Decomposed Attention for Large Vision-and-Language Model

Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration

AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference

MINERVA: Evaluating Complex Video Reasoning

MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in Medical Image Segmentation with Learnable Prior

CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation

ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality

NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

Toward Fair and Accurate Cross-Domain Medical Image Segmentation: A VLM-Driven Active Domain Adaptation Paradigm

Incremental Few-Shot Semantic Segmentation via Multi-Level Switchable Visual Prompts

TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal

Vision-Language Models Can't See the Obvious

VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges

One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution

Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Training-Free Industrial Defect Generation with Diffusion Models

Kaputt: A Large-Scale Dataset for Visual Defect Detection

Snakes and Ladders: Two Steps Up for VideoMamba

ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models

Images as Noisy Labels: Unleashing the Potential of the Diffusion Model for Open-Vocabulary Semantic Segmentation

Auto-Vocabulary Semantic Segmentation

SIC: Similarity-Based Interpretable Image Classification with Neural Networks

TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection

Hierarchy-Aware Pseudo Word Learning with Text Adaptation for Zero-Shot Composed Image Retrieval

Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning

Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training

Agreement aware and dissimilarity oriented GLOM

Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning

C2MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Breaking Grid Constraints: Dynamic Graph Reconstruction Network for Multi-organ Segmentation

MaskSAM: Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation

Large-scale Pre-training for Grounded Video Caption Generation

Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection

Emulating Self-attention with Convolution for Efficient Image Super-Resolution

RareCLIP: Rarity-aware Online Zero-shot Industrial Anomaly Detection

MEH: A Multi-Style Dataset and Toolkit for Advancing Egyptian Hieroglyph Recognition

Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval

Unbiased Missing-modality Multimodal Learning

ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

DM-EFS: Dynamically Multiplexed Expanded Features Set Form for Robust and Efficient Small Object Detection

DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes

YOLOE: Real-Time Seeing Anything

Mixture-of-Scores: Robust Image-Text Data Valuation via Three Lines of Code

Allowing Oscillation Quantization: Overcoming Solution Space Limitation in Low Bit-Width Quantization

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

(ends 12:45 PM)

11 a.m.

Break:

Lunch

(ends 1:00 PM)

1 p.m.

Oral 6A: Physical Scene Perception [1:00-2:15]

Orals 1:15-2:30

[1:15] SuperDec: 3D Scene Decomposition with Superquadrics Primitives

[1:30] Diffusion Image Prior

[1:45] Spatially-Varying Autofocus

[2:00] Towards Foundational Models for Single-Chip Radar

[2:15] Event-based Visual Vibrometry

(ends 2:15 PM)

Oral 6B: Segmentation and grouping [1:00-2:15]

Orals 1:15-2:30

[1:15] CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

[1:30] E-SAM: Training-Free Segment Every Entity Model

[1:45] Online Reasoning Video Segmentation with Just-in-Time Digital Twins

[2:00] Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

[2:15] ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

(ends 2:15 PM)

2:30 p.m.

Poster Session 6 & Exhibit Hall with Coffee Break [2:30-4:30]

Posters 2:30-4:45

Stealthy Backdoor Attack in Federated Learning via Adaptive Layer-wise Gradient Alignment

X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

Inverse Image-Based Rendering for Light Field Generation from Single Images

HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration

CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds

AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

Axis-level Symmetry Detection with Group-Equivariant Representation

STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene

Monocular Semantic Scene Completion via Masked Recurrent Networks

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Bolt3D: Generating 3D Scenes in Seconds

PossLoss: A Reliable and Sensitive Facial Landmark Detection Loss Function

Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency

Semantic Causality-Aware Vision-Based 3D Occupancy Prediction

U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging

DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization

OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters

SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

LookOut: Real-World Humanoid Egocentric Navigation

PointGAC: Geometric-Aware Codebook for Masked Point Modeling

Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

PRM: Photometric Stereo based Large Reconstruction Model

4D Gaussian Splatting SLAM

BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors

Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Discretized Gaussian Representation for Tomographic Reconstruction

SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors

Dual-S3D: Hierarchical Dual-Path Selective SSM-CNN for High-Fidelity Implicit Reconstruction

FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction

Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction

RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints

Gaussian Splatting with Discretized SDF for Relightable Assets

MMGeo: Multimodal Compositional Geo-Localization for UAVs

AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes

Large Scene Generation with Cube-Absorb Discrete Diffusion

SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration

Benchmarking Egocentric Visual-Inertial SLAM at City Scale

Robust Unfolding Network for HDR Imaging with Modulo Cameras

Neural Shell Texture Splatting: More Details and Fewer Primitives

Gaussian-based World Model: Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

A Real-world Display Inverse Rendering Dataset

DAA*: Deep Angular A Star for Image-based Path Planning

Neural Compression for 3D Geometry Sets

Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion

MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency

GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion

GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views

GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization

REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

Towards Safer and Understandable Driver Intention Prediction

Revisiting Point Cloud Completion: Are We Ready For The Real-World?

V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals

EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting

NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction

Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

Spatially-Varying Autofocus

Event-based Visual Vibrometry

Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting

Lifting the Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling

Global Regulation and Excitation via Attention Tuning for Stereo Matching

Global-Aware Monocular Semantic Scene Completion with State Space Models

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction

HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder

RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

GAP: Gaussianize Any Point Clouds with Text Guidance

Semantic-guided Camera Ray Regression for Visual Localization

SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting

Polarimetric Neural Field via Unified Complex-Valued Wave Representation

High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach

OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS

VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

AutoScape: Geometry-Consistent Long-Horizon Scene Generation

From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos

Street Gaussians without 3D Object Tracker

RogSplat: Robust Gaussian Splatting via Generative Priors

HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity

RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Scene Coordinate Reconstruction Priors

Diff2I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior

Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations

GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments

I2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation

Serialization based Point Cloud Oversegmentation

StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams

RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction

Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Online Language Splatting

Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors

Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge

Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting

MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy

CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving

A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields

Free-running vs Synchronous: Single-Photon Lidar for High-flux 3D Imaging

Mitigating Geometric Degradation in Fast DownSampling via FastAdapter for Point Cloud Segmentation

Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising

ArchiSet: Benchmarking Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios

ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

Discontinuity-aware Normal Integration for Generic Central Camera Models

Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

SEHDR: Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing

Omni-scene Perception-oriented Point Cloud Geometry Enhancement for Coordinate Quantization

SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation

TARS: Traffic-Aware Radar Scene Flow Estimation

DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection

FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation

Leaps and Bounds: An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction

GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion

InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior

From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning

Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning

OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching

DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model

EDM: Efficient Deep Feature Matching

Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

SuperDec: 3D Scene Decomposition with Superquadrics Primitives

CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors

NeRF Is a Valuable Assistant for 3D Gaussian Splatting

UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images

Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction

TOTP: Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion

AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields

MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model

7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting

StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting

MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling

DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

TurboReg: TurboClique for Robust and Efficient Point Cloud Registration

ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

Efficient Spiking Point Mamba for Point Cloud Analysis

SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images

Splat-based 3D Scene Reconstruction with Extreme Motion-blur

Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves

GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections

PolarAnything: Diffusion-based Polarimetric Image Synthesis

LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP

ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models

MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction

RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning

Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models

Occupancy Learning with Spatiotemporal Memory

Towards Open-World Generation of Stereo Images and Unsupervised Matching

DMesh++: An Efficient Differentiable Mesh for Complex Shapes

TAD-E2E: A Large-scale End-to-end Autonomous Driving Dataset

LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering

Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection

Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds

S³E: Self-Supervised State Estimation for Radar-Inertial System

MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model

VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

3D Test-time Adaptation via Graph Spectral Driven Point Shift

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping

MiDSummer: Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation

AG2aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing

Hierarchical 3D Scene Graphs Construction Outdoors

Spatio-Spectral Pattern Illumination for Direct and Indirect Separation from a Single Hyperspectral Image

SDFormer: Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer

Adversarial Exploitation of Data Diversity Improves Visual Localization

SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations

GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting

GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-based Transformer

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

Tile-wise vs. Image-wise: Random-Tile Loss and Training Paradigm for Gaussian Splatting

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Explaining Human Preferences via Metrics for Structured 3D Reconstruction

CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving

RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

Inverse 3D Microscopy Rendering for Cell Shape Inference with Active Mesh

E-SAM: Training-Free Segment Every Entity Model

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

GaussRender: Learning 3D Occupancy with Gaussian Rendering

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors

LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization

CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching

ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

Supercharging Floorplan Localization with Semantic Rays

RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Planar Affine Rectification from Local Change of Scale and Orientation

ERNet: Efficient Non-Rigid Registration Network for Point Sequences

SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling

CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection

Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

Epona: Autoregressive Diffusion World Model for Autonomous Driving

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

MCOP: Multi-UAV Collaborative Occupancy Prediction

Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

Leveraging Local Patch Alignment to Seam-cutting for Large Parallax Image Stitching

InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions

Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training

IntrinsicControlNet: Cross-distribution Image Generation with Real and Unreal

SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

WIPES: Wavelet-based Visual Primitives

DiffPCI: Large Motion Point Cloud frame Interpolation with Diffusion Model

UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis

ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching

RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

Thermal Polarimetric Multi-view Stereo

StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories

AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion

MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography

Sparfels: Fast Reconstruction from Sparse Unposed Imagery

GenFlow3D: Generative Scene Flow Estimation and Prediction on Point Cloud Sequences

Self-Supervised Sparse Sensor Fusion for Long Range Perception

Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception

Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

TransiT: Transient Transformer for Non-line-of-sight Videography

Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues

Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion

RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

SynCity: Training-Free Generation of 3D Cities

RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes

Tree Skeletonization from 3D Point Clouds by Denoising Diffusion

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner

Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping

Purge-Gate: Efficient Backpropagation-Free Test-Time Adaptation for Point Clouds via Token purging

AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video

UNIS: A Unified Framework for Achieving Unbiased Neural Implicit Surfaces in Volume Rendering

BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns

Towards Foundational Models for Single-Chip Radar

Diffusion Image Prior

FlowR: Flowing from Sparse to Dense 3D Reconstructions

WorldScore: Unified Evaluation Benchmark for World Generation

Perspective-Invariant 3D Object Detection

CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection

LightSwitch: Multi-view Relighting with Material-guided Diffusion

Decoupled Diffusion Sparks Adaptive Scene Generation

Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior

ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving

SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model

QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

EYE3: Turn Anything into Naked-eye 3D

NATRA: Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations

SP2T: Sparse Proxy Attention for Dual-stream Point Transformer

Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting

CF3: Compact and Fast 3D Feature Fields

When Anchors Meet Cold Diffusion: A Multi-Stage Approach to Lane Detection

2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

Learning Neural Scene Representation from iToF Imaging

No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues

Faster and Better 3D Splatting via Group Training

Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Recovering Parametric Scenes from Very Few Time-of-Flight Pixels

NeuFrameQ: Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation

FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging

RTMap: Real-Time Recursive Mapping with Change Detection and Localization

CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization

Controllable 3D Outdoor Scene Generation via Scene Graphs

CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance

PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Driving View Synthesis on Free-form Trajectories with Generative Prior

ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery

SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding

Constraint-Aware Feature Learning for Parametric Point Cloud

PixelStitch: Structure-Preserving Pixel-Wise Bidirectional Warps for Unsupervised Image Stitching

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

Correspondence-Free Fast and Robust Spherical Point Pattern Registration

NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement

ZeroStereo: Zero-shot Stereo Matching from Single Images

CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

Stochastic Gradient Estimation for Higher-Order Differentiable Rendering

M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking

Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation

CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging

TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking

Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives

Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes

VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data

χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement

Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception

Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds

Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction

ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration

Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding

Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching

R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception

V2XScenes: A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception

mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework

SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection

ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity

Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures

Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View

Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency

SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

SAM4D: Segment Anything in Camera and LiDAR Streams

StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models

Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data

LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences

DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning

Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing

World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception

GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination

Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction

VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions

TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views

A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

Extrapolated Urban View Synthesis Benchmark

Heatmap Regression without Soft-Argmax for Facial Landmark Detection

Demeter: A Parametric Model of Crop Plant Morphology from the Real World

3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control

Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation

ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment

FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images

Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering

HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes

AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction

Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving

BANet: Bilateral Aggregation Network for Mobile Stereo Matching

Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions

Authentic 4D Driving Simulation with a Video Generation Model

DONUT: A Decoder-Only Model for Trajectory Prediction

Lidar Waveforms are Worth 40x128x33 Words

Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation

ContraGS: Codebook-Condensed and Trainable Gaussian Splatting for Fast, Memory-Efficient Reconstruction

UAVScenes: A Multi-Modal Dataset for UAVs

PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction

Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering

Wide2Long: Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation

LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array

IM360: Large-scale Indoor Mapping with 360 Cameras

Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering

CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering

Φ-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection

Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis

Super Resolved Imaging with Adaptive Optics

HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network

(ends 4:30 PM)

Demonstration:

Demos 6

(ends 4:30 PM)