ICCV 2025 Papers

Real3D: Towards Scaling Large Reconstruction Models with Real Images

Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

ContraGS: Codebook-Condensed and Trainable Gaussian Splatting for Fast, Memory-Efficient Reconstruction

ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

VALLR: Visual ASR Language Model for Lip Reading

FREE-Merging: Fourier Transform for Efficient Model Merging

Chimera: Improving Generalist Model with Domain-Specific Experts

Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Model

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling

PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology

SAS: Segment Any 3D Scene with Integrated 2D Priors

GloPER: Unsupervised Animal Pattern Extraction from Local Reconstruction

StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing

Token Activation Map to Visually Explain Multimodal LLMs

Probabilistic Prototype Calibration of Vision-language Models for Generalized Few-shot Semantic Segmentation

Learning Interpretable Queries for Explainable Image Classification with Information Pursuit

Long-Tailed Classification with Multi-Granularity Semantics

VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges

Auto-Regressive Transformation for Image Alignment

LMM-Det: Make Large Multimodal Models Excel in Object Detection

Attention to the Burstiness in Visual Prompt Tuning!

Diffusion-Based Imaginative Coordination for Bimanual Manipulation

Learning Neural Scene Representation from iToF Imaging

ChartCap: Mitigating Hallucination of Dense Chart Captioning

MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

Weakly-Supervised Learning of Dense Functional Correspondences

PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Staining and Locking Computer Vision Models Without Retraining

Test-Time Prompt Tuning for Zero-Shot Depth Completion

TITAN: Query-Token based Domain Adaptive Adversarial Learning

Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation

StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Motion Synthesis with Sparse and Flexible Keyjoint Control

StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

D-Attn: Decomposed Attention for Large Vision-and-Language Model

InfoBridge: Balanced Multimodal Integration through Conditional Dependency Modeling

Closed-Loop Transfer for Weakly-supervised Affordance Grounding

Exploring View Consistency for Scene-Adaptive Low-Light Light Field Image Enhancement

Neuromanifold-Regularized KANs for Shape-fair Feature Representations

Training-Free Class Purification for Open-Vocabulary Semantic Segmentation

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts

Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

MonoSOWA: Scalable monocular 3D Object detector Without human Annotations

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval

Revisiting Point Cloud Completion: Are We Ready For The Real-World?

Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions

UnZipLoRA: Separating Content and Style from a Single Image

Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

Learning Visual Proxy for Compositional Zero-Shot Learning

SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Principles of Visual Tokens for Efficient Video Understanding

DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

Progressive Artwork Outpainting via Latent Diffusion Models

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

CObL: Toward Zero-Shot Ordinal Layering without User Prompting

Hierarchical Material Recognition from Local Appearance

Event-guided Unified Framework for Low-light Video Enhancement, Frame Interpolation, and Deblurring

DisenQ: Disentangling Q-Former for Activity-Biometrics

PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement

SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

GIViC: Generative Implicit Video Compression

Aligning Moments in Time using Video Queries

ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

Streamlining Image Editing with Layered Diffusion Brushes

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

GECO: Geometrically Consistent Embedding with Lightspeed Inference

Removing Cost Volumes from Optical Flow Estimators

HouseTour: A Virtual Real Estate A(I)gent

Scheduling Weight Transitions for Quantization-Aware Training

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Event-based Visual Vibrometry

Mobile Video Diffusion

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising

Multi-modal Identity Extraction

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

FaceXFormer: A Unified Transformer for Facial Analysis

Laboring on less labors: RPCA Paradigm for Pan-sharpening

Riemannian-Geometric Fingerprints of Generative Models

LDIP: Long Distance Information Propagation for Video Super-Resolution

Multi-identity Human Image Animation with Structural Video Diffusion

FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning

LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification

Meta-Learning Dynamic Center Distance: Hard Sample Mining for Learning with Noisy Labels

FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model

Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

DM-EFS: Dynamically Multiplexed Expanded Features Set Form for Robust and Efficient Small Object Detection

Inverse Image-Based Rendering for Light Field Generation from Single Images

PossLoss: A Reliable and Sensitive Facial Landmark Detection Loss Function

DAA*: Deep Angular A Star for Image-based Path Planning

Pseudo-SD: Pseudo Controlled Stable Diffusion for Semi-Supervised and Cross-Domain Semantic Segmentation

Frequency-Dynamic Attention Modulation For Dense Prediction

Memory-Efficient 4-bit Preconditioned Stochastic Optimization

Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability

ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives

Target Bias Is All You Need: Zero-Shot Debiasing of Vision-Language Models with Bias Corpus

Long-Context State-Space Video World Models

PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Fine-grained Spatiotemporal Grounding on Egocentric Videos

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Improving SAM for Camouflaged Object Detection via Dual Stream Adapters

FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields

Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement

Trust but Verify: Programmatic VLM Evaluation in the Wild

MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Task-Aware Prompt Gradient Projection for Parameter-Efficient Tuning Federated Class-Incremental Learning

Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection

Multimodal LLM Guided Exploration and Active Mapping using Fisher Information

Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation

CLIPSym: Delving into Symmetry Detection with CLIP

UDC-VIT: A Real-World Video Dataset for Under-Display Cameras

ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment

Temperature in Cosine-based Softmax Loss

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Understanding Museum Exhibits using Vision-Language Reasoning

Neural Solver of Dichromatic Reflection Model for Specular Highlight Removal

Correspondence-Free Fast and Robust Spherical Point Pattern Registration

SILO: Solving Inverse Problems with Latent Operators

Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning

SD2Actor: Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation

Imbalance in Balance: Online Concept Balancing in Generation Models

Progressive Distribution Bridging: Unsupervised Adaptation for Large-scale Pre-trained Models via Adaptive Auxiliary Data

Efficient Concertormer for Image Deblurring and Beyond

TryOn-Refiner: Conditional Rectified-flow-based TryOn Refiner for More Accurate Detail Reconstruction

ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy

IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras

One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

LLM Thought Divergence and Convergence for Dialogue-Based Image Generation Control

RALoc: Enhancing Outdoor LiDAR Localization via Rotation Awareness

Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning

HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation

Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

Spatial Preference Rewarding for MLLMs Spatial Understanding

Generative Zoo

Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment

Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity

St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

Describe Anything: Detailed Localized Image and Video Captioning

Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion

Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework

AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model

MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost

InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling

Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

Effective Training Data Synthesis for Improving MLLM Chart Understanding

SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Improving Rectified Flow with Boundary Conditions

Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing

TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning

DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection

Scaling and Taming Adversarial Training with Synthetic Data

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Generative Adversarial Diffusion

Music Grounding by Short Video

Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment

Your Text Encoder Can Be An Object-Level Watermarking Controller

Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation

PlugMark: A Plug-in Zero-Watermarking Framework for Diffusion Models

GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination

Diff2I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior

EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba

CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling

Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information

KOEnsAttack: Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles

CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation

LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

PanSt3R: Multi-view Consistent Panoptic Segmentation

Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints

GARF: Learning Generalizable 3D Reassembly for Real-World Fractures

PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Where, What, Why: Towards Explainable Driver Attention Prediction

TRNAS: A Training-Free Robust Neural Architecture Search

Dynamic Multimodal Prototype Learning in Vision-Language Models

CAP: Evaluation of Persuasive and Creative Image Generation

SummDiff: Generative Modeling of Video Summarization with Diffusion

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams

Unleashing Vecset Diffusion Model for Fast Shape Generation

Auto-Regressively Generating Multi-View Consistent Images

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection

Towards Performance Consistency in Multi-Level Model Collaboration

Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests

FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling

DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Understanding Personal Concept in Open-Vocabulary Semantic Segmentation

DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization

ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting

Video-T1: Test-time Scaling for Video Generation

Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Discovering Divergent Representations between Text-to-Image Models

VRM: Knowledge Distillation via Virtual Relation Matching

SKALD: Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation

CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

LGA-Net: Learning Local and Global Affinities for Sparse Scribble based Image Colorization

Backdoor Attacks on Neural Networks via One-Bit Flip

Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths

O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures

HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model

Semi-supervised Concept Bottleneck Models

Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images

DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic

WINS: Winograd Structured Pruning for Fast Winograd Convolution

ART: Adaptive Relation Tuning for Generalized Relation Prediction

Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

Factorized Learning for Temporally Grounded Video-Language Models

FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

External Knowledge Injection for CLIP-Based Class-Incremental Learning

Cooperative Pseudo Labeling for Unsupervised Federated Classification

DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Activation Subspaces for Out-of-Distribution Detection

PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

Differentially Private Fine-Tuning of Diffusion Models

IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark

Multi-turn Consistent Image Editing

A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention

CAFA: a Controllable Automatic Foley Artist

Unknown Text Learning for CLIP-based Few-Shot Open-set Recognition

Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

Personalized Federated Learning under Local Supervision

Multi-View 3D Point Tracking

Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation

Learning Separable Fine-Grained Representation via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition

PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization

PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation

Language-Driven Multi-Label Zero-Shot Learning with Semantic Granularity

Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence

Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations

Learning an Implicit Physics Model for Image-based Fluid Simulation

Less is More: Empowering GUI Agent with Context-Aware Simplification

Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

IM360: Large-scale Indoor Mapping with 360 Cameras

EventUPS: Uncalibrated Photometric Stereo Using an Event Camera

Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration

Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function

Harnessing Input-Adaptive Inference for Efficient VLN

When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack

SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

PersonaCraft: Personalized and Controllable Full-Body Multi-Human Scene Generation Using Occlusion-Aware 3D-Conditioned Diffusion

MotionFollower: Editing Video Motion via Score-Guided Diffusion

Online Generic Event Boundary Detection

A Recipe for Generating 3D Worlds from a Single Image

Guiding Diffusion Models with Adaptive Negative Sampling Without External Resources

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

DLFR-Gen: Diffusion-based Video Generation with Dynamic Latent Frame Rate

DiffDoctor: Diagnosing Image Diffusion Models Before Treating

GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Video Motion Graphs

Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation

Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning

Hallucinatory Image Tokens: A Training-free EAZY Approach to Detecting and Mitigating Object Hallucinations in LVLMs

Keep Your Friends Close, and Your Enemies Farther: Distance-aware Voxel-wise Contrastive Learning for Semi-supervised Multi-organ Segmentation

Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

TransiT: Transient Transformer for Non-line-of-sight Videography

LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association

Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting

On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

Learning to Inference Adaptively for Multimodal Large Language Models

Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification

Hierarchical Divide-and-Conquer Grouping for Classification Adaptation of Pre-Trained Models

Lark: Low-Rank Updates After Knowledge Localization for Few-shot Class-Incremental Learning

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

A Conditional Probability Framework for Compositional Zero-shot Learning

Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning

BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

RANKCLIP: Ranking-Consistent Language-Image Pretraining

An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval

ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models

Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving

SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers

To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models

Uncalibrated Structure from Motion on a Sphere

Prototype-based Contrastive Learning with Stage-wise Progressive Augmentation for Self-Supervised Fine-Grained Learning

Radiant Foam: Real-Time Differentiable Ray Tracing

COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition

Information Density Principle for MLLM Benchmarks

ReTracker: Exploring Image Matching for Robust Online Any Point Tracking

Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation

Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy

Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning

Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation

Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks

Differentiable Room Acoustic Rendering with Multi-View Vision Priors

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

SplatTalk: 3D VQA with Gaussian Splatting

Joint Diffusion Models in Continual Learning

GT-Loc: Unifying When and Where in Images through a Joint Embedding Space

TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Multimodal Large Language Model-Guided ISP Hyperparameter Optimization with Dynamic Preference Learning

VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers

MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics

FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection

CAVIS: Context-Aware Video Instance Segmentation

Adversarial Purification via Super-Resolution and Diffusion

FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

CMAD: Correlation-Aware and Modalities-Aware Distillation for Multimodal Sentiment Analysis with Missing Modalities

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Revelio: Interpreting and leveraging semantic information in diffusion models

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training

Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

MUNBa: Machine Unlearning via Nash Bargaining

Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection

Improved Noise Schedule for Diffusion Training

Secure On-Device Video OOD Detection Without Backpropagation

Learning Counterfactually Decoupled Attention for Open-World Model Attribution

Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning

Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation

WIPES: Wavelet-based Visual Primitives

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning

Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

One Encoder to Rule them All: Representation Learning for Model-free Visual Reinforcement Learning using Fourier Neural Operators

Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate

X-Fusion: Introducing New Modality to Frozen Large Language Models

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction

Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration

LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

CIARD: Cyclic Iterative Adversarial Robustness Distillation

MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning

MambaML: Exploring State Space Models for Multi-Label Image Classification

Moderating the Generalization of Score-based Generative Model

Scaling Language-Free Visual Representation Learning

Improving Noise Efficiency in Privacy-preserving Dataset Distillation

LLM-assisted Entropy-based Adaptive Distillation for Unsupervised Fine-grained Visual Representation Learning

DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection

On the Robustness Tradeoff in Fine-Tuning

Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention

Boundary Probing for Input Privacy Protection When Using LMM Services

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction

Dataset Distillation as Data Compression: A Rate-Utility Perspective

Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Open-set Cross Modal Generalization via Multimodal Unified Representation

Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization

Adversarial Robust Memory-Based Continual Learner

NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning

A Unified Framework to BRIDGE Complete and Incomplete Deep Multi-View Clustering under Non-IID Missing Patterns

HumorDB: Can AI understand graphical humor?

GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability

Ensemble Foreground Management for Unsupervised Object Discovery

Detect Anything 3D in the Wild

Confound from All Sides, Distill with Resilience: Multi-Objective Adversarial Paths to Zero-Shot Robustness

VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

Mitigating Object Hallucinations via Sentence-Level Early Intervention

Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning

One-Shot Knowledge Transfer for Scalable Person Re-Identification

ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning

PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection

Open-Unfairness Adversarial Mitigation for Generalized Deepfake Detection

EA-KD: Entropy-based Adaptive Knowledge Distillation

Structured Policy Optimization: Enhance Large Vision-Language Model via Self-referenced Dialogue

Seal Your Backdoor with Variational Defense

Semi-ViM: Bidirectional State Space Model for Mitigating Label Imbalance in Semi-Supervised Learning

Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning

Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

CODE-CL: Conceptor-Based Gradient Projection for Deep Continual Learning

SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation

Beyond the Limits: Overcoming Negative Correlation of Activation-Based Training-Free NAS

Diffusion Guided Adaptive Augmentation for Generalization in Visual Reinforcement Learning

I Am Big, You Are Little; I Am Right, You Are Wrong

Semi-supervised Deep Transfer for Regression without Domain Alignment

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Fast Globally Optimal and Geometrically Consistent 3D Shape Matching

A Framework for Double-Blind Federated Adaptation of Foundation Models

VGGSounder: Audio-Visual Evaluations for Foundation Models

EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Web Artifact Attacks Disrupt Vision Language Models

Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark

Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

MMOne: Representing Multiple Modalities in One Scene

MM-IFEngine: Towards Multimodal Instruction Following

RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning

VisionMath: Vision-Form Mathematical Problem-Solving

Dataset Distillation via the Wasserstein Metric

A Good Teacher Adapts Their Knowledge for Distillation

Quanta Neural Networks: From Photons to Perception

AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention

SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

Depth Any Event Stream: Enhancing Event-based Monocular Depth Estimation via Dense-to-Sparse Distillation

Evading Data Provenance in Deep Neural Networks

WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images

AllTracker: Efficient Dense Point Tracking at High Resolution

Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

MPBR: Multimodal Progressive Bidirectional Reasoning for Open-Set Fine-Grained Recognition

MAVias: Mitigate any Visual Bias

OpenSubstance: A High-quality Measured Dataset of Multi-View and -Lighting Images and Shapes

DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product

PseudoMapTrainer: Learning Online Mapping without HD Maps

LONG3R: Long Sequence Streaming 3D Reconstruction

VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding

AnnofreeOD: Detecting All Classes at Low Frame Rates Without Human Annotations

Federated Continual Instruction Tuning

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Generate, Transduct, Adapt: Iterative Transduction with VLMs

BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Controlling Multimodal LLMs via Reward-guided Decoding

Improving Large Vision and Language Models by Learning from a Panel of Peers

CE-FAM: Concept-Based Explanation via Fusion of Activation Maps

PEFTDiff: Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning

Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Verbalized Representation Learning for Interpretable Few-Shot Generalization

RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications

Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

Class-Wise Federated Averaging for Efficient Personalization

Multi-view Gaze Target Estimation

EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients

ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Human-Object Interaction from Human-Level Instructions

FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning

ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Visual-RFT: Visual Reinforcement Fine-Tuning

Enhancing Transformers Through Conditioned Embedded Tokens

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility

Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Visual Modality Prompt for Adapting Vision-Language Object Detectors

What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

Prototype Guided Backdoor Defense via Activation Space Manipulation

RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction

Analyzing Finetuning Representation Shift for Multimodal LLMs Steering

Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers

VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs

Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

AVAM: a Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

What to Distill? Fast Knowledge Distillation with Adaptive Sampling

Flexi-FSCIL: Adaptive Knowledge Retention for Breaking the Stability-Plasticity Dilemma in Few-Shot Class-Incremental Learning

Multispectral Demosaicing via Dual Cameras

Generative Modeling of Shape-Dependent Self-Contact Human Poses

Met2Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems

TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions

Beyond RGB: Adaptive Parallel Processing for RAW Object Detection

egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks

PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor

TorchAdapt: Towards Light-Agnostic Real-Time Visual Perception

Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction

Boosting Class Representation via Semantically Related Instances for Robust Long-Tailed Learning with Noisy Labels

CAT: A Unified Click-and-Track Framework for Realistic Tracking

DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion

Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design

DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching

SAC-GNC: SAmple Consensus for adaptive Graduated Non-Convexity

AstroLoc: Robust Space to Ground Image Localizer

Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Stochastic Interpolants for Revealing Stylistic Flows across the History of Art

Is Tracking really more challenging in First Person Egocentric Vision?

VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Toward Material-Agnostic System Identification from Videos

MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips

Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction

ETA: Energy-based Test-time Adaptation for Depth Completion

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

SceneMI: Motion In-betweening for Modeling Human-Scene Interaction

DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image

GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones

RoMo: Robust Motion Segmentation Improves Structure from Motion

Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Learning Large Motion Estimation from Intermediate Representations with a High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion

Robust Low-light Scene Restoration via Illumination Transition

Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning

CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy

MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence

Zero-shot Inexact CAD Model Alignment from a Single Image

HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer

Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection

DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation

NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal

Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation

OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection

Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations

Scaling 3D Compositional Models for Robust Classification and Pose Estimation

RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

DAMap: Distance-aware MapNet for High Quality HD Map Construction

X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation

Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array

VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding

Prior-aware Dynamic Temporal Modeling Framework for Sequential 3D Hand Pose Estimation

Epipolar Consistent Attention Aggregation Network for Unsupervised Light Field Disparity Estimation

ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

On the Generalization of Representation Uncertainty in Earth Observation

Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Jigsaw++: Imagining Complete Shape Priors for Object Reassembly

Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation

SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion

A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks

IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection

PlaneRAS: Learning Planar Primitives for 3D Plane Recovery

FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

Simultaneous Motion And Noise Estimation with Event Cameras

Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs

CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth

4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads

Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks

HccePose (BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation

GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting

PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Enhancing Image Restoration Transformer via Adaptive Translation Equivariance

Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions

Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

Understanding Flatness in Generative Models: Its Role and Benefits

Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints

3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking

PHD: Personalized 3D Human Body Fitting with Point Diffusion

Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

Language Driven Occupancy Prediction

C4D: 4D Made from 3D through Dual Correspondences

ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

Estimating 2D Camera Motion with Hybrid Motion Basis

AgroBench: Vision-Language Model Benchmark in Agriculture

Princeton365: A Diverse Dataset with Accurate Camera Pose

H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction

After the Party: Navigating the Mapping From Color to Ambient Lighting

From Abyssal Darkness to Blinding Glare: A Benchmark on Extreme Exposure Correction in Real World

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Voyaging into Perpetual Dynamic Scenes from a Single View

Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge

TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras

CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

Find Any Part in 3D

Learning 3D Scene Analogies with Neural Contextual Scene Maps

GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects

Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

SpikeDiff: Zero-shot High-Quality Video Reconstruction from Chromatic Spike Camera and Sub-millisecond Spike Streams

VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting

AJAHR: Amputated Joint Aware 3D Human Mesh Recovery

EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks

A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba

Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration

Background Invariance Testing According to Semantic Proximity

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation on High-Resolution Images

Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision

RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration

PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

Training-Free Generation of Temporally Consistent Rewards from VLMs

MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization

TopicGeo: An Efficient Unified Framework for Geolocation

ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

Revisiting Image Fusion for Multi-Illuminant White-Balance Correction

Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization

Medical World Model

NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

DuCos: Duality Constrained Depth Super-Resolution via Foundation Model

MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild

Passing the Driving Knowledge Test

Uncertainty-Aware Gradient Stabilization for Small Object Detection

Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

4D Visual Pre-training for Robot Learning

Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy

Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes

HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery

DialNav: Multi-turn Dialog Navigation with a Remote Guide

TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement

Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Environment-Agnostic Pose: Generating Environment-independent Object Representations for 6D Pose Estimation

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

Online Dense Point Tracking with Streaming Memory

MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting

GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts

CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector

Test-Time Retrieval-Augmented Adaptation for Vision-Language Models

RnGCam: High-speed video from rolling & global shutter measurements

Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos

Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures

Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection

Learning on the Go: A Meta-learning Object Navigation Model

Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users

MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation

ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction

WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration

Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering

Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models

CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image

ReCoT: Reflective Self-Correction Training for Mitigating Confirmation Bias in Large Vision-Language Models

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

GenHaze: Pioneering Controllable One-Step Realistic Haze Generation for Real-World Dehazing

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians

GEOPARD: Geometric Pretraining for Articulation Prediction in 3D Shapes

Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images

LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

Combinative Matching for Geometric Shape Assembly

CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

DyGS-SLAM: Real-Time Accurate Localization and Gaussian Reconstruction for Dynamic Scenes

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

Teaching VLMs to Localize Specific Objects from In-context Examples

SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image

Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics

Training-Free Personalization via Retrieval and Reasoning on Fingerprints

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

PartField: Learning 3D Feature Fields for Part Segmentation and Beyond

MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps

Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification

PASD: A Pixel-Adaptive Swarm Dynamics Approach for Unsupervised Low-Light Image Enhancement

CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance

Proactive Scene Decomposition and Reconstruction

Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes

EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision

A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition

IRASim: A Fine-Grained World Model for Robot Manipulation

WalkVLM: Aid Visually Impaired People Walking by Vision Language Model

Error Recognition in Procedural Videos using Generalized Task Graph

VIGFace: Virtual Identity Generation for Privacy-Free Face Recognition Dataset

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text

RoboPearls: Editable Video Simulation for Robot Manipulation

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Multi-modal Multi-platform Person Re-Identification: Benchmark and Method

Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection

What If: Understanding Motion Through Sparse Interactions

PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement

UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models

RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration

FaceShield: Defending Facial Image against Deepfake Threats

Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

Expressive Talking Human from Single-Image with Imperfect Priors

Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition

Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models

Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin

Reverse Convolution and Its Applications to Image Restoration

MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence

Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians

X-Dancer: Expressive Music to Human Dance Video Generation

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars

AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm

PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups

TeRA: Rethinking Text-guided Realistic 3D Avatar Generation

A Unified Framework for Motion Reasoning and Generation in Human Interaction

Open-World Skill Discovery from Unsegmented Demonstration Videos

Deep Adaptive Unfolded Network via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening

EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

Reference-based Super-Resolution via Image-based Retrieval-Augmented Generation Diffusion

Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

EgoM2P: Egocentric Multimodal Multitask Pretraining

E-NeMF: Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes

HUMOTO: A 4D Dataset of Mocap Human Object Interactions

CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

CharaConsist: Fine-Grained Consistent Character Generation

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer

Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Blind Noisy Image Deblurring Using Residual Guidance Strategy

Drawing Developmental Trajectory from Cortical Surface Reconstruction

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Less is More: Improving Motion Diffusion Models with Sparse Keyframes

DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads

VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification

Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

TrackVerse: A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning

Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior

Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition

MBTI: Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation

Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion

PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

Disentangled Clothed Avatar Generation with Layered Representation

Augmented Mass-Spring Model for Real-Time Dense Hair Simulation

Punching Bag vs. Punching Person: Motion Transferability in Videos

G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

WarpHE4D: Dense 4D Head Map toward Full Head Reconstruction

PrimHOI: Compositional Human-Object Interaction via Reusable Primitives

Continuous-Time Human Motion Field from Event Cameras

GENMO: A GENeralist Model for Human MOtion

Efficient Track Anything

HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID

Multi-Object Sketch Animation by Scene Decomposition and Motion Planning

ISP2HRNet: Learning to Reconstruct High Resolution Image from Irregularly Sampled Pixels via Hierarchical Gradient Learning

Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

GameFactory: Creating New Games with Generative Interactive Videos

FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

EAMamba: Efficient All-Around Vision State Space Model for Image Restoration

SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

Fast Image Super-Resolution via Consistency Rectified Flow

Event-guided HDR Reconstruction with Diffusion Priors

Learning Efficient and Generalizable Human Representation with Human Gaussian Model

SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models

AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance

Robust Adverse Weather Removal via Spectral-based Spatial Grouping

Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos

DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image

Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation

Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation

Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Skeleton Motion Words for Unsupervised Skeleton-based Temporal Action Segmentation

DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation

Synthetic Video Enhances Physical Fidelity in Video Synthesis

TimeBooth: Disentangled Facial Invariant Representation for Diverse and Personalized Face Aging

Identity Preserving 3D Head Stylization with Multiview Score Distillation

IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising

Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

Towards Efficient General Feature Prediction in Masked Skeleton Modeling

How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes

VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

Occlusion-robust Stylization for Drawing-based 3D Animation

Video Individual Counting for Moving Drones

NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

HADES: Human Avatar with Dynamic Explicit Hair Strands

FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems

ZFusion: Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

DreamRelation: Relation-Centric Video Customization

ModSkill: Physical Character Skill Modularization

Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation

Learning A Unified Template for Gait Recognition

Synchronization of Multiple Videos

DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

VertexRegen: Mesh Generation with Continuous Level of Detail

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion

Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections

Precise Action-to-Video Generation Through Visual Action Prompts

PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training

GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar

Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration

GAS: Generative Avatar Synthesis from a Single Image

Less Static, More Private: Towards Transferable Privacy-Preserving Action Recognition by Generative Decoupled Learning

Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

Blind2Sound: Self-Supervised Image Denoising without Residual Noise

Unified Multimodal Understanding via Byte-Pair Visual Encoding

IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A

Privacy-centric Deep Motion Retargeting for Anonymization of Skeleton-Based Motion Visualization

AdaDCP: Learning an Adapter with Discrete Cosine Prior for Clear-to-Adverse Domain Generalization

MorphoGen: Efficient Unconditional Generation of Long-Range Projection Neuronal Morphology via a Global-to-Local Framework

GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars

A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

Capturing head avatar with hand contacts from a monocular video

Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images

GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control

MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

UniRes: Universal Image Restoration for Complex Degradations

SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing

SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion

I2V3D: Controllable Image-to-video Generation with 3D Guidance

FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors

CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

Controllable Weather Synthesis and Removal with Video Diffusion Models

Sequential Gaussian Avatars with Hierarchical Motion Context

TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

T2Bs: Text-to-Character Blendshapes via Video Generation

Unfolding-Associative Encoder-Decoder Network with Progressive Alignment for Pansharpening

MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

LOMM: Latest Object Memory Management for Temporally Consistent Video Instance Segmentation

VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction

EVDM: Event-based Real-world Video Deblurring with Mamba

iManip: Skill-Incremental Learning for Robotic Manipulation

Q-Norm: Robust Representation Learning via Quality-Adaptive Normalization

Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction

MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

π-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?

SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions

RobAVA: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding

IDFace: Face Template Protection for Efficient and Secure Identification

Not All Degradations Are Equal: A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution

I2VControl: Disentangled and Unified Video Motion Synthesis Control

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

On-Device Diffusion Transformer Policy for Efficient Robot Manipulation

Generic Event Boundary Detection via Denoising Diffusion

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

SHeaP: Self-supervised Head Geometry Predictor Learned via 2D Gaussians

TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars

Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration

IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution

Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling

NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling

PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks

MistSense: Versatile Online Detection of Procedural and Execution Mistakes

SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting

LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables

Morph: A Motion-free Physics Optimization Framework for Human Motion Generation

DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding

DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

AnimalClue: Recognizing Animals by their Traces

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models

OminiControl: Minimal and Universal Control for Diffusion Transformer

Penalizing Boundary Activation for Object Completeness in Diffusion Models

RayZer: A Self-supervised Large View Synthesis Model

MatchDiffusion: Training-free Generation of Match-Cuts

Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Straighten Viscous Rectified Flow via Noise Optimization

Scalable Dual Fingerprinting for Hierarchical Attribution of Text-to-Image Models

QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

CRAM: Large Scale Video Continual Learning with Bootstrapped Compression

Tree-NeRV: Efficient Non-Uniform Sampling for Neural Video Representation via Tree-Structured Feature Grids

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

ForCenNet: Foreground-Centric Network for Document Image Rectification

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling

CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation

SDMatte: Grafting Diffusion Models for Interactive Matting

Adaptive Caching for Faster Video Generation with Diffusion Transformers

CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Edicho: Consistent Image Editing in the Wild

LUSD: Localized Update Score Distillation for Text-Guided Image Editing

FlowChef: Steering of Rectified Flow Models for Controlled Generations

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models

Grouped Speculative Decoding for Autoregressive Image Generation

Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

SynTag: Enhancing the Geometric Robustness of Inversion-based Generative Image Watermarking

Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models

Text Embedding Knows How to Quantize Text-Guided Diffusion Models

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

NeuralSVG: An Implicit Representation for Text-to-Vector Generation

IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models

Global and Local Entailment Learning for Natural World Imagery

Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

Anti-Tamper Protection for Unauthorized Individual Image Generation

Continual Personalization for Diffusion Models

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

Spectral Image Tokenizer

QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning

SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Split-and-Combine: Enhancing Style Augmentation for Single Domain Generalization

RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

Zero-Shot Depth Aware Image Editing with Diffusion Models

StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance

TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

Who Controls the Authorization? Invertible Networks for Copyright Protection in Text-to-Image Synthesis

SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation

MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion

Magic Insert: Style-Aware Drag-and-Drop

DIVE: Taming DINO for Subject-Driven Video Editing

FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process

PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask

IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting

TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

Beyond Perspective: Neural 360-Degree Video Compression

MCID: Multi-aspect Copyright Infringement Detection for Generated Images

Text2Outfit: Controllable Outfit Generation with Multimodal Language Models

Outlier-Aware Post-Training Quantization for Image Super-Resolution

DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

STIV: Scalable Text and Image Conditioned Video Generation

D3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models

One-Step Specular Highlight Removal with Adapted Diffusion Models

DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting

Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

MV-Adapter: Multi-View Consistent Image Generation Made Easy

On Large Multimodal Models as Open-World Image Classifiers

VACE: All-in-One Video Creation and Editing

DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Cross-Granularity Online Optimization with Masked Compensated Information for Learned Image Compression

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Deeply Supervised Flow-Based Generative Models

Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation

ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Edit360: 2D Image Edits to 3D Assets from Any Angle

FlowTok: Flowing Seamlessly Across Text and Image Tokens

TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

FiVE-Bench: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

Co-Painter: Fine-Grained Controllable Image Stylization via Implicit Decoupling and Adaptive Injection

PLA: Prompt Learning Attack against Text-to-Image Generative Models

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Holistic Tokenizer for Autoregressive Image Generation

Toward Better Out-painting: Improving the Image Composition with Initialization Policy Model

From Image to Video: An Empirical Study of Diffusion Representations

Versatile Transition Generation with Image-to-Video Diffusion

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

DiffIP: Representation Fingerprints for Robust IP Protection of Diffusion Models

FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild

Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection

HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation

Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

Rectifying Magnitude Neglect in Linear Attention

RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models

LOTA: Bit-Planes Guided AI-Generated Image Detection

Balanced Image Stylization with Style Matching Score

Trade-offs in Image Generation: How Do Different Dimensions Interact?

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Long Context Tuning for Video Generation

DreamFuse: Adaptive Image Fusion with Diffusion Transformer

AnyI2V: Animating Any Conditional Image with Motion Control

EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation

Instruction-based Image Editing with Planning, Reasoning, and Generation

HDR Image Generation via Gain Map Decomposed Diffusion

ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning

Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention

Training-Free Text-Guided Image Editing with Visual Autoregressive Model

Accelerating Diffusion Transformer via Gradient-Optimized Cache

The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation

Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

ArtEditor: Learning Customized Instructional Image Editor from Few-Shot Examples

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

A3GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting

LayerD: Decomposing Raster Graphic Designs into Layers

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Subjective Camera 1.0: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

GlassWizard: Harvesting Diffusion Priors for Glass Surface Detection

Zero-Shot Compositional Video Learning with Coding Rate Reduction

FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models

HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding

DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images

Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Free2Guide: Training-Free Text-to-Video Alignment using Image LVLM

InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis

VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE

SpecGuard: Spectral Projection-based Advanced Invisible Watermarking

DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models

Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts

GFPack++: Attention-Driven Gradient Fields for Optimizing 2D Irregular Packing

Denoising Token Prediction in Masked Autoregressive Models

LACONIC: A 3D Layout Adapter for Controllable Image Creation

Preserve Anything: Controllable Image Synthesis with Object Preservation

Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation

FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

CompleteMe: Reference-based Human Image Completion

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

EEGMirror: Leveraging EEG data in the wild via Montage-Agnostic Self-Supervision for EEG to Video Decoding

Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer

UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation

UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

Semantic Discrepancy-aware Detector for Image Forgery Identification

Scalable Ranked Preference Optimization for Text-to-Image Generation

FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions

Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

FonTS: Text Rendering With Typography and Style Controls

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching

G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection

PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Gain-MLP: Improving HDR Gain Map Encoding via a Lightweight MLP

From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition

Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Sparse Fine-Tuning of Transformers for Generative Tasks

FlexGen: Flexible Multi-View Generation from Text and Image Inputs

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Learning Implicit Features with Flow-Infused Transformations for Realistic Virtual Try-On

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Learned Image Compression with Hierarchical Progressive Context Modeling

Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing

Teleportraits: Training-Free People Insertion into Any Scene

DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing

Context Guided Transformer Entropy Modeling for Video Compression

UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint

DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Bi-Level Optimization for Self-Supervised AI-Generated Face Detection

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Neighboring Autoregressive Modeling for Efficient Visual Generation

FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting

QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing

Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation

Always Skip Attention

BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks

Blended Point Cloud Diffusion for Localized Text-guided Shape Editing

VSC: Visual Search Compositional Text-to-Image Diffusion Model

Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Pretrained Reversible Generation as Unsupervised Visual Representation Learning

DLF: Extreme Image Compression with Dual-generative Latent Fusion

Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI

Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation

PixTalk: Controlling Photorealistic Image Processing and Editing with Language

ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement

Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification

A Unified Framework for Industrial Cel-Animation Colorization with Temporal-Structural Awareness

Generative Video Bi-flow

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

LayerLock: Non-collapsing Representation Learning with Progressive Freezing

Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

JPEG Processing Neural Operator for Backward-Compatible Coding

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

All Parts Matter: A Unified Mask-Free Virtual Try-On Framework

Function-centric Bayesian Network for Zero-Shot Object Goal Navigation

Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!

Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models

LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions

DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

MH-LVC: Multi-Hypothesis Temporal Prediction for Learned Conditional Residual Video Coding

Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

On the Provable Importance of Gradients for Autonomous Language-Assisted Image Clustering

Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos

CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

An Efficient Hybrid Vision Transformer for TinyML Applications

Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction

CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts

Visual Test-time Scaling for GUI Agent Grounding

Multi-Schema Proximity Network for Composed Image Retrieval

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

DiSCO-3D: Discovering and Segmenting Sub-Concepts from Open-vocabulary Queries in NeRF

ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection

Test-time Adaptation for Foundation Medical Segmentation Model Without Parametric Updates

ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers

M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast

Moment Quantization for Video Temporal Grounding

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation

S⁴M: Boosting Semi-Supervised Instance Segmentation with SAM

Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning

ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation

DiffPS: Leveraging Prior Knowledge of Diffusion Model for Person Search

Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

OVG-HQ: Online Video Grounding with Hybrid-modal Queries

Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation

Representation Shift: Unifying Token Compression with FlashAttention

ZipVL: Accelerating Vision-Language Models through Dynamic Token Sparsity

ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts

LaCoOT: Layer Collapse through Optimal Transport

Fuzzy Contrastive Decoding to Alleviate Object Hallucination in Large Vision-Language Models

Semantic versus Identity: A Divide-and-Conquer Approach towards Adjustable Medical Image De-Identification

Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement

Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Superpowering Open-Vocabulary Object Detectors for X-ray Vision

RhythmGaussian: Repurposing Generalizable Gaussian Model For Remote Physiological Measurement

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale

On the Recovery of Cameras from Fundamental Matrices

Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks

MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance

The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning

CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

Robustifying Zero-Shot Vision Language Models by Subspaces Alignment

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention

SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images

Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models

OuroMamba: A Data-Free Quantization Framework for Vision Mamba

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

SAMora: Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images

FE-CLIP: Frequency Enhanced CLIP Model for Zero-Shot Anomaly Detection and Segmentation

Referring Expression Comprehension for Small Objects

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

Text-guided Visual Prompt DINO for Generic Segmentation

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows

MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs

Cracking Instance Jigsaw Puzzles: A Superior Alternative to Multiple Instance Learning for Whole Slide Image Analysis

STDDNet: Harnessing Mamba for Video Polyp Segmentation via Spatial-aligned Temporal Modeling and Discriminative Dynamic Representation Learning

FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

Sparse-Dense Side-Tuner for efficient Video Temporal Grounding

Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation

DecAD: Decoupling Anomalies in Latent Space for Multi-Class Unsupervised Anomaly Detection

Few-Shot Pattern Detection via Template Matching and Regression

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text matching

RA-BUSSeg: Relation-aware Semi-supervised Breast Ultrasound Image Segmentation via Adjacent Propagation and Cross-layer Alignment

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation

Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

VideoAds for Fast-Paced Video Understanding

Auto-Controlled Image Perception in MLLMs via Visual Perception Tokens

Refer to Any Segmentation Mask Group With Vision-Language Prompts

Triad: Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process

Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding

DisTime: Distribution-based Time Representation for Video Large Language Models

WeaveSeg: Iterative Contrast-weaving and Spectral Feature-refining for Nuclei Instance Segmentation

How Can Objects Help Video-Language Understanding?

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

CARIM: Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching

Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Modeling Saliency Dataset Bias

Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection

Advancing Visual Large Language Model for Multi-granular Versatile Perception

Controllable Latent Space Augmentation for Digital Pathology

PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction

Balanced Sharpness-Aware Minimization for Imbalanced Regression

MIEB: Massive Image Embedding Benchmark

Interpretable point cloud classification using multiple instance learning

Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts

Progressive Test Time Energy Adaptation for Medical Image Segmentation

Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment

SignRep: Enhancing Self-Supervised Sign Representations

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Learning Beyond Still Frames: Scaling Vision-Language Models with Video

Is CLIP ideal? No. Can we fix it? Yes!

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Dynamic Dictionary Learning for Remote Sensing Image Segmentation

Temporal-aware Query Routing for Real-time Video Instance Segmentation

Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference

Towards Fine-grained Interactive Segmentation in Images and Videos

Learnable Retrieval Enhanced Visual-Text Alignment and Fusion for Radiology Report Generation

Generalizable Object Re-Identification via Visual In-Context Prompting

TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

Anomaly Detection of Integrated Circuits Package Substrates Using the Large Vision Model SAIC: Dataset Construction, Methodology, and Application

Streaming VideoLLMs for Real-Time Procedural Video Understanding

Prompt-driven Transferable Adversarial Attack on Person Re-Identification with Attribute-aware Textual Inversion

FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Aligning Effective Tokens with Video Anomaly in Large Language Models

No More Sibling Rivalry: Debiasing Human-Object Interaction Detection

Borrowing Eyes for the Blind Spot: Overcoming Data Scarcity in Malicious Video Detection via Cross-Domain Retrieval Augmentation

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Sim-DETR: Unlock DETR for Temporal Sentence Grounding

ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

DIH-CLIP: Unleashing the Diversity of Multi-Head Self-Attention for Training-Free Open-Vocabulary Semantic Segmentation

Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs

HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

LVBench: An Extreme Long Video Understanding Benchmark

Debiasing Trace Guidance: Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection

Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations

MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning

ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches

Similarity Memory Prior is All You Need for Medical Image Segmentation

CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model

Bringing RNNs Back to Efficient Open-Ended Video Understanding

Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning

SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

Cross-Architecture Distillation Made Simple with Redundancy Suppression

DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation

FIND: Few-Shot Anomaly Inspection with Normal-Only Multi-Modal Data

VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference

Unsupervised Histopathological Image Semantic Segmentation with Overlapping Patches Consistency Constraint

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents

VIPerson: Flexibly Generating Virtual Identity for Person Re-Identification

Towards Robustness of Person Search against Corruptions

Flow-MIL: Constructing Highly-expressive Latent Feature Space For Whole Slide Image Classification Using Normalizing Flow

HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss

CompCap: Improving Multimodal Large Language Models with Composite Captions

Stable Diffusion Models are Secretly Good at Visual In-Context Learning

Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation

Seeing the Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation

ViLLa: Video Reasoning Segmentation with Large Language Model

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Object-level Correlation for Few-Shot Segmentation

Vision-Language Neural Graph Featurization for Extracting Retinal Lesions

SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting

RadGPT: Constructing 3D Image-Text Tumor Datasets

LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

An OpenMind for 3D Medical Vision Self-supervised Learning

OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration

MINERVA: Evaluating Complex Video Reasoning

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in Medical Image Segmentation with Learnable Prior

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality

NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

Emulating Self-attention with Convolution for Efficient Image Super-Resolution

Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal

Vision-Language Models Can't See the Obvious

Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Kaputt: A Large-Scale Dataset for Visual Defect Detection

ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models

Auto-Vocabulary Semantic Segmentation

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning

Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training

Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Breaking Grid Constraints: Dynamic Graph Reconstruction Network for Multi-organ Segmentation

MaskSAM: Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation

Large-scale Pre-training for Grounded Video Caption Generation

MEH: A Multi-Style Dataset and Toolkit for Advancing Egyptian Hieroglyph Recognition

Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval

Unbiased Missing-modality Multimodal Learning

ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba

Axis-level Symmetry Detection with Group-Equivariant Representation

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes

YOLOE: Real-Time Seeing Anything

Mixture-of-Scores: Robust Image-Text Data Valuation via Three Lines of Code

Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration

AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Bolt3D: Generating 3D Scenes in Seconds

Semantic Causality-Aware Vision-Based 3D Occupancy Prediction

U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Large Scene Generation with Cube-Absorb Discrete Diffusion

Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging

RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters

SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

LookOut: Real-World Humanoid Egocentric Navigation

Occupancy Learning with Spatiotemporal Memory

PointGAC: Geometric-Aware Codebook for Masked Point Modeling

Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

PRM: Photometric Stereo based Large Reconstruction Model

4D Gaussian Splatting SLAM

Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors

Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors

Dual-S3D: Hierarchical Dual-Path Selective SSM-CNN for High-Fidelity Implicit Reconstruction

AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction

RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

Gaussian Splatting with Discretized SDF for Relightable Assets

MMGeo: Multimodal Compositional Geo-Localization for UAVs

AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes

SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration

Benchmarking Egocentric Visual-Inertial SLAM at City Scale

Neural Shell Texture Splatting: More Details and Fewer Primitives

Gaussian-based World Model: Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

A Real-world Display Inverse Rendering Dataset

RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion

Federated Domain Generalization with Domain-specific Soft Prompts Generation

GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion

GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views

REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

Towards Safer and Understandable Driver Intention Prediction

V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals

EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction

Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

Lifting the Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling

Global Regulation and Excitation via Attention Tuning for Stereo Matching

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder

RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

Semantic-guided Camera Ray Regression for Visual Localization

SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting

Polarimetric Neural Field via Unified Complex-Valued Wave Representation

High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach

AutoScape: Geometry-Consistent Long-Horizon Scene Generation

From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos

Street Gaussians without 3D Object Tracker

HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity

RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Scene Coordinate Reconstruction Priors

Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations

I2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation

RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction

TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Removing Out-of-Focus Reflective Flares via Color Alignment

Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge

Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting

MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy

CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving

Free-running vs Synchronous: Single-Photon Lidar for High-flux 3D Imaging

Mitigating Geometric Degradation in Fast DownSampling via FastAdapter for Point Cloud Segmentation

Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising

ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

Discontinuity-aware Normal Integration for Generic Central Camera Models

SEHDR: Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing

SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation

TARS: Traffic-Aware Radar Scene Flow Estimation

DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection

Leaps and Bounds: An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction

GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion

Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning

OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching

DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model

EDM: Efficient Deep Feature Matching

GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors

NeRF Is a Valuable Assistant for 3D Gaussian Splatting

UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images

TOTP: Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion

UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields

MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting

StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting

TurboReg: TurboClique for Robust and Efficient Point Cloud Registration

Efficient Spiking Point Mamba for Point Cloud Analysis

SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images

Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves

GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections

PolarAnything: Diffusion-based Polarimetric Image Synthesis

LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions

ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models

MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction

Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models

Towards Open-World Generation of Stereo Images and Unsupervised Matching

LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation

ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping

MiDSummer: Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation

Spatio-Spectral Pattern Illumination for Direct and Indirect Separation from a Single Hyperspectral Image

Adversarial Exploitation of Data Diversity Improves Visual Localization

GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-based Transformer

Tile-wise vs. Image-wise: Random-Tile Loss and Training Paradigm for Gaussian Splatting

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Explaining Human Preferences via Metrics for Structured 3D Reconstruction

CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

Inverse 3D Microscopy Rendering for Cell Shape Inference with Active Mesh

GaussRender: Learning 3D Occupancy with Gaussian Rendering

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis

ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors

LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching

End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Planar Affine Rectification from Local Change of Scale and Orientation

ERNet: Efficient Non-Rigid Registration Network for Point Sequences

SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection

Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

Epona: Autoregressive Diffusion World Model for Autonomous Driving

Leveraging Local Patch Alignment to Seam-cutting for Large Parallax Image Stitching

InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

SynCity: Training-Free Generation of 3D Cities

PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions

ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching

RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video

Thermal Polarimetric Multi-view Stereo

StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

SFUOD: Source-Free Unknown Object Detection

MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography

GenFlow3D: Generative Scene Flow Estimation and Prediction on Point Cloud Sequences

Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes

Tree Skeletonization from 3D Point Clouds by Denoising Diffusion

Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping

Purge-Gate: Efficient Backpropagation-Free Test-Time Adaptation for Point Clouds via Token Purging

AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns

FlowR: Flowing from Sparse to Dense 3D Reconstructions

WorldScore: Unified Evaluation Benchmark for World Generation

LightSwitch: Multi-view Relighting with Material-guided Diffusion

Decoupled Diffusion Sparks Adaptive Scene Generation

Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model

QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

SP2T: Sparse Proxy Attention for Dual-stream Point Transformer

Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting

CF3: Compact and Fast 3D Feature Fields

When Anchors Meet Cold Diffusion: A Multi-Stage Approach to Lane Detection

2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

Faster and Better 3D Splatting via Group Training

Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

NeuFrameQ: Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation

RTMap: Real-Time Recursive Mapping with Change Detection and Localization

Controllable 3D Outdoor Scene Generation via Scene Graphs

PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Driving View Synthesis on Free-form Trajectories with Generative Prior

Wasserstein Style Distribution Analysis and Transform for Stylized Image Generation

Constraint-Aware Feature Learning for Parametric Point Cloud

NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement

ZeroStereo: Zero-shot Stereo Matching from Single Images

CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

Stochastic Gradient Estimation for Higher-Order Differentiable Rendering

CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives

Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes

Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception

ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration

Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding

Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching

R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception

V2XScenes: A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception

Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

G2D: Boosting Multimodal Learning with Gradient-Guided Distillation

Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency

SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

SAM4D: Segment Anything in Camera and LiDAR Streams

Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models

LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning

Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing

DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception

Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction

Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views

A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Extrapolated Urban View Synthesis Benchmark

Heatmap Regression without Soft-Argmax for Facial Landmark Detection

Demeter: A Parametric Model of Crop Plant Morphology from the Real World

Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation

FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images

HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes

Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving

BANet: Bilateral Aggregation Network for Mobile Stereo Matching

Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions

Authentic 4D Driving Simulation with a Video Generation Model

DONUT: A Decoder-Only Model for Trajectory Prediction

Lidar Waveforms are Worth 40x128x33 Words

Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation

PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering

Wide2Long: Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation

EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection

Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

Super Resolved Imaging with Adaptive Optics

HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network

Stealthy Backdoor Attack in Federated Learning via Adaptive Layer-wise Gradient Alignment

VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis

Importance-Based Token Merging for Efficient Image and Video Generation

Knowledge Distillation for Learned Image Compression

Variance-Based Pruning for Accelerating and Compressing Trained Networks

LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars

Understanding Co-speech Gestures in-the-wild

DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

Towards a Unified Copernicus Foundation Model for Earth Vision

Teeth Reconstruction and Performance Capture Using a Phone Camera

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Spatially-Varying Autofocus

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

RePoseD: Efficient Relative Pose Estimation With Known Depth Information

Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Certifiably Optimal Anisotropic Rotation Averaging

MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration

MikuDance: Animating Character Art with Mixed Motion Dynamics

ROAR: Reducing Inversion Error in Generative Image Watermarking

Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

SuperDec: 3D Scene Decomposition with Superquadrics Primitives

E-SAM: Training-Free Segment Every Entity Model

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Towards Foundational Models for Single-Chip Radar

Make Your Training Flexible: Towards Deployment-Efficient Video Models

M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization

Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description

What You Have is What You Track: Adaptive and Robust Multimodal Tracking

Low-Light Image Enhancement using Event-Based Illumination Estimation

Multi-Modal Few-Shot Temporal Action Segmentation

WildSAT: Learning Satellite Image Representations from Wildlife Observations

Forgetting Through Transforming: Enabling Federated Unlearning via Class-Aware Representation Transformation

SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations

SpectralAR: Spectral Autoregressive Visual Generation

Sibai: A Few-Shot Meta-Classifier for Poisoning Detection in Federated Learning

Gradient Extrapolation for Debiased Representation Learning

Supercharging Floorplan Localization with Semantic Rays

Learning Streaming Video Representation via Multitask Training

InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow

World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation

Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data

Learning to See in the Extremely Dark

Customizing Domain Adapters for Domain Generalization

BATCLIP: Bimodal Online Test-Time Adaptation for CLIP

BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model

SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions

PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image

Cross-Subject Mind Decoding from Inaccurate Representations

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference

LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching

UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation

FlowStyler: Artistic Video Stylization via Transformation Fields Transports

ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer

Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling

Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective

StableCodec: Taming One-Step Diffusion for Extreme Image Compression

FastJSMA: Accelerating Jacobian-based Saliency Map Attacks through Gradient Decoupling

Toward Fair and Accurate Cross-Domain Medical Image Segmentation: A VLM-Driven Active Domain Adaptation Paradigm

Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion

Federated Continuous Category Discovery and Learning

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Consensus-Driven Active Model Selection

BlueNeg: A 35mm Negative Film Dataset for Restoring Channel-Heterogeneous Deterioration

Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors

Make Me Happier: Evoking Emotions Through Image Diffusion Models

Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception

What we need is explicit controllability: Training 3D gaze estimator using only facial images

SemiVisBooster: Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance

OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection

Hypergraph Clustering Network with Partial Attribute Imputation

Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition

Object-centric Video Question Answering with Visual Grounding and Referring

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds

VMBench: A Benchmark for Perception-Aligned Video Motion Generation

UAVScenes: A Multi-Modal Dataset for UAVs

LIRA: Reasoning Reconstruction via Multimodal Large Language Models

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments

TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models

Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition

Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting

Boosting Adversarial Transferability via Negative Hessian Trace Regularization

FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

SIC: Similarity-Based Interpretable Image Classification with Neural Networks

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent

Event-based Tiny Object Detection: A Benchmark Dataset and Baselines

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation

EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints

Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features

Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

CoSMIC: Continual Self-supervised Learning for Multi-Domain Medical Imaging via Conditional Mutual Information Maximization

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

SEAL: Semantic Aware Image Watermarking

ArchiSet: Benchmarking Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios

SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing

Unsupervised Identification of Protein Compositions and Conformations via Implicit Content-Transformation Disentanglement

SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting

Splat-based 3D Scene Reconstruction with Extreme Motion-blur

Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

DMesh++: An Efficient Differentiable Mesh for Complex Shapes

Advancing Textual Prompt Learning with Anchored Attributes

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance

AdsQA: Towards Advertisement Video Understanding

Memory-Efficient Generative Models via Product Quantization

ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection

Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis

Multimodal Prompt Alignment for Facial Expression Recognition

CogCM: Cognition-Inspired Contextual Modeling for Audio-Visual Speech Enhancement

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction

Leveraging Debiased Cross-modal Attention Maps and Code-based Reasoning for Zero-shot Referring Expression Comprehension

UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

Improving Multimodal Learning via Imbalanced Learning

SITE: towards Spatial Intelligence Thorough Evaluation

SHIFT: Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models

Stable Score Distillation

Synergistic Prompting for Robust Visual Recognition with Missing Modalities

Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

Automated Red Teaming for Text-to-Image Models through Feedback-Guided Prompt Iteration with Vision-Language Models

RAGD: Regional-Aware Diffusion Model for Text-to-Image Generation

Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation

Knowledge Distillation with Refined Logits

Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection

BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting

Domain Generalizable Portrait Style Transfer

PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model

Diffusion Image Prior

Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting

HERO: Human Reaction Generation from Videos

Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method

A Unified Interpretation of Training-Time Out-of-Distribution Detection

VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation

G2PDiffusion: Cross-species Genotype-to-Phenotype Prediction via Evolutionary Diffusion

Mamba-3VL: Taming State Space Model for 3D Vision Language Learning

Embodied Representation Alignment with Mirror Neurons

Referring to Any Person

Selective Contrastive Learning for Weakly Supervised Affordance Grounding

CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking

MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Task-Specific Zero-shot Quantization-Aware Training for Object Detection

Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup

EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation

MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking

Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset

AG2aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing

Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision

InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior

CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Benchmarking Multimodal Large Language Models Against Image Corruptions

Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection

DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover

Latent Expression Generation for Referring Image Segmentation and Grounding

LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion

InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

Efficient Fine-Tuning of Large Models via Nested Low-Rank Adaptation

Dual-level Prototype Learning for Composite Degraded Image Restoration

Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation

BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Deterministic Object Pose Confidence Region Estimation

Online Language Splatting

JailbreakDiffBench: A Comprehensive Benchmark for Jailbreaking Diffusion Models

Efficient Input-level Backdoor Defense on Text-to-Image Synthesis via Neuron Activation Variation

Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning

Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

GReg: Geometry-Aware Region Refinement for Sign Language Video Generation

Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints

NETracer: A Topology-Aware Iterative Tracing Approach for Tubular Structure Extraction

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

MotionCtrl: A Real-time Controllable Vision-Language-Motion Model

UIPro: Unleashing Superior Interaction Capability For GUI Agents

SALAD -- Semantics-Aware Logical Anomaly Detection

FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing

FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Visual Relation Diffusion for Human-Object Interaction Detection

Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting

VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving

Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Knowledge Transfer from Interaction Learning

WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction

GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Multi-modal Segment Anything Model for Camouflaged Scene Segmentation

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

MagicColor: Multi-instance Sketch Colorization

Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Cassic: Towards Content-Adaptive State-Space Models for Learned Image Compression

Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation

AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images

OneGT: One-Shot Geometry-Texture Neural Rendering for Head Avatars

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Unsupervised Visible-Infrared Person Re-identification under Unpaired Settings

Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection

Can We Achieve Efficient Diffusion Without Self-Attention? Distilling Self-Attention into Convolutions

Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform

RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

PixelStitch: Structure-Preserving Pixel-Wise Bidirectional Warps for Unsupervised Image Stitching

A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization

Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation

AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation

Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation

OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics

Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

S³E: Self-Supervised State Estimation for Radar-Inertial System

Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss

Scalable Image Tokenization with Index Backpropagation Quantization

BVINet: Unlocking Blind Video Inpainting with Zero Annotations

Coupling the Generator with Teacher for Effective Data-Free Knowledge Distillation

Towards a Universal Image Degradation Model via Content-Degradation Disentanglement

Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery

HUST: High-Fidelity Unbiased Skin Tone Estimation via Texture Quantization

One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution

Video Color Grading via Look-Up Table Generation

Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

ProbMED: A Probabilistic Framework for Medical Multimodal Binding

You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data

FDPT: Federated Discrete Prompt Tuning for Black-Box Visual-Language Models

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning

Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

A Tiny Change, A Giant Leap: Long-Tailed Class-Incremental Learning via Geometric Prototype Alignment

CountSE: Soft Exemplar Open-set Object Counting

Sparfels: Fast Reconstruction from Sparse Unposed Imagery

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

Learning 4D Embodied World Models

MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model

Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View

ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

Underwater Visual SLAM with Depth Uncertainty and Medium Modeling

MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

Region-Level Data Attribution for Text-to-Image Generative Models

Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting

Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Generalization-Preserved Learning: Closing the Backdoor to Catastrophic Forgetting in Continual Deepfake Detection

Open-Vocabulary Octree-Graph for 3D Scene Understanding

LangBridge: Interpreting Image as a Combination of Language Embeddings

IGD: Instructional Graphic Design with Multimodal Layer Generation

Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering

Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge

Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery

DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering

Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception

Learning Normal Flow Directly From Events

CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

AIRA: Activation-Informed Low-Rank Adaptation for Large Models

Robust Unfolding Network for HDR Imaging with Modulo Cameras

Embodied Navigation with Auxiliary Task of Action Description Prediction

IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

An Inversion-based Measure of Memorization for Diffusion Models

TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Face Retouching with Diffusion Data Generation and Spectral Restorement

SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents

HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving

Contrastive Flow Matching

Class Token as Proxy: Optimal Transport-assisted Proxy Learning for Weakly Supervised Semantic Segmentation

Neural Compression for 3D Geometry Sets

DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning

AllGCD: Leveraging All Unlabeled Data for Generalized Category Discovery

Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory

UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

OmniVTON: Training-Free Universal Virtual Try-On

FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene

CopyrightShield: Enhancing Diffusion Model Security Against Copyright Infringement Attacks

CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training

Learnable Logit Adjustment for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch

SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images

Dataset Ownership Verification for Pre-trained Masked Models

CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

TCFG: Truncated Classifier-Free Guidance for Efficient and Scalable Text-to-Image Acceleration

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner

FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation

MSA2: Multi-task Framework with Structure-aware and Style-adaptive Character Representation for Open-set Chinese Text Recognition

DiffPCI: Large Motion Point Cloud Frame Interpolation with Diffusion Model

GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations

Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation

MultiModal Action Conditioned Video Simulation

Local Dense Logit Relations for Enhanced Knowledge Distillation

FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Soft Local Completeness: Rethinking Completeness in XAI

ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring

PBFG: A New Physically-Based Dataset and Removal of Lens Flares and Glares

Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

An Information-Theoretic Regularizer for Lossy Neural Image Compression

Knowledge-Guided Part Segmentation

Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation

UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions

KV-Edit: Training-Free Image Editing for Precise Background Preservation

FusionPhys: A Flexible Framework for Fusing Complementary Sensing Modalities in Remote Physiological Measurement

You Think, You ACT: The New Task of Arbitrary Text to Motion Generation

DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

End-to-End Multi-Modal Diffusion Mamba

PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection

Adapting In-Domain Few-Shot Segmentation to New Domains without Source Domain Retraining

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching

DADet: Safeguarding Image Conditional Diffusion Models against Adversarial and Backdoor Attacks via Diffusion Anomaly Detection

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Rethinking Layered Graphic Design Generation with a Top-Down Approach

LEGO-Maker: A Semantic-Driven Algorithm for Text-to-3D Generation

COVTrack: Continuous Open-Vocabulary Tracking via Adaptive Multi-Cue Fusion

Dense Policy: Bidirectional Autoregressive Learning of Actions

monoVLN: Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation

3D Mesh Editing using Masked LRMs

DOGR: Towards Versatile Visual Document Grounding and Referring

ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos

PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations

Performing Defocus Deblurring by Modeling its Formation Process

CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance

Supervised Exploratory Learning for Long-Tailed Visual Recognition

Membership Inference Attacks with False Discovery Rate Control

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Blind Video Super-Resolution based on Implicit Kernels

OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts

PLMP - Point-Line Minimal Problems for Projective SfM

More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning

SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition

TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

DCHM: Depth-Consistent Human Modeling for Multiview Detection

Adversarial Robustness of Discriminative Self-Supervised Learning in Vision

HPSv3: Towards Wide-Spectrum Human Preference Score

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Active Perception Meets Rule-Guided RL: A Two-Phase Approach for Precise Object Navigation in Complex Environments

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

UNIS: A Unified Framework for Achieving Unbiased Neural Implicit Surfaces in Volume Rendering

Loss Functions for Predictor-based Neural Architecture Search

Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

Dual-Process Image Generation

IntrinsicControlNet: Cross-distribution Image Generation with Real and Unreal

Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images

Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation

Decoding Correlation-Induced Misalignment in the Stable Diffusion Workflow for Text-to-Image Generation

Steering Guidance for Personalized Text-to-Image Diffusion Models

ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision

MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction

Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds

Spatial-Temporal Aware Visuomotor Diffusion Policy Learning

GaussianReg: Rapid 2D/3D Registration for Emergency Surgery via Explicit 3D Modeling with Gaussian Primitives

Learning Robust Image Watermarking with Lossless Cover Recovery

ArgoTweak: Towards Self-Updating HD Maps through Structured Priors

Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime

Context-Aware Academic Emotion Dataset and Benchmark

FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging

SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection

Snakes and Ladders: Two Steps Up for VideoMamba

Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration

COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets

NATRA: Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations

MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling

General Compression Framework for Efficient Transformer Object Tracking

UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation

FedXDS: Leveraging Model Attribution Methods to Counteract Data Heterogeneity in Federated Learning

Hybrid Layout Control for Diffusion Transformer: Fewer Annotations, Superior Aesthetics

PLAN: Proactive Low-Rank Allocation for Continual Learning

Leveraging Spatial Invariance to Boost Adversarial Transferability

AnyPortal: Zero-Shot Consistent Video Background Replacement

Textured 3D Regenerative Morphing with 3D Diffusion Prior

Visual Textualization for Image Prompted Object Detection

TerraMind: Large-Scale Generative Multimodality for Earth Observation

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images

ZIM: Zero-Shot Image Matting for Anything

Inference-Time Diffusion Model Distillation

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints

EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

DAViD: Data-efficient and Accurate Vision Models from Synthetic Data

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

Scene Graph Guided Generation: Enable Accurate Relations Generation in Text-to-Image Models via Textural Rectification

ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection

GMMamba: Group Masking Mamba for Whole Slide Image Classification

TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

RareCLIP: Rarity-aware Online Zero-shot Industrial Anomaly Detection

A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

MOSCATO: Predicting Multiple Object State Change Through Actions

Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

Temporal Rate Reduction Clustering for Human Motion Segmentation

Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction

SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications

Backdoor Mitigation by Distance-Driven Detoxification

Democratizing High-Fidelity Co-Speech Gesture Video Generation

UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI

HFD-Teacher: High-Frequency Depth Distillation from Depth Foundation Models for Enhanced Depth Completion

LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Separation for Better Integration: Disentangling Edge and Motion in Event-based Deblurring

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Diversity-Enhanced Distribution Alignment for Dataset Distillation

Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection

SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking

Two Losses, One Goal: Balancing Conflict Gradients for Semi-supervised Semantic Segmentation

Region-based Cluster Discrimination for Visual Representation Learning

CMB-ML: A Cosmic Microwave Background Dataset for the Oldest Possible Computer Vision Task

Adapt Foundational Segmentation Models with Heterogeneous Searching Space

Think Twice: Test-Time Reasoning for Robust CLIP Zero-Shot Classification

Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes

Acknowledging Focus Ambiguity in Visual Questions

Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction

Shape of Motion: 4D Reconstruction from a Single Video

VSSD: Vision Mamba with Non-Causal State Space Duality

EditCLIP: Representation Learning for Image Editing

Counting Stacked Objects

Joint Self-Supervised Video Alignment and Action Segmentation

TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking

Allowing Oscillation Quantization: Overcoming Solution Space Limitation in Low Bit-Width Quantization

MOVE: Motion-Guided Few-Shot Video Object Segmentation

CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework

FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering

SDFormer: Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer

Enhancing Numerical Prediction of MLLMs with Soft Labeling

TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation

RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control

ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving

MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

DeFSS: Image-to-Mask Denoising Learning for Few-shot Segmentation

Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

TAD-E2E: A Large-scale End-to-end Autonomous Driving Dataset

MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching

VAGUE: Visual Contexts Clarify Ambiguous Expressions

Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering

Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene

Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels

VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency

RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning

Multi-scenario Overlapping Text Segmentation with Depth Awareness

Zero-Shot Vision Encoder Grafting via LLM Surrogates

OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection

FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention

SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection

Exploring the Visual Feature Space for Multimodal Neural Decoding

Dataset Distillation via Vision-Language Category Prototype

ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement

ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning

Backdoor Defense via Enhanced Splitting and Trap Isolation

Learning Hierarchical Line Buffer for Image Processing

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery

D3: Training-Free AI-Generated Video Detection Using Second-Order Features

RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Stereo Any Video: Temporally Consistent Stereo Matching

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement

Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration

VideoAuteur: Towards Long Narrative Video Generation

StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

Neural Architecture Search Driven by Locally Guided Diffusion for Personalized Federated Learning

Hierarchical 3D Scene Graphs Construction Outdoors

Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection

From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning

When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation

Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba for End-to-end Whole Slide Image Analysis

Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D

Heavy Labels Out! Dataset Distillation with Label Space Lightening

Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning

CarGait: Cross-Attention based Re-ranking for Gait Recognition

Incremental Few-Shot Semantic Segmentation via Multi-Level Switchable Visual Prompts

ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models

SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

StyleSRN: Scene Text Image Super-Resolution with Text Style Embedding

Frequency-Guided Diffusion for Training-Free Text-Driven Image Translation

Preacher: Paper-to-Video Agentic System

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition

How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes

Cross-Category Subjectivity Generalization for Style-Adaptive Sketch Re-ID

S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction

The Source Image is the Best Attention for Infrared and Visible Image Fusion

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

MR-FIQA: Face Image Quality Assessment with Multi-Reference Representations from Synthetic Data Generation

Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond

Gait-X: Exploring X modality for Generalized Gait Recognition

Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Exploiting Diffusion Prior for Task-driven Image Restoration

Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding

Discretized Gaussian Representation for Tomographic Reconstruction

Wave-MambaAD: Wavelet-driven State Space Model for Multi-class Unsupervised Anomaly Detection

3D Test-time Adaptation via Graph Spectral Driven Point Shift

Task-Decoupled Bézier Surface Constraint for Uneven Low-Light Image Enhancement

EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation

Scaling Laws for Native Multimodal Models

Unlearning the Noisy Correspondence Makes CLIP More Robust

KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding

VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

STEP-DETR: Advancing DETR-based Semi-Supervised Object Detection with Super Teacher and Pseudo-Label Guided Text Queries

Text-to-Any-Skeleton Motion Generation Without Retargeting

Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence

Aligning Global Semantics and Local Textures in Generative Video Enhancement

Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation

Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data

"Principal Components" Enable A New Language of Images

VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching

Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Global-Aware Monocular Semantic Scene Completion with State Space Models

DIMO: Diverse 3D Motion Generation for Arbitrary Objects

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models

Beyond Blur: A Fluid Perspective on Generative Diffusion Models

Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights

Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation

A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields

Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

PVChat: Personalized Video Chat with One-Shot Learning

AIM: Amending Inherent Interpretability via Self-Supervised Masking

From Panels to Prose: Generating Literary Narratives from Comics

GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation

MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

One Last Attention for Your Vision-Language Model

Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

MVGBench: A Comprehensive Benchmark for Multi-view Generation Models

A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

Conditional Visual Autoregressive Modeling for Pathological Image Restoration

Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes

Amodal Depth Anything: Amodal Depth Estimation in the Wild

Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation

EYE3: Turn Anything into Naked-eye 3D

C2MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis

CVPT: Cross Visual Prompt Tuning

SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior

RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

MCOP: Multi-UAV Collaborative Occupancy Prediction

Bayesian-Inspired Space-Time Superpixels

Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning

MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices

Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information

FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

Serialization based Point Cloud Oversegmentation

From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Mitigating Catastrophic Overfitting in Fast Adversarial Training via Label Information Elimination

Consistency Trajectory Matching for One-Step Generative Super-Resolution

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos

NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion

Di[M]O: Distilling Masked Diffusion Models into One-step Generator

Reinforcement Learning-Guided Data Selection via Redundancy Assessment

Φ-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Recognizing Actions from Robotic View for Natural Human-Robot Interaction

Addressing Text Embedding Leakage in Diffusion-based Image Editing

Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

DDB: Diffusion Driven Balancing to Address Spurious Correlations

TurboVSR: Fantastic Video Upscalers and Where to Find Them

Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories

FRET: Feature Redundancy Elimination for Test Time Adaptation

Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generations

AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation

SPA: Efficient User-Preference Alignment against Uncertainty in Medical Image Segmentation

Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator

A₀: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

PVMamba: Parallelizing Vision Mamba via Dynamic State Aggregation

Controllable and Expressive One-Shot Video Head Swapping

CoralSRT: Revisiting Coral Reef Semantic Segmentation by Feature Rectifying via Self-supervised Guidance

Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

Diagnosing Pretrained Models for Out-of-distribution Detection

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification

CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers

RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

Adversarial Training for Probabilistic Robustness

Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry

Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities

LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning

When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection

Boosting Adversarial Transferability via Residual Perturbation Attack

What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement

Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures

CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering

NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

SPD: Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection

Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions

VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Rethinking DPO-style Diffusion Aligning Frameworks

Debiased Curriculum Adaptation for Safe Transfer Learning in Chest X-ray Classification

PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing

End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation

Breaking the Encoder Barrier for Seamless Video-Language Understanding

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models

Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts

RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

Information-Bottleneck Driven Binary Neural Network for Change Detection

Entropy-Adaptive Diffusion Policy Optimization with Dynamic Step Alignment

Time-Aware Auto White Balance in Mobile Photography

GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning

Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions

Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation

VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer

VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition

Evidential Knowledge Distillation

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization

GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

GUAVA: Generalizable Upper Body 3D Gaussian Avatar

CO2-Net: A Physics-Informed Spatio-Temporal Model for Global Surface CO2 Reconstruction

PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation

Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation

OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS

GSOT3D: Towards Generic 3D Single Object Tracking in the Wild

GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers

GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting

Salvaging the Overlooked: Leveraging Class-Aware Contrastive Learning for Multi-Class Anomaly Detection

Boosting Multimodal Learning via Disentangled Gradient Learning

Task Vector Quantization for Memory-Efficient Model Merging

Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

HORT: Monocular Hand-held Objects Reconstruction with Transformers

Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning

SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

CaliMatch: Adaptive Calibration for Improving Safe Semi-supervised Learning

Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation

Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment

Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables

Diffusion-based 3D Hand Motion Recovery with Intuitive Physics

HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis

DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking

Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues

Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy

RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation

Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Tensor-aggregated LoRA in Federated Fine-tuning

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching

Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training

CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences

QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection

Backdooring Self-Supervised Contrastive Learning by Noisy Alignment

CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

Robust Dataset Condensation using Supervised Contrastive Learning

SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency

Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds

Unlocking the Potential of Diffusion Priors in Blind Face Restoration

DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Self-Supervised Sparse Sensor Fusion for Long Range Perception

Joint Asymmetric Loss for Learning with Noisy Labels

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

Implicit Counterfactual Learning for Audio-Visual Segmentation

DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models

STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

MRGen: Segmentation Data Engine For Underrepresented MRI Modalities

GAP: Gaussianize Any Point Clouds with Text Guidance

DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method

Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

AIComposer: Any Style and Content Image Composition via Feature Integration

Rethink Sparse Signals for Pose-guided Text-to-image Generation

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

MoFRR: Mixture of Diffusion Models for Face Retouching Restoration

Adversarial Reconstruction Feedback for Robust Fine-grained Generalization

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Unified Adversarial Augmentation for Improving Palmprint Recognition

Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Stylized-Face: A Million-level Stylized Face Dataset for Face Recognition

Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations

From One to More: Contextual Part Latents for 3D Generation

Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras

Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion

Enhancing Transferability of Targeted Adversarial Examples via Inverse Target Gradient Competition and Spatial Distance Stretching

LDPose: Towards Inclusive Human Pose Estimation for Limb-Deficient Individuals in the Wild

OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding

LEGION: Learning to Ground and Explain for Synthetic Image Detection

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

Images as Noisy Labels: Unleashing the Potential of the Diffusion Model for Open-Vocabulary Semantic Segmentation

ContextFace: Generating Facial Expressions from Emotional Contexts

Agreement aware and dissimilarity oriented GLOM

SMP-Attack: Boosting the Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout

Spatial-Temporal Forgery Trace based Forgery Image Identification

Towards Annotation-Free Evaluation: KPAScore for Human Keypoint Detection

Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition

MeasureXpert: Automatic Anthropometric Measurement Extraction from Two Unregistered, Partial, Posed, and Dressed Body Scans

ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting

PROL: Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning

Dual Domain Control via Active Learning for Remote Sensing Domain Incremental Object Detection

SUV: Suppressing Undesired Video Content via Semantic Modulation Based on Text Embeddings

Enpowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need

DiMPLe - Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation

DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

ResidualViT for Efficient Temporally Dense Video Encoding

VideoOrion: Tokenizing Object Dynamics in Videos

SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning

MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes

OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions

MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing

Randomized Autoregressive Visual Generation

CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation

Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency

MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance

Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

TokensGen: Harnessing Condensed Tokens for Long Video Generation

Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion

Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning

Training-free Geometric Image Editing on Diffusion Models

Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Monocular Facial Appearance Capture in the Wild

Growing a Twig to Accelerate Large Vision-Language Models

MixA: A Mixed Attention approach with Stable Lightweight Linear Attention to enhance Efficiency of Vision Transformers at the Edge

Transparent Vision: A Theory of Hierarchical Invariant Representations

AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction

FreeDance: Towards Harmonic Free-Number Group Dance Generation via a Unified Framework

Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance

Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration

RetinexMCNet: A Memory Controller Dominated Network for Low-Light Video Enhancement Based on Retinex

D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Sliced Wasserstein Bridge for Open-Vocabulary Video Instance Segmentation

Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis

KinMo: Kinematic-aware Human Motion Understanding and Generation

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion

Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding

CODA: Repurposing Continuous VAEs for Discrete Tokenization

3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

Head2Body: Body Pose Generation from Multi-sensory Head-mounted Inputs

LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection

From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation

Looking in the Mirror: A Faithful Counterfactual Explanation Method for Interpreting Deep Image Classification Models

FLSeg: Enhancing Privacy and Robustness in Federated Learning under Heterogeneous Data via Model Segmentation

Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction

Trial-Oriented Visual Rearrangement

DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing

Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization

Gradient Decomposition and Alignment for Incremental Object Detection

PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency

TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention

MSQ: Memory-Efficient Bit Sparsification Quantization

SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models

LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition

Timestep-Aware Diffusion Model for Extreme Image Rescaling

Recovering Parametric Scenes from Very Few Time-of-Flight Pixels

Adversarial Attention Perturbations for Large Object Detection Transformers

MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding

When and Where do Data Poisons Attack Textual Inversion?

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation

SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement

Generating Physically Stable and Buildable Brick Structures from Text

An Empirical Study of Autoregressive Pre-training from Videos

Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection

SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

Engage for All: Making Ordinary Image Descriptions Appealing Again!

Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images

HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

BokehDiff: Neural Lens Blur with One-Step Diffusion

VCA: Video Curious Agent for Long Video Understanding

Geometry Distributions

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Debiased Teacher for Day-to-Night Domain Adaptive Object Detection

Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning

SpikePack: Enhanced Information Flow in Spiking Neural Networks with High Hardware Compatibility

Social Debiasing for Fair Multi-modal LLMs

Hierarchy-Aware Pseudo Word Learning with Text Adaptation for Zero-Shot Composed Image Retrieval

GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments

DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

Perspective-Invariant 3D Object Detection

Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors

ARMO: Autoregressive Rigging for Multi-Category Objects

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Aligning Constraint Generation with Design Intent in Parametric CAD

Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal

ConstStyle: Robust Domain Generalization with Unified Style Transformation

CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Learning Few-Step Diffusion Models by Trajectory Distribution Matching

Aether: Geometric-Aware Unified World Modeling

ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Golden Noise for Diffusion Models: A Learning Framework

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Vision-Language Interactive Relation Mining for Open-Vocabulary Scene Graph Generation

OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation

Unified Open-World Segmentation with Multi-Modal Prompts

Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction

Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

LayerAnimate: Layer-level Control for Animation

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM

RogSplat: Robust Gaussian Splatting via Generative Priors

From Imitation to Innovation: The Emergence of AI's Unique Artistic Styles and the Challenge of Copyright Protection

Intra-modal and Cross-modal Synchronization for Audio-visual Deepfake Detection and Temporal Localization

ViSpeak: Visual Instruction Feedback in Streaming Videos

FedAGC: Federated Continual Learning with Asymmetric Gradient Correction

MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking

Federated Representation Angle Learning

MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling

CWNet: Causal Wavelet Network for Low-Light Image Enhancement

InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild

MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP

SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation

Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack

How To Make Your Cell Tracker Say "I dunno!"

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization

BlinkTrack: Feature Tracking over 80 FPS via Events and Images

DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

Open-ended Hierarchical Streaming Video Understanding with Vision Language Models

V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video

CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval

Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

Diffusion-based Source-biased Model for Single Domain Generalized Object Detection

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

Measuring the Impact of Rotation Equivariance on Aerial Object Detection

DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

A Token-level Text Image Foundation Model for Document Understanding

Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network

Enhanced Pansharpening via Quaternion Spatial-Spectral Interactions

Monocular Semantic Scene Completion via Masked Recurrent Networks

Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

Instance-Level Video Depth in Groups Beyond Occlusions

Flow Stochastic Segmentation Networks

Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Future-Aware Interaction Network For Motion Forecasting

ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions

PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling

From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning

ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Predictions

Latent Diffusion Models with Masked AutoEncoders

DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization

From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning

Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

Optical Model-Driven Sharpness Mapping for Autofocus in Small Depth-of-Field and Severe Defocus Scenarios

HyPiDecoder: Hybrid Pixel Decoder for Efficient Segmentation and Detection

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

MMAD: Multi-label Micro-Action Detection in Videos

MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration

Unified Video Generation via Next-Set Prediction in Continuous Domain

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

Visual Intention Grounding for Egocentric Assistants

Omni-scene Perception-oriented Point Cloud Geometry Enhancement for Coordinate Quantization

Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

Learning to Generalize without Bias for Open-Vocabulary Action Recognition

Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View

Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion

Training-Free Industrial Defect Generation with Diffusion Models

Feature Decomposition-Recomposition in Large Vision-Language Model for Few-Shot Class-Incremental Learning

When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training

TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation