Event-guided HDR Reconstruction with Diffusion Priors
RePoseD: Efficient Relative Pose Estimation With Known Depth Information
Lifting the Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling
MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild
GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation
PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
KDA: Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding
SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications
CARP: Coarse-to-Fine Autoregressive Prediction for Visuomotor Policy Learning
Gradient Extrapolation for Debiased Representation Learning
Memory-Efficient 4-bit Preconditioned Stochastic Optimization
X-Capture: An Open-Source Portable Device for Multi-Sensory Learning
TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
Video Color Grading via Look-Up Table Generation
MiDSummer: Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation
Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels
Leveraging Local Patch Alignment to Seam-cutting for Large Parallax Image Stitching
ETA: Energy-based Test-time Adaptation for Depth Completion
DePR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion Priors
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data
EgoM2P: Egocentric Multimodal Multitask Pretraining
GReg: Geometry-Aware Region Refinement for Sign Language Video Generation
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
NeRF Is a Valuable Assistant for 3D Gaussian Splatting
SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer
RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather
ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring
Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
MeasureXpert: Automatic Anthropometric Measurement Extraction from Two Unregistered, Partial, Posed, and Dressed Body Scans
Aligning Constraint Generation with Design Intent in Parametric CAD
MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
Federated Continual Instruction Tuning
"Principal Components" Enable A New Language of Images
Multi-identity Human Image Animation with Structural Video Diffusion
Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework
Text Embedding Knows How to Quantize Text-Guided Diffusion Models
HUMOTO: A 4D Dataset of Mocap Human Object Interactions
Epona: Autoregressive Diffusion World Model for Autonomous Driving
Visual-RFT: Visual Reinforcement Fine-Tuning
Jigsaw++: Imagining Complete Shape Prior for Object Reassembly
VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction
Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination
Region-based Cluster Discrimination for Visual Representation Learning
Online Generic Event Boundary Detection
I2VControl: Disentangled and Unified Video Motion Synthesis Control
InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
LIRA: Reasoning Reconstruction via Multimodal Large Language Models
Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D
Sim-DETR: Unlock DETR for Temporal Sentence Grounding
A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection
Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation
Depth Any Event Stream: Enhancing Event-based Monocular Depth Estimation via Dense-to-Sparse Distillation
Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling
Attention to the Burstiness in Visual Prompt Tuning!
Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding
Spectral Image Tokenizer
Selective Contrastive Learning for Weakly Supervised Affordance Grounding
V2XScenes: A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception
TeRA: Rethinking Text-driven Realistic 3D Avatar Generation
Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction
Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity
Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors
A Recipe for Generating VR Worlds from a Single Image
Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis
LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation
FlowR: Flowing from Sparse to Dense 3D Reconstructions
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
SuMa: A Subspace Mapping Approach for Complete and Effective Concept Erasure in Text-to-Image Diffusion Models
Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training
CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
imHead: A large-scale implicit morphable model for localized head modeling
Trial-Oriented Visual Rearrangement
DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
Backdoor Defense via Enhanced Splitting and Trap Isolation
Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator
TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
Multi-turn Consistent Image Editing
Egocentric Action-aware Inertial Localization in Point Clouds
Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows
Frequency-Guided Diffusion for Training-Free Text-Driven Image Translation
Learning Streaming Video Representation via Multitask Training
Unlocking the Potential of Diffusion Priors in Blind Face Restoration
Self-Equilibrated Online Data Balancing for Enhanced Concept Composition in Generation Models
Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning
Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding
ForCenNet: Foreground-Centric Network for Document Image Rectification
Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
Iterative Prompt Relocation for Distribution-Adaptive Visual Prompt Tuning
Drawing Developmental Trajectory from Cortical Surface Reconstruction
SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
HUST: High-Fidelity Unbiased Skin Tone Estimation via Texture Quantization
Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors
Time-Aware Auto White Balance in Mobile Photography
Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
Leveraging Spatial Invariance to Boost Adversarial Transferability
SDFormer: Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer
Training-Free Class Purification for Open-Vocabulary Semantic Segmentation
AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering
Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
ART: Adaptive Relation Tuning for Generalized Relation Detection
Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor
SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection
Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
Salvaging the Overlooked: Leveraging Class-Aware Contrastive Learning for Multi-Class Anomaly Detection
How Can Objects Help Video-Language Understanding?
DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation
A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Discontinuity-aware Normal Integration for Generic Central Camera Models
GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction
PHD: Personalized 3D Human Body Fitting with Point Diffusion
LEGION: Learning to Ground and Explain for Synthetic Image Detection
Generic Event Boundary Detection via Denoising Diffusion
Scalable Image Tokenization with Index Backpropagation Quantization
Counting Stacked Objects
EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients
Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
Open-ended Hierarchical Streaming Video Understanding with Vision Language Models
FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
Occlusion-robust Stylization for Drawing-based 3D Animation
Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models
PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
Robust Low-light Scene Restoration via Illumination Transition
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning
Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing
VistaDream: Sampling multiview consistent images for single-view scene reconstruction
ZIM: Zero-Shot Image Matting for Anything
GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
What to Distill? Fast Knowledge Distillation with Adaptive Sampling
EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Perspective-Invariant 3D Object Detection
Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines
One-Step Specular Highlight Removal with Adapted Diffusion Models
Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
Adversarial Exploitation of Data Diversity Improves Visual Localization
$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Model
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
MatchDiffusion: Training-free Generation of Match-Cuts
GARF: Learning Generalizable 3D Reassembly for Real-World Fractures
Borrowing Eyes for the Blind Spot: Overcoming Data Scarcity in Malicious Video Detection via Cross-Domain Retrieval Augmentation
MorphoGen: Efficient Unconditional Generation of Long-Range Projection Neuronal Morphology via a Global-to-Local Framework
AV-Flow: Transforming Text to Audio-Visual Human-like Interactions
SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
Towards Video Turing Test: Video Comprehension and Reasoning Benchmark with Complex Visual Narratives
Coupling the Generator with Teacher for Effective Data-Free Knowledge Distillation
Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Conditional Visual Autoregressive Modeling for Pathological Image Restoration
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
ROAR: Reducing Inversion Error in Generative Image Watermarking
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Unknown Text Learning for CLIP-based Few-Shot Open-set Recognition
WarpHE4D: Dense 4D Head Map toward Full Head Reconstruction
PolarAnything: Diffusion-based Polarimetric Image Synthesis
Latent Expression Generation for Referring Image Segmentation and Grounding
PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction
Allowing Oscillation Quantization: Overcoming Solution Space Limitation in Low Bit-Width Quantization
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory
Decoding Correlation-Induced Misalignment in the Stable Diffusion Workflow for Text-to-Image Generation
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection
UniRes: Universal Image Restoration for Complex Degradations
BANet: Bilateral Aggregation Network for Mobile Stereo Matching
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Leaps and Bounds: An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction
ContextFace: Generating Facial Expressions from Emotional Contexts
SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
ZeroStereo: Zero-shot Stereo Matching from Single Images
EventUPS: Uncalibrated Photometric Stereo Using an Event Camera
SpikeDiff: Zero-shot High-Quality Video Reconstruction from Sub-millisecond Chromatic Spike Streams
Learning Hierarchical Line Buffer for Image Processing
The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
Trust but Verify: Programmatic VLM Evaluation in the Wild
CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection
Structured Policy Optimization: Enhance Large Vision-Language Model via Self-referenced Dialogue
SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning
Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification
Cultural Gaps in the Long Tail of Text-to-Image Models
How To Make Your Cell Tracker Say "I dunno!"
LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation
Embodied Navigation with Auxiliary Task of Action Description Prediction
Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array
Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning
CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering
FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process
MINERVA: Evaluating Complex Video Reasoning
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation
H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
Agreement aware and dissimilarity oriented GLOM
VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Generalizable 4D Human Object Interaction Synthesis by Composing Interaction Primitives
EA-KD: Entropy-based Adaptive Knowledge Distillation
UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images
Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation
MonSTeR: a Unified Model for Motion, Scene, Text Retrieval
DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture-of-Experts Computing System on Edge
$\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection
Learning Counterfactually Decoupled Attention for Open-world Model Attribution
SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition
Mitigating Catastrophic Overfitting in Fast Adversarial Training via Label Information Elimination
Who Controls the Authorization? Invertible Networks for Copyright Protection in Text-to-Image Synthesis
IGD: Instructional Graphic Design with Multimodal Layer Generation
Breaking Grid Constraints: Dynamic Graph Reconstruction Network for Multi-organ Segmentation
From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning
TerraMind: Large-Scale Generative Multimodality for Earth Observation
S$^{2}$M$^{2}$: Scalable Stereo Matching Model for Reliable Depth Estimation
RetinexMCNet: A Memory Controller Dominated Network for Low-Light Video Enhancement Based on Retinex
ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
Scheduling Weight Transitions for Quantization-Aware Training
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Efficient Fine-Tuning of Large Models via Nested Low-Rank Adaptation
Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
Incremental Few-Shot Semantic Segmentation via Multi-Level Switchable Visual Prompts
HERO: Human Reaction Generation from Videos
SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
ReTracker: Exploring Image Matching for Robust Online Any Point Tracking
IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising
Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
G2PDiffusion: Cross-species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
Environment-Agnostic Pose: Generating Environment-independent Object Representations for 6D Pose Estimation
Visual Relation Diffusion for Human-Object Interaction Detection
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Neural Shell Texture Splatting: More Details and Fewer Primitives
Detect Anything 3D in the Wild
LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints
FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
TRNAS: A Training-Free Robust Neural Architecture Search
Met$^2$Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
Spatio-Spectral Pattern Illumination for Direct and Indirect Separation from a Single Hyperspectral Image
E-NeMF: Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes
LocalDyGS: Multi-view Global Dynamic Scene Modeling through Adaptive Local Feature Decoupling
CARIM: Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching
Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
Demeter: A Parametric Model of Crop Plant Morphology from the Real World
Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation
Fast Image Super-Resolution via Consistency Rectified Flow
Visual Modality Prompt for Adapting Vision-Language Object Detectors
Forecasting Continuous Non-Conservative Dynamical Systems in $SO(3)$
StepGRPO: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration
Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
VertexRegen: Mesh Generation with Continuous Level of Detail
Neural Architecture Search Driven by Locally Guided Diffusion for Personalized Federated Learning
Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling
SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images
Accelerating Diffusion Transformer via Gradient-Optimized Cache
Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation
Video Individual Counting for Moving Drones
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
Proactive Scene Decomposition and Reconstruction
ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement
Towards a Universal Image Degradation Model via Content-Degradation Disentanglement
Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning
EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text matching
Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM
BlueNeg: A 35mm Negative Film Dataset for Restoring Channel-Heterogeneous Deterioration
Combinative Matching for Geometric Shape Assembly
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization
Learning A Unified Template for Gait Recognition
Gait-X: Exploring X modality for Generalized Gait Recognition
OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Towards Safer and Understandable Driver Intention Prediction
Removing Out-of-Focus Reflective Flares via Color Alignment
LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement
Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge
Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks
MMAD: Multi-label Micro-Action Detection in Videos
BVINet: Unlocking Blind Video Inpainting with Zero Annotations
Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency
PEFTDiff: Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning
Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting
Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images
SHIFT: Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models
ViSpeak: Visual Instruction Feedback in Streaming Videos
Beyond Perspective: Neural 360-Degree Video Compression
SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation
A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection
BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
TopicGeo: An Efficient Unified Framework for Geolocation
GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation
Chimera: Improving Generalist Model with Domain-Specific Experts
Neural Compression for 3D Geometry Sets
Benchmarking Egocentric Visual-Inertial SLAM at City Scale
Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in Medical Image Segmentation with Learnable Prior
DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
$\text{CO}_2$-Net: A Physics-Informed Spatio-Temporal Model for Global Surface $\text{CO}_2$ Reconstruction
Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs
MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics
Loss Functions for Predictor-based Neural Architecture Search
CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor
Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
Autoregressive Denoising Score Matching is a Good Video Anomaly Detector
Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Separation for Better Integration: Disentangling Edge and Motion in Event-based Deblurring
Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning
RA-BUSSeg: Relation-aware Semi-supervised Breast Ultrasound Image Segmentation via Adjacent Propagation and Cross-layer Alignment
PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
DLF: Extreme Image Compression with Dual-generative Latent Fusion
Augmented Mass-Spring Model for Real-Time Dense Hair Simulation
GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting
A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks
Textured 3D Regenerative Morphing with 3D Diffusion Prior
GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule
Information-Bottleneck Driven Binary Neural Network for Change Detection
UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
GFPack++: Attention-Driven Gradient Fields for Optimizing 2D Irregular Packing
Stealthy Backdoor Attack in Federated Learning via Adaptive Layer-wise Gradient Alignment
HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity
Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction
DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads
InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians
RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis
Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Thermal Polarimetric Multi-view Stereo
Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation
RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
NeuFrameQ: Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation
Automated Red Teaming for Text-to-Image Models through Feedback-Guided Prompt Iteration with Vision-Language Models
Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
AstroLoc: Robust Space to Ground Image Localizer
GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation
ChartCap: Mitigating Hallucination of Dense Chart Captioning
A Unified Interpretation of Training-Time Out-of-Distribution Detection
Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition
Axis-level Symmetry Detection with Group-Equivariant Representation
Lark: Low-Rank updates after knowledge localization for Few-shot Class-Incremental Learning
Diversity-Enhanced Distribution Alignment for Dataset Distillation
CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation
Dynamic Dictionary Learning for Remote Sensing Image Segmentation
From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling
Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
Towards Fine-grained Interactive Segmentation in Images and Videos
LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion
Learnable Retrieval Enhanced Visual-Text Alignment and Fusion for Radiology Report Generation
MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance
Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
Quanta Vision: From Photons to Perception
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
Toward Material-Agnostic System Identification from Videos
MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting
EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
Can We Achieve Efficient Diffusion Without Self-Attention? Distilling Self-Attention into Convolutions
Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution
Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion
EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching
Online Dense Point Tracking with Streaming Memory
Supervised Exploratory Learning for Long-Tailed Visual Recognition
FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling
Growing a Twig to Accelerate Large Vision-Language Models
DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
An Inversion-based Measure of Memorization for Diffusion Models
The Source Image is the Best Attention for Infrared and Visible Image Fusion
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection
Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection
Open-Vocabulary Octree-Graph for 3D Scene Understanding
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Adapt Foundational Segmentation Models with Heterogeneous Searching Space
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
FIND: Few-Shot Anomaly Inspection with Normal-Only Multi-Modal Data
Medical World Model
Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting
Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting
Preacher: Paper-to-Video Agentic System
Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
OpenSubstance: A High-quality Measured Dataset of Multi-View and -Lighting Images and Shapes
Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba for End-to-end Whole Slide Image Analysis
Reference-based Super-Resolution via Image-based Retrieval-Augmented Generation Diffusion
Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration
WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection
FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs
Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting
Where, What, Why: Towards Explainable Driver Attention Prediction
ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
End-to-End Driving with Online Trajectory Evaluation via BEV World Model
HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment
Transparent Vision: A Theory of Hierarchical Invariant Representations
RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration
SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation
SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition
ISP2HRNet: Learning to Reconstruct High Resolution Image from Irregularly Sampled Pixels via Hierarchical Gradient Learning
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer
Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI
WorldScore: Unified Evaluation Benchmark for World Generation
Zero-Shot Depth Aware Image Editing with Diffusion Models
Controllable and Expressive One-Shot Video Head Swapping
GAS: Generative Avatar Synthesis from a Single Image
EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
A Unified Framework for Industrial Cel-Animation Colorization with Temporal-Structural Awareness
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
OuroMamba: A Data-Free Quantization Framework for Vision Mamba
Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness
Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild
RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning
DIVE: Taming DINO for Subject-Driven Video Editing
Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
Exploring View Consistency for Scene-Adaptive Low-Light Light Field Image Enhancement
DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering
Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery
Accelerating Diffusion Sampling via Exploiting Local Transition Coherence
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
Forgetting Through Transforming: Enabling Federated Unlearning via Class-Aware Representation Transformation
HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
VAGUE: Visual Contexts Clarify Ambiguous Expressions
MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction
OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Region-Level Data Attribution for Text-to-Image Generative Models
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors
Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability
MotionFollower: Editing Video Motion via Score-Guided Diffusion
DACoN: DINO for Anime Colorization with Any Number of Reference Images
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
Token Activation Map to Visually Explain Multimodal LLMs
FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
FreeDance: Towards Harmonic Free-Number Group Dance Generation via a Unified Framework
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
iManip: Skill-Incremental Learning for Robotic Manipulation
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance
MBTI: Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation
ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
Secure On-Device Video OOD Detection Without Backpropagation
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
TAD-E2E: A Large-scale End-to-end Autonomous Driving Dataset
Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning
MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices
Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
Towards Efficient General Feature Prediction in Masked Skeleton Modeling
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
Domain Generalizable Portrait Style Transfer
Prompt-driven Transferable Adversarial Attack on Person Re-Identification with Attribute-aware Textual Inversion
Large-scale Pre-training for Grounded Video Caption Generation
Improved Noise Schedule for Diffusion Training
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
Monocular Semantic Scene Completion via Masked Recurrent Networks
FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
TokensGen: Harnessing Condensed Tokens for Long Video Generation
StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting
RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS
Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
WINS: Winograd Structured Pruning for Fast Winograd Convolution
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime
STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene
DuCos: Duality Constrained Depth Super-Resolution via Foundation Model
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
Dataset Ownership Verification for Pre-trained Masked Models
Occupancy Learning with Spatiotemporal Memory
GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects
Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
Object-level Correlation for Few-Shot Segmentation
Debiased Curriculum Adaptation for Safe Transfer Learning in Chest X-ray Classification
DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
Multi-View 3D Point Tracking
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior
GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views
LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision Language Model
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization
TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Local Dense Logit Relations for Enhanced Knowledge Distillation
Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation
Multi-modal Segment Anything Model for Camouflaged Scene Segmentation
Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection
On-Device Diffusion Transformer Policy for Efficient Robot Manipulation
LoRD-HOI: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation
Zero-Shot Compositional Video Learning with Coding Rate Reduction
Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin
FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching
Multi-Schema Proximity Network for Composed Image Retrieval
SeHDR: Single-Exposure HDR Scene Reconstruction via 3D Gaussian Bracketing
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Self-supervised Learning of Hybrid Part-aware 3D Representation of 2D Gaussians and Superquadrics
Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
MSA$^2$: Multi-task Framework with Structure-aware and Style-adaptive Character Representation for Open-set Chinese Text Recognition
GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Training-Free Generation of Temporally Consistent Rewards from VLMs
FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models
InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving
PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
AVAM: a Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
Text-to-Any-Skeleton Motion Generation Without Retargeting
G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
AutoScape: Geometry-Consistent Long-Horizon Scene Generation
$\Phi$-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data
Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes
When Anchors Meet Cold Diffusion: A Multi-Stage Approach to Lane Detection
UAVScenes: A Multi-Modal Dataset for UAVs
Embodied Representation Alignment with Mirror Neurons
Bolt3D: Generating 3D Scenes in Seconds
An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval
Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generation
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
DDB: Diffusion Driven Balancing to Address Spurious Correlations
OneGT: One-Shot Geometry-Texture Neural Rendering for Head Avatars
Phantom: Subject-consistent video generation via cross-modal alignment
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Temperature in Cosine-based Softmax Loss
Verbalized Representation Learning for Interpretable Few-Shot Generalization
Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal
AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
CAVIS: Context-Aware Video Instance Segmentation
Music Grounding by Short Video
SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
Modeling Saliency Dataset Bias
LLM Thought Divergence and Convergence for Dialogue-Based Image Generation Control
VALLR: Visual ASR Language Model for Lip Reading
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
Omni-scene Perception-oriented Point Cloud Geometry Enhancement for Coordinate Quantization
Privacy-centric Deep Motion Retargeting for Anonymization of Skeleton-Based Motion Visualization
Training-Free Text-Guided Image Editing with Visual Autoregressive Model
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation
Prior-aware Dynamic Temporal Modeling Framework for Sequential 3D Hand Pose Estimation
Learning Large Motion Estimation from Intermediate Representations with a High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion
SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders
DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
FaceXFormer: A Unified Transformer for Facial Analysis
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
HyPiDecoder: Hybrid Pixel Decoder for Efficient Segmentation and Detection
ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts
Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
Inpaint4Drag: Drag-based Image Editing via Bidirectional Warping and Inpainting
AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization
DreamFuse: Adaptive Image Fusion with Diffusion Transformer
Event-Driven Storytelling with Multiple Lifelike Humans in a 3D scene
From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
Multi-Modal Few-Shot Temporal Action Segmentation
Timestep-Aware Diffusion Model for Extreme Image Rescaling
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
Towards Open-World Generation of Stereo Images and Unsupervised Matching
SD$^2$Actor: Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation
OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
Enhanced Pansharpening via Quaternion Spatial-Spectral Interactions
Epipolar Consistent Attention Aggregation Network for Unsupervised Light Field Disparity Estimation
Improving Noise Efficiency in Privacy-preserving Dataset Distillation
PBFG: A New Physically-Based Dataset and Removal of Lens Flares and Glares
LDIP: Long Distance Information Propagation for Video Super-Resolution
TurboVSR: Fantastic Video Upscalers and Where to Find Them
STDDNet: Harnessing Mamba for Video Polyp Segmentation via Spatial-aligned Temporal Modeling and Discriminative Dynamic Representation Learning
Diffusion-based Source-biased Model for Single Domain Generalized Object Detection
Riemannian-Geometric Fingerprints of Generative Models
InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
BlinkTrack: Feature Tracking over 80 FPS via Events and Images
Federated Representation Angle Learning
Visual Surface Wave Tomography: Revealing Subsurface Physical Properties via Visible Surface Waves
YOLOE: Real-Time Seeing Anything
SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
CWNet: Causal Wavelet Network for Low-Light Image Enhancement
Streamlining Image Editing with Layered Diffusion Brushes
MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
Stable Diffusion Models are Secretly Good at Visual In-Context Learning
Intra-modal and Cross-modal Synchronization for Audio-visual Deepfake Detection and Temporal Localization
Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation
MUNBa: Machine Unlearning via Nash Bargaining
MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
Prototype-based Contrastive Learning with Stage-wise Progressive Augmentation for Self-Supervised Fine-Grained Learning
RogSplat: Robust Gaussian Splatting via Generative Priors
FonTS: Text Rendering With Typography and Style Controls
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM
ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
Vision-Language Models Can't See the Obvious
7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting
ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis
Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training
CE-FAM: Concept-Based Explanation via Fusion of Activation Maps
Golden Noise for Diffusion Models: A Learning Framework
ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement
CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation
Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation
SFUOD: Source-Free Unknown Object Detection
Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization
SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
E-SAM: Training-Free Segment Every Entity Model
PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
ARMO: Autoregressive Rigging for Multi-Category Objects
Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
Straighten Viscous Rectified Flow via Noise Optimization
ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy
Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
SpikePack: Enhanced Information Flow in Spiking Neural Networks with High Hardware Compatibility
Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Debiased Teacher for Day-to-Night Domain Adaptive Object Detection
One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models
Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement
VCA: Video Curious Agent for Long Video Understanding
Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
DecAD: Decoupling Anomalies in Latent Space for Multi-Class Unsupervised Anomaly Detection
Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images
AJAHR: Amputated Joint Aware 3D Human Mesh Recovery
Boundary Probing for Input Privacy Protection When Using LMM Services
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
Hierarchical Divide-and-Conquer Grouping for Classification Adaptation of Pre-Trained Models
An Empirical Study of Autoregressive Pre-training from Videos
TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction
The Devil is in the Spurious Correlation: Boosting Moment Retrieval with Dynamic Learning
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition
LUSD: Localized Update Score Distillation for Text-Guided Image Editing
Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification
Error Recognition in Procedural Videos using Generalized Task Graph
PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
Gradient Decomposition and Alignment for Incremental Object Detection
Heatmap Regression without Soft-Argmax for Facial Landmark Detection
Generating Multi-Image Synthetic Data for Text-to-Image Customization
LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models
Head2Body: Body pose generation from Multi-sensory Head-mounted Inputs
3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation
P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
KinMo: Kinematic-aware Human Motion Understanding and Generation
D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion
TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration
V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
Not All Degradations Are Equal: A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution
Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance
ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
MixA: A Mixed Attention approach with Stable Lightweight Linear Attention to enhance Efficiency of Vision Transformers at the Edge
SignRep: Enhancing Self-Supervised Sign Representations
Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design
Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
Training-free Geometric Image Editing on Diffusion Models
Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning
Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion
TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
Effective Training Data Synthesis for Improving MLLM Chart Understanding
Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation
C4D: 4D Made from 3D through Dual Correspondences
Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency
Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation
Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions
Learning Beyond Still Frames: Scaling Vision-Language Models with Video
MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
Q-Norm: Robust Representation Learning via Quality-Adaptive Normalization
The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation
Motion Synthesis with Sparse and Flexible Keyjoint Control
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
ResidualViT for Efficient Temporally Dense Video Encoding
Empowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need
$\textit{FaceLift}$: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads
I2V3D: Controllable image-to-video generation with 3D guidance
StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data
CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Adversarial Purification via Super-Resolution and Diffusion
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction
Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition
Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter
Towards Annotation-Free Evaluation: KPAScore for Human Keypoint Detection
Passing the Driving Knowledge Test
SMP-Attack: Boosting the Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout
HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network
Images as Noisy Labels: Unleashing the Potential of the Diffusion Model for Open-Vocabulary Semantic Segmentation
DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
Tree-NeRV: Efficient Non-Uniform Sampling for Neural Video Representation via Tree-Structured Feature Grids
Robust Adverse Weather Removal via Spectral-based Spatial Grouping
Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion
Enhancing Transferability of Targeted Adversarial Examples via Inverse Target Gradient Competition and Spatial Distance Stretching
One-Shot Knowledge Transfer for Scalable Person Re-Identification
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
CAFA: a Controllable Automatic Foley Artist
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
Unified Adversarial Augmentation for Improving Palmprint Recognition
Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
MoFRR: Mixture of Diffusion Models for Face Retouching Restoration
TimeBooth: Disentangled Facial Invariant Representation for Diverse and Personalized Face Aging
RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications
Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method
GAP: Gaussianize Any Point Clouds with Text Guidance
MRGen: Segmentation Data Engine For Underrepresented MRI Modalities
Implicit Counterfactual Learning for Audio-Visual Segmentation
DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
Semantic-guided Camera Ray Regression for Visual Localization
SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
Joint Asymmetric Loss for Learning with Noisy Labels
Zero-shot Inexact CAD Model Alignment from a Single Image
PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
Robust Dataset Condensation using Supervised Contrastive Learning
StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
LGA-Net: Learning Local and Global Affinities for Sparse Scribble based Image Colorization
CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds
Hallucinatory Image Tokens: A Training-free EAZY Approach to Detecting and Mitigating Object Hallucinations in LVLMs
Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes
CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences
LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
HumorDB: Can AI understand graphical humor?
Tensor-aggregated LoRA in Federated Fine-tuning
Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking
Neural Multi-View Uncalibrated Photometric Stereo without Photometric Stereo Cues
Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
Enhancing Transformers Through Conditioned Embedded Tokens
Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
SpecGuard: Spectral Projection-based Advanced Invisible Watermarking
CaliMatch: Adaptive Calibration for Improving Safe Semi-supervised Learning
TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity
Toward Better Out-painting: Improving the Image Composition with Initialization Policy Model
When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack
Unfolding-Associative Encoder-Decoder Network with Progressive Alignment for Pansharpening
Task-Aware Prompt Gradient Projection for Parameter-Efficient Tuning Federated Class-Incremental Learning
Boosting Multimodal Learning via Disentangled Gradient Learning
GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS
Causality-guided Prompt Learning for Vision-language Models via Visual Granulation
HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation
Progressive Artwork Outpainting via Latent Diffusion Models
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
GUAVA: Generalizable Upper Body 3D Gaussian Avatar
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR
From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering
Ensemble Foreground Management for Unsupervised Object Discovery
Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts
RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation
ERNet: Efficient Non-Rigid Registration Network for Point Sequences
LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Web Artifact Attacks Disrupt Vision Language Models
RegionFocus: Visual Test-time Scaling for GUI Agents
Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning
VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models
SPD: Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection
INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields
Bayesian-Inspired Space-Time Superpixels
AnimalClue: Recognizing Animals by their Traces
Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection
LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning
Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities
SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
Adversarial Training for Probabilistic Robustness
CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers
PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Confound from All Sides, Distill with Resilience: Multi-Objective Adversarial Paths to Zero-Shot Robustness
FRET: Feature Redundancy Elimination for Test Time Adaptation
SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
Preserve Anything: Controllable Image Synthesis with Object Preservation
Addressing Attribute Leakage in Text Embeddings for Image Editing with Diffusion Models
Recognizing Actions from Robotic View for Natural Human-Robot Interaction
FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models
To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models
Context-Aware Academic Emotion Dataset and Benchmark
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
Reinforcement Learning-Guided Data Selection via Redundancy Assessment
Unbiased Missing-modality Multimodal Learning
SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos
OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
Deep Adaptive Unfolded Network via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening
Scalable Ranked Preference Optimization for Text-to-Image Generation
Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
TryOn-Refiner: Conditional Rectified-flow-based TryOn Refiner for More Accurate Detail Reconstruction
Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model
Continuous-Time Human Motion Field from Events
Serialization based Point Cloud Oversegmentation
Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations
X-Fusion: Introducing New Modality to Frozen Large Language Models
Refer to Any Segmentation Mask Group With Vision-Language Prompts
AlignGuard: Scalable Safety Alignment for Text-to-Image Generation
SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior
Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
${\rm \bf EYE}^{\bf 3}$: Turn Anything into Naked-eye 3D
Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation
ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning
Generalizable Object Re-Identification via Visual In-Context Prompting
MVGBench: a Comprehensive Benchmark for Multi-view Generation Models
A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds
X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling
One Last Attention for Your Vision-Language Model
Grouped Speculative Decoding for Autoregressive Image Generation
GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Long-Tailed Classification with Multi-Granularity Semantics
Global-Aware Monocular Semantic Scene Completion with State Space Models
Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers
VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching
Agent-free Breast Cancer Diagnosis and Prognosis via Latent Diffusion Enhancement
VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence
STEP-DETR: Advancing DETR-based Semi-Supervised Object Detection with Super Teacher and Pseudo-Label Guided Text Queries
DM-EFS: Dynamically Multiplexed Expanded Features Set Form for Robust and Efficient Small Object Detection
Dissecting CLIP: Decomposition with a Schur Complement-based Approach
Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
Keep Your Friends Close, and Your Enemies Farther: Distance-aware Voxel-wise Contrastive Learning for Semi-supervised Multi-organ Segmentation
Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
EVDM: Event-based Real-world Video Deblurring with Mamba
Exploiting Diffusion Prior for Task-driven Image Restoration
Capturing head avatar with hand contacts from a monocular video
Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation
MR-FIQA: Face Image Quality Assessment with Multi-Reference Representations from Synthetic Data Generation
ArtEditor: Learning Customized Instructional Image Editor from Few-Shot Examples
Incremental 3D Gaussian Localization for Image-goal Navigation
On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
Cross-Category Subjectivity Generalization for Style-Adaptive Sketch Re-ID
S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding
Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes
Hierarchical Material Recognition from Local Appearance
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
Factorized Learning for Temporally Grounded Video Language Models
Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning
CarGait: Cross-Attention based Re-ranking for Gait recognition
DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model
Heavy Labels Out! Dataset Distillation with Label Space Lightening
MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling
R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception
Flexi-FSCIL: Adaptive Knowledge Retention for Breaking the Stability-Plasticity Dilemma in Few-Shot Class-Incremental Learning
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads
Stereo Any Video: Temporally Consistent Stereo Matching
D3: Training-Free AI-Generated Video Detection Using Second-Order Features
Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation
Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors
SP$^2$T: Sparse Proxy Attention for Dual-stream Point Transformer
RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency
VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders
Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts
Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
Edicho: Consistent Image Editing in the Wild
Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer
DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models
Expressive Talking Human from Single-Image with Imperfect Priors
Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation
Correspondence-Free Fast and Robust Spherical Point Pattern Registration
ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving
ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
A Unified Framework for Motion Reasoning and Generation in Human Interaction
Improving Large Vision and Language Models by Learning from a Panel of Peers
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
AnnofreeOD: Detecting All Classes at Low Frame Rates Without Human Annotations
DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation
Auto-Controlled Image Perception in MLLMs via Visual Perception Tokens
EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints
VSSD: Vision Mamba with Non-Causal State Space Duality
MDP$^3$: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Shape of Motion: 4D Reconstruction from a Single Video
RALoc: Enhancing Outdoor LiDAR Localization via Rotation Awareness
VGGSounder: Audio-Visual Evaluations for Foundation Models
Generate, Transduct, Adapt: Iterative Transduction with VLMs
Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns
ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations
Inter Inertial Poser: Multi-Human Motion Tracking from Sparse Inertial Sensors and Pairwise Inter-Sensor Distances
SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection
Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection
OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Versatile Transition Generation with Image-to-Video Diffusion
LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
Backdoor Mitigation by Distance-Driven Detoxification
Local Scale Equivariance with Deep Equilibrium Canonicalizer in the Latent Space
ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
Temporal Rate Reduction Clustering for Human Motion Segmentation
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
MOSCATO: Predicting Multiple Object State Change Through Actions
RareCLIP: Rarity-aware Online Zero-shot Industrial Anomaly Detection
Triad: Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process
GMMamba: Group Masking Mamba for Whole Slide Image Classification
ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection
CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
Genflow3D: Generative scene flow estimation and prediction on point cloud sequences
ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues
CIARD: Cyclic Iterative Adversarial Robustness Distillation
Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
Inference-Time Diffusion Model Distillation
PossLoss: A Reliable and Sensitive Facial Landmark Detection Loss Function
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast
ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis
Visual Textualization for Image Prompted Object Detection
AnyPortal: Zero-Shot Consistent Video Background Replacement
NormalCrafter: Learning Temporally Consistent Video Normal from Video Diffusion Priors
Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting
Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization
PLAN: Proactive Low-Rank Allocation for Continual Learning
AllGCD: Leveraging All Unlabeled Data for Generalized Category Discovery
ModSkill: Physical Character Skill Modularization
UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation
DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception
Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes
COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets
Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration
Scaling Language-Free Visual Representation Learning
Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning
DisTime: Distribution-based Time Tokenizer for Temporal Localization with Video Large Language Model
Large Scale Video Continual Learning with Bootstrapped Compression
VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference
DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
All Parts Matter: A Unified Mask-Free Virtual Try-On Framework
IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement
SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures
SILO: Solving Inverse Problems with Latent Operators
FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Efficient 3D Gaussian Splatting with Compressed Model Training
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
ReCoT: Reflective Self-Correction Training for Mitigating Confirmation Bias in Large Vision-Language Models
HPSv3: Towards Full-Spectrum Human Preference Score
Active Perception Meets Rule-Guided RL: A Two-Phase Approach for Precise Object Navigation in Complex Environments
Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models
Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
Discovering Divergent Representations between Text-to-Image Models
Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
Latent Swap Joint Diffusion for 2D Long-Form Latent Generation
Membership Inference Attacks with False Discovery Rate Control
UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation
CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
A Framework for Double-Blind Federated Adaptation of Foundation Models
ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
DOGE: Towards Versatile Visual Document Grounding and Referring
Debiasing Trace Guidance: Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection
FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
DAMap: Distance-aware MapNet for High Quality HD Map Construction
You Think, You ACT: The New Task of Arbitrary Text to Motion Generation
Rethinking Layered Graphic Design Generation with a Top-Down Approach
Video-T1: Test-time Scaling for Video Generation
Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
End-to-End Multi-Modal Diffusion Mamba
KV-Edit: Training-Free Image Editing for Precise Background Preservation
SDMatte: Grafting Diffusion Models for Interactive Matting
Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning
Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
PRM: Photometric Stereo based Large Reconstruction Model
Referring Expression Comprehension for Small Objects
Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
Prototype Guided Backdoor Defense
Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training
Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing
LLM-assisted Entropy-based Adaptive Distillation for Unsupervised Fine-grained Visual Representation Learning
FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization
DiffPCI: Large Motion Point Cloud frame Interpolation with Diffusion Model
MultiModal Representation for MultiSensory Video Simulation
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations
Authentic 4D Driving Simulation with a Video Generation Model
TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions
FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
CAT: A Unified Click-and-Track Framework for Realistic Tracking
Real3D: Towards Scaling Large Reconstruction Models with Real Images
Progressive Test Time Energy Adaptation for Medical Image Segmentation
CODE-CL: Conceptor-Based Gradient Projection for Deep Continual Learning
FREE-Merging: Fourier Transform for Efficient Model Merging
OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving
HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
Balanced Sharpness-Aware Minimization for Imbalanced Regression
SL$^{2}$A-INR: Single-Layer Learnable Activation for Implicit Neural Representation
Neural Solver of Dichromatic Reflection Model for Specular Highlight Removal
REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning
Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding
RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
Face Retouching with Diffusion Data Generation and Spectral Restorement
HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding
TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests
Global and Local Entailment Learning for Natural World Imagery
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
Robust Unfolding Network for HDR imaging with Modulo Cameras
AIRA: Activation-Informed Low-Rank Adaptation for Large Models
SynTag: Enhancing the Geometric Robustness of Inversion-based Generative Image Watermarking
DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
Rectifying Magnitude Neglect in Linear Attention
Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection
ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching
Generalization-Preserved Learning: Closing the Backdoor to Catastrophic Forgetting in Continual Deepfake Detection
Make Your Training Flexible: Towards Deployment-Efficient Video Models
PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask
NeuralSVG: An Implicit Representation for Text-to-Vector Generation
WeaveSeg: Iterative Contrast-weaving and Spectral Feature-refining for Nuclei Instance Segmentation
Gaussian-based World Model: Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction
Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function
A Tiny Change, A Giant Leap: Long-Tailed Class-Incremental Learning via Geometric Prototype Alignment
GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
Blind2Sound: Self-Supervised Image Denoising without Residual Noise
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
SAMora: Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images
ClaraVid: A Holistic Scene Reconstruction Benchmark from Aerial Perspective with Delentropy-Based Complexity Profiling
ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
DiffRefine: Diffusion-based Proposal Specific Densification for Point Cloud Object Detection
A Quality-Guided Mixture of Score-fusion Experts Framework for Human Recognition
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Neuromanifold-Regularized KANs for Shape-fair Feature Representations
Identity Preserving 3D Head Stylization with Multiview Score Distillation
Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
DiffIP: Representation Fingerprints for Robust IP Protection of Diffusion Models
Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics
DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
PixelStitch: Structure-Preserving Pixel-Wise Bidirectional Warps for Unsupervised Image Stitching
A Good Teacher Adapts Their Knowledge for Distillation
Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform
TOTP: Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion
Progressive Distribution Bridging: Unsupervised Adaptation for Large-scale Pre-trained Models via Adaptive Auxiliary Data
SimBoost: Improving Real-World Driving via Simulated Hard-Case
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
CMAD: Correlation-Aware and Modalities-Aware Distillation for Multimodal Sentiment Analysis with Missing Modalities
Stochastic Gradient Estimation for Higher-Order Differentiable Rendering
Boosting Adversarial Transferability via Negative Hessian Trace Regularization
MMOne: Representing Multiple Modalities in One Scene
WIR3D: Semantic and Geometric-Aware 3D Shape Abstraction
SpectralAR: Spectral Autoregressive Visual Generation
Cassic: Towards Content-Adaptive State-Space Models for Learned Image Compression
Generative Adversarial Diffusion
Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility
Learning Neural Scene Representation from iToF Imaging
FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence
Language Driven Occupancy Prediction
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation
Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information
Wide2Long: Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation
EDM: Efficient Deep Feature Matching
Supercharging Floorplan Localization with Semantic Rays
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
Auxiliary Prompt Tuning of Vision-Language Models for Out-of-Distribution Detection
Penalizing Boundary Activation for Object Completeness in Diffusion Models
Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
Lidar Waveforms are Worth 40x128x33 Words
Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection
DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection
DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
Vector Contrastive Learning For Pixel-Wise Pre-Training In Medical Vision
AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID
Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models
MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception
Open-Unfairness Adversarial Mitigation for Generalized Deepfake Detection
Improving Multimodal Learning via Imbalanced Learning
Federated domain generalization with domain-specific soft prompts generation
Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method
AgroBench: Vision-Language Model Benchmark in Agriculture
CAP: Evaluation of Persuasive and Creative Image Generation
RANKCLIP: Ranking-Consistent Language-Image Pretraining
Efficient Concertormer for Image Deblurring and Beyond
CLIPSym: Delving into Symmetry Detection with CLIP
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
Moment Quantization for Video Temporal Grounding
OVG-HQ: Online Video Grounding with Hybrid-modal Queries
3D Mesh Editing using Masked LRMs
Few-Shot Pattern Detection via Template Matching and Regression
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
Always skip connection
NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Leveraging Prior Knowledge of Diffusion Model for Person Search
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Large Scene Generation with Cube-Absorb Discrete Diffusion
Task-Specific Zero-shot Quantization-Aware Training for Object Detection
UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Leveraging Debiased Cross-modal Attention Maps and Code-based Reasoning for Zero-shot Referring Expression Comprehension
Diffusion Image Prior
EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration
Diffusion Guided Adaptive Augmentation for Generalization in Visual Reinforcement Learning
CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
CogCM: Cognition-Inspired Contextual Modeling for Audio-Visual Speech Enhancement
Inverse 3D microscopy rendering for cell shape inference with active mesh
Multimodal Prompt Alignment for Facial Expression Recognition
Memory-Efficient Generative Models via Product Quantization
Diff$^2$I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior
AdsQA: Towards Advertisement Video Understanding
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection
Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection
Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement
DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
Lightcity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions
VSC: Visual Search Compositional Text-to-Image Diffusion Model
Flow-MIL: Constructing Highly-expressive Latent Feature Space For Whole Slide Image Classification Using Normalizing Flow
Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
Unsupervised Identification of Protein Compositions and Conformations via Implicit Content-Transformation Disentanglement
SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
TorchAdapt: Towards Light-Agnostic Real-Time Visual Perception
Staining and locking computer vision models without retraining
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Interpretable point cloud classification using multiple instance learning
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
MPBR: Multimodal Progressive Bidirectional Reasoning for Open-Set Fine-Grained Recognition
Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation
No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
Event-based Tiny Object Detection: A Benchmark Dataset and Baselines
SceneMI: Motion In-betweening for Modeling Human-Scene Interaction
SIC: Similarity-Based Interpretable Image Classification with Neural Networks
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Instruction-based Image Editing with Planning, Reasoning, and Generation
Moderating the Generalization of Score-based Generative Model
GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting
PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction
Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors
Fast Globally Optimal and Geometrically Consistent 3D Shape Matching
FE-CLIP: Frequency Enhanced CLIP Model for Zero-Shot Anomaly Detection and Segmentation
Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
Disentangled Clothed Avatar Generation with Layered Representation
DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations
Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation
SemiVisBooster: Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance
Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors
CoSMIC: Continual Self-supervised Learning for Multi-Domain Medical Imaging via Conditional Mutual Information Maximization
StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation
FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning
CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation
PseudoMapTrainer: Learning Online Mapping without HD Maps
TurboReg: TurboClique for Robust and Efficient Point Cloud Registration
FastJSMA: Accelerating Jacobian-based Saliency Map Attacks through Gradient Decoupling
Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos
FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
An Efficient Hybrid Vision Transformer for TinyML Applications
Object-centric Video Question Answering with Visual Grounding and Referring
InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
Function-centric Bayesian Network for Zero-Shot Object Goal Navigation
Differentiable Room Acoustic Rendering with Multi-View Vision Priors
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Vision-Language Model Inference
Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning
Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
Closed-Loop Transfer for Weakly-supervised Affordance Grounding
Online Language Splatting
PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
Cross-Granularity Online Optimization with Masked Compensated Information for Learned Image Compression
Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection
HORT: Monocular Hand-held Objects Reconstruction with Transformers
CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
LONG3R: Long Sequence Streaming 3D Reconstruction
Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection
Customizing Domain Adapters for Domain Generalization
Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions
On the Robustness Tradeoff in Fine-Tuning
InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow
One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation
Deeply Supervised Flow-Based Generative Models
What we need is explicit controllability: Training 3D gaze estimator using only facial images
Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation
SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions
BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis
STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints
HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes
AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
LightSwitch: Multi-view Relighting with Material-guided Diffusion
MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features
Guiding Diffusion Models with Adaptive Negative Sampling Without External Resources
Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion
YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
Uncalibrated Structure from Motion on a Sphere
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
TrackVerse: A Large-scale Dataset of Object Tracks for Visual Representation Learning
Long-Context State-Space Video World Models
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
SEAL: Semantic Aware Image Watermarking
AIM: Amending Inherent Interpretability via Self-Supervised Masking
GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Everything is a Video: Unifying Modalities through Next-Frame Prediction
MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost
AIM: Adaptive Inference of Multi-modal LLMs via Token Merging and Pruning
Generative Video Bi-flow
Pseudo-SD: Pseudo Controlled Stable Diffusion for Semi-Supervised and Cross-Domain Semantic Segmentation
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Toward Fair and Accurate Cross-Domain Medical Image Segmentation: A VLM-Driven Active Domain Adaptation Paradigm
Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images
Rethink Sparse Signals for Pose-guided Text-to-image Generation
REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting
PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
Auto-Vocabulary Semantic Segmentation
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior
Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing
Knowledge Distillation with Refined Logits
Temporal Overlapping Prediction: A Self-supervised Pre-training Method for Moving Object Segmentation
GIViC: Generative Implicit Video Compression
Gaussian Splatting with Discretized SDF for Relightable Assets
Backdoor Attacks on Neural Networks via One-Bit Flip
Your Text Encoder Can Be An Object-Level Watermarking Controller
OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM
Learning to See in the Extremely Dark
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy
Latent Diffusion Models with Masked AutoEncoders
MoSiC: Optimal-Transport Motion Trajectories for Dense Self-Supervised Learning
Similarity Memory Prior is All You Need for Medical Image Segmentation
Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing
Blind Video Super-Resolution based on Implicit Kernels
You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Underwater Visual SLAM with Depth Uncertainty and Medium Modeling
GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration
3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion
ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches
SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates
TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction
ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection
MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
Less is More: Empowering GUI Agent with Context-Aware Simplification
Mixture-of-Scores: Robust Image-Text Data Quality Score via Three Lines of Code
Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection
LDPose: Towards Inclusive Human Pose Estimation for Limb-Deficient Individuals in the Wild
TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
MotionCtrl: A Real-time Controllable Vision-Language-Motion Model
Image as an IMU: Estimating Camera Velocity from a Single Motion-Blurred Image
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
S⁴M: Boosting Semi-Supervised Instance Segmentation with Segment Anything Model
Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion
egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
A Token-level Text Image Foundation Model for Document Understanding
ZFusion: Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior
LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association
PointGAC: Geometric-Aware Codebook for Masked Point Modeling
IFAdapter: Instance feature control for grounded Text-to-Image Generation
G$^{2}$SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection
$\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models
Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers
What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
STIV: Scalable Text and Image Conditioned Video Generation
Dual-level Prototype Learning for Composite Degraded Image Restoration
MMGeo: Multimodal Compositional Geo-Localization for UAVs
QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation
From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
Test-Time Prompt Tuning for Zero-Shot Depth Completion
DMesh++: An Efficient Differentiable Mesh for Complex Shapes
Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation
Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge
Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation
MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy
ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition
Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction
ViLU: Learning Vision-Language Uncertainties for Failure Prediction
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Constraint-Aware Feature Learning for Parametric Point Cloud
Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
A$^3$GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting
MSQ: Memory-Efficient Bit Sparsification Quantization
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
Instance-Level Video Depth in Groups Beyond Occlusions
OminiControl: Minimal and Universal Control for Diffusion Transformer
Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
When and Where do Data Poisons Attack Textual Inversion?
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
MMAT-1M: A Large CoT Dataset for Multimodal Agent Tuning
Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions
ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement
Cracking Instance Jigsaw Puzzles: A Superior Alternative to Multiple Instance Learning for Whole Slide Image Analysis
MambaML: Exploring State Space Models for Multi-Label Image Classification
Probabilistic Prototype Calibration of Vision-language Models for Generalized Few-shot Semantic Segmentation
End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation
SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Self-Supervised Sparse Sensor Fusion for Long Range Perception
Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers
RayZer: A Self-supervised Large View Synthesis Model
SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration
IM360: Large-scale Indoor Mapping with 360 Cameras
LangBridge: Interpreting Image as a Combination of Language Embeddings
StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth
FedXDS: Leveraging Model Attribution Methods to counteract Data Heterogeneity in Federated Learning
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
StyleSRN: Scene Text Image Super-Resolution with Text Style Embedding
Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
BokehDiff: Neural Lens Blur with One-Step Diffusion
Superpowering Open-Vocabulary Object Detectors for X-ray Vision
CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation
Kaputt: A Large-Scale Dataset for Visual Defect Detection
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision
FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation
GaussRender: Learning 3D Occupancy with Gaussian Rendering
VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
Unified Multimodal Understanding via Byte-Pair Visual Encoding
Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior
Enhancing Numerical Prediction of MLLMs with Soft Labeling
Global Regulation and Excitation via Attention Tuning for Stereo Matching
Test-Time Retrieval-Augmented Adaptation for Vision-Language Models
VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling
VideoOrion: Tokenizing Object Dynamics in Videos
UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
Target Bias Is All You Need: Zero-Shot Debiasing of Vision-Language Models with Bias Corpus
SUV: Suppressing Undesired Video Content via Semantic Modulation Based on Text Embeddings
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Optical Model-Driven Sharpness Mapping for Autofocus in Small Depth-of-Field and Severe Defocus Scenarios
Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes
Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
Advancing Textual Prompt Learning with Anchored Attributes
Wave-MambaAD: Wavelet-driven State Space Model for Multi-class Unsupervised Anomaly Detection
Randomized Autoregressive Visual Generation
Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration
Mitigating Object Hallucinations via Sentence-Level Early Intervention
Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping
LayerAnimate: Layer-level Control for Animation
Model Explainability with Localized Soft Completeness
Holistic Tokenizer for Autoregressive Image Generation
LOTA: Bit-Planes Guided AI-Generated Image Detection
Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention
CharaConsist: Fine-Grained Consistent Character Generation
Gain-MLP: Improving HDR Gain Map Encoding via a Lightweight MLP
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Test-time Adaptation for Foundation Medical Segmentation Model Without Parametric Updates
Aligning Effective Tokens with Video Anomaly in Large Language Models
How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction
Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View
CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval
Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm
Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Personalized Federated Learning under Local Supervision
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
MH-LVC: Multi-Hypothesis Temporal Prediction for Learned Conditional Residual Video Coding
Democratizing High-Fidelity Co-Speech Gesture Video Generation
MOVE: Motion-Guided Few-Shot Video Object Segmentation
How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction
MeshMamba: State Space Models for articulated 3D mesh generation and reconstruction
StableCodec: Taming One-Step Diffusion for Extreme Image Compression
DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images
DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
HADES: Human Avatar with Dynamic Explicit Hair Strands
SparseVILA: Query-Aware Visual Sparsity Should Happen at Decoding
DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
TCFG: Truncated Classifier-Free Guidance for Efficient and Scalable Text-to-Image Acceleration
From One to More: Contextual Part Latents for 3D Generation
FlowChef: Steering of Rectified Flow Models for Controlled Generations
EEGMirror: Leveraging EEG data in the wild via Montage-Agnostic Self-Supervision for EEG to Video Decoding
mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Towards Robustness of Person Search against Corruptions
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Wasserstein Style Distribution Analysis and Transform for Stylized Image Generation
Beyond Blur: A Fluid Perspective on Generative Diffusion Models
A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization
Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
CAPTURe: Evaluating Spatial Reasoning in Vision-Language Models through Counting Occluded Objects
Auto-Regressive Transformation for Image Alignment
Learning 3D Scene Analogies with Neural Contextual Scene Maps
FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance
VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation
Attention to Neural Plagiarism: Diffusion models Can Plagiarize Your Copyrighted Images!
A Conditional Probability Framework for Compositional Zero-shot Learning
Magic Insert: Style-Aware Drag-and-Drop
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
LayerD: Decomposing Raster Graphic Designs into Layers
Contrastive Flow Matching
Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning
HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation
LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
PROL: Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning
Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration
Semi-ViM: Bidirectional State Space Model for Mitigating Label Imbalance in Semi-Supervised Learning
Controllable Latent Space Augmentation for Digital Pathology
MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
Learning Normal Flow Directly From Events
Mobile Video Diffusion
Hypergraph Clustering Network with Partial Attribute Imputation
Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
Language-Driven Multi-Label Zero-Shot Learning with Semantic Granularity
VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
VideoAds: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro
WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation
SITE: towards Spatial Intelligence Thorough Evaluation
Real-time Streaming Depth Estimation at 2K Resolution
Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining
UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
Faster and Better 3D Splatting via Group Training
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Open-World Skill Discovery from Unsegmented Demonstration Videos
HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
TransiT: Transient Transformer for Non-line-of-sight Videography
MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence
UNIS: A Unified Framework for Achieving Unbiased Neural Implicit Surfaces in Volume Rendering
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
MIEB: Massive Image Embedding Benchmark
Learning Implicit Features with Flow-Infused Transformations for Realistic Virtual Try-On
Dataset Distillation with Feature Matching through the Wasserstein Metric
Multimodal Large Language Model-Guided ISP Hyperparameter Optimization with Dynamic Preference Learning
VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild
Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection
RTMap: Real-Time Recursive Mapping with Change Detection and Localization
SPA: Efficient User-Preference Alignment against Uncertainty in Medical Image Segmentation
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-based Transformer
EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors
An Information-Theoretic Regularizer for Lossy Neural Image Compression
Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising
Teaching VLMs to Localize Specific Objects from In-context Examples
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
CODA: Repurposing Continuous VAEs for Discrete Tokenization
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
Robustifying Zero-Shot Vision Language Models by Subspaces Alignment
Less Static, More Private: Towards Transferable Privacy-Preserving Action Recognition by Generative Decoupled Learning
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Seal Your Backdoor with Variational Defense
ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
MonoSOWA: Scalable monocular 3D Object detector Without human Annotations
Joint Diffusion Models in Continual Learning
HERA: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration
Joint Self-Supervised Video Alignment and Action Segmentation
Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
From Imitation to Innovation: The Emergence of AI's Unique Artistic Styles and the Challenge of Copyright Protection
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Text2Outfit: Controllable Outfit Generation with Multimodal Language Models
Split-and-Combine: Enhancing Style Augmentation for Single Domain Generalization
GECO: Geometrically consistent embedding with lightspeed inference
GT-Loc: Unifying When and Where in Images through a Joint Embedding Space
PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions
VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition
Measuring the Impact of Rotation Equivariance on Aerial Object Detection
Advancing Visual Large Language Model for Multi-granular Versatile Perception
RESCUE: cRowd Evacuation Simulation via Controlling SDM-United charactErs
Understanding Flatness in Generative Models: Its Role and Benefits
T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
SHeaP: Self-supervised Head Geometry Predictor Learned via 2D Gaussians
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
SAS: Segment Any 3D Scene with Integrated 2D Priors
Balanced Image Stylization with Style Matching Score
Unified Open-World Segmentation with Multi-Modal Prompts
4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
Dream-to-Real: Leveraging Image Generation for Single-View Volumetric Reconstruction
Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
WIPES: Wavelet-based Visual Primitives
Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
Is CLIP ideal? No. Can we fix it? Yes!
ProbMed: A Probabilistic Framework for Medical Multimodal Binding
NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction
Long Context Tuning for Video Generation
HDR Image Generation via Gain Map Decomposed Diffusion
GameFactory: Creating New Games with Generative Interactive Videos
TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images
Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Learning Efficient and Generalizable Human Representation with Human Gaussian Model
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding
Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
LHM: Animatable Human Reconstruction from a Single Image in One Second
U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
Acknowledging Focus Ambiguity in Visual Questions
TITAN: Query-Token based Domain Adaptive Adversarial Learning
Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis
Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation
Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
Splat-based 3D Scene Reconstruction with Extreme Motion-blur
Laboring on less labors: RPCA Paradigm for Pan-sharpening
Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation
AdaDCP: Learning an Adapter with Discrete Cosine Prior for Clear-to-Adverse Domain Generalization
Self-Supervised Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
DIH-CLIP: Unleashing the Diversity of Multi-Head Self-Attention for Training-Free Open-Vocabulary Semantic Segmentation
ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting
Planar Affine Rectification from Local Changes of Scale and Orientation
Dual Domain Control via Active Learning for Remote Sensing Domain Incremental Object Detection
Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
Hierarchical 3D Scene Graphs Construction Outdoors
HouseTour: A Virtual Real Estate A(I)gent
FLSeg: Enhancing Privacy and Robustness in Federated Learning under Heterogeneous Data via Model Segmentation
IDFace: Face Template Protection for Efficient and Secure Identification
DisenQ: Disentangling Q-Former for Activity-Biometrics
Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement
Meta-Learning Dynamic Center Distance: Hard Sample Mining for Learning with Noisy Labels
OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
Understanding Museum Exhibits using Vision-Language Reasoning
MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy
Spatial Preference Rewarding for MLLMs Spatial Understanding
Certifiably Optimal Anisotropic Rotation Averaging
IRASim: A Fine-Grained World Model for Robot Manipulation
DIMO: Diverse 3D Motion Generation for Arbitrary Objects
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning
Punching Bag vs. Punching Person: Motion Transferability in Videos
VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
DAA$^\ast$: Deep Angular A Star For Image-Based Path Planning
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Semantic Discrepancy-aware Detector for Image Forgery Identification
FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
Snakes and Ladders: Two Steps Up for VideoMamba
Learning to Inference Adaptively for Multimodal Large Language Models
Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
Vision-Language Interactive Relation Mining for Open-Vocabulary Scene Graph Generation
WalkVLM: Aid Visually Impaired People Walking by Vision Language Model
DiffDoctor: Diagnosing Image Diffusion Models Before Treating
DIP: Unsupervised Dense In-Context Post-training of Visual Representations
Adding Additional Control to One-Step Diffusion with Joint Distribution Matching
Seeing the Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Towards Performance Consistency in Multi-Level Model Collaboration
Learning Few-Step Diffusion Models by Trajectory Distribution Matching
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation
Uncertainty-Aware Gradient Stabilization for Small Object Detection
CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
Radiant Foam: Real-Time Differentiable Ray Tracing
Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering
AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
GlassWizard: Harvesting Diffusion Priors for Glass Surface Detection
Open-set Cross Modal Generalization via Multimodal Unified Representation
Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework
RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Guiding Noisy Condition Diffusion Models with Score-based Discriminator Correction
Is Tracking really more challenging in First Person Egocentric Vision?
Learned Image Compression with Hierarchical Progressive Context Modeling
Trade-offs in Image Generation: How Do Different Dimensions Interact?
O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation
CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment
QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
Weakly-Supervised Learning of Dense Functional Correspondences
InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Rethinking DPO-style Diffusion Aligning Frameworks
UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments
Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition
Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation
MeshPad: Interactive Sketch-Conditioned Artist-Designed Mesh Generation and Editing
Closed Loop Optimal Transport for Unsupervised Action Segmentation
QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation
Sparse Fine-Tuning of Transformers for Generative Tasks
M$^2$EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking
LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training
PVMamba: Parallelizing Vision Mamba via Dynamic State Aggregation
Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
VRM: Knowledge Distillation via Virtual Relation Matching
Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced In-domain Knowledge Transferring
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
SALAD -- Semantics-Aware Logical Anomaly Detection
SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting
Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective
CompleteMe: Reference-based Human Image Completion
BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
Semi-supervised Concept Bottleneck Models
SAC-GNC: SAmple Consensus for adaptive Graduated Non-Convexity
DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Learning on the Go: A Meta-learning Object Navigation Model
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos
ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Multi-modal Identity Extraction
NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals
PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior
Text-guided Visual Prompt DINO for Generic Segmentation
HazeFlow: Revisit Haze Physical Model as ODE and Realistic Non-Homogeneous Haze Generation for Real-World Dehazing
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product
IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution
PixTalk: Controlling Photorealistic Image Processing and Editing with Language
DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes
Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds
RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors
After the Party: Navigating the Mapping From Color to Ambient Lighting
Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization
MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography
LBM: Latent Bridge Matching for Fast Image-to-Image Translation
MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
Knowledge-Guided Part Segmentation
Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
PASD: A Pixel-Adaptive Swarm Dynamics Approach for Unsupervised Low-Light Image Enhancement
Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
FaceShield: Defending Facial Image against Deepfake Threats
ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation
Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
$G^{2}D$: Boosting Multimodal Learning with Gradient-Guided Distillation
BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner
Monocular Facial Appearance Capture in the Wild
Hierarchical Cross-modal Prompt Learning for Vision-Language Models
Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks
GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories
Boost 3D Reconstruction using Diffusion-based Intrinsic Estimation
NATRA: Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations
MCID: Multi-aspect Copyright Infringement Detection for Generated Images
StyleKeeper: Prevent content leakage via a negative visual query guidance
SKALD: Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation
Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
What You Have is What You Track: Adaptive and Robust Multimodal Tracking
Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing
Purge-Gate: Efficient Backpropagation-Free Test-Time Adaptation for Point Clouds via Token purging
Precise Action-to-Video Generation Through Visual Action Prompts
UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification
What If: Understanding Motion Through Sparse Interactions
Visual Intention Grounding for Egocentric Assistant
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Scaling Inference-time Search with Vision Value Model for Improved Visual Comprehension
Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction
Unified Video Generation via Next-Set Prediction in Continuous Domain
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding
Temporal-aware Query Routing for Real-time Video Instance Segmentation
Elucidating Vision Feature Spaces for Multimodal Neural Decoding
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions
Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting
Streaming VideoLLMs for Real-Time Procedural Video Understanding
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
AGO: Adaptive Grounding for Open World 3D Occupancy Prediction
Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference
SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
Event-guided Unified Framework for Low-light Video Enhancement, Frame Interpolation, and Deblurring
Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator
PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection
monoVLN: Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation
Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation
Spatially-Varying Autofocus
Reverse Convolution and Its Applications to Image Restoration
Think Twice: Test-Time Reasoning for Robust CLIP Zero-Shot Classification
Explaining Human Preferences via Metrics for Structured Reconstruction
Adversarial Attention Perturbations for Large Object Detection Transformers
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention
Video Motion Graphs
MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems
FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
CountSE: Soft Exemplar Open-set Object Counting
RAGD: Regional-Aware Diffusion Model for Text-to-Image Generation
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network
WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
Anti-Tamper Protection for Unauthorized Individual Image Generation
RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes
FedAGC: Federated Continual Learning with Asymmetric Gradient Correction
Spatial-Temporal Forgery Trace based Forgery Image Identification
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models
OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models
HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization
MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion
DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization
Efficient Track Anything
Learning Visual Proxy for Compositional Zero-Shot Learning
Street Gaussians without 3D Object Tracker
RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
GVDepth: Zero-shot monocular depth estimation for ground vehicles based on probabilistic cue fusion
ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
Importance-Based Token Merging for Efficient Image and Video Generation
RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
Diffusion-Based Imaginative Coordination for Bimanual Manipulation
FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models
6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting
Referring to Any Person
MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models
Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping
ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction
A₀: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Understanding Personal Concept in Open-Vocabulary Semantic Segmentation
Unleashing Vectset Diffusion Model for Fast Shape Generation
ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration
Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond
Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking
Make Me Happier: Evoking Emotions Through Image Diffusion Models
ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
Driving Scene Synthesis on Free-form Trajectories with Generative Prior
FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
ArchiSet: Benchmarking Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
CF3: Compact and Fast 3D Feature Fields
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models
Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing
DCHM: Depth-Consistency Human Modeling for Multiview Detection
Activation Subspaces for Out-of-Distribution Detection
Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion
Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
LVBench: An Extreme Long Video Understanding Benchmark
Dataset Distillation as Data Compression: A Rate-Utility Perspective
General Compression Framework for Efficient Transformer Object Tracking
Learning Robust Image Watermarking with Lossless Cover Recovery
EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Multi-scenario Overlapping Text Segmentation with Depth Awareness
Dual-Process Image Generation
PlaneRAS: Learning Planar Primitives for 3D Plane Recovery
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories
Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration
GENMO: A GENeralist Model for Human MOtion
LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching
VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving
From Panels to Prose: Generating Literary Narratives from Comics
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
Understanding Co-speech Gestures in-the-wild
Engage for All: Making Ordinary Image Descriptions Appealing Again!
PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement
RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation
LayerLock: Non-collapsing Representation Learning with Progressive Freezing
Scalable Dual Fingerprinting for Hierarchical Attribution of Text-to-Image Models
RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
From Abyssal Darkness to Blinding Glare: A Benchmark on Extreme Exposure Correction in Real World
Entropy-Adaptive Diffusion Policy Optimization with Dynamic Step Alignment
Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection
AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion
Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction
Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion
Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent
Discretized Gaussian Representation for Tomographic Reconstruction
IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach
MAVias: Mitigate any Visual Bias
LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
Frequency-Dynamic Attention Modulation For Dense Prediction
Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
Generative Modeling of Shape-Dependent Self-Contact Human Poses
4D Visual Pre-training for Robot Learning
Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening
AIComposer: Any Style and Content Image Composition via Feature Integration
Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images
MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models
AnyI2V: Animating Any Conditional Image with Motion Control
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning
Cross-Subject Mind Decoding from Inaccurate Representations
SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers
Consistency Trajectory Matching for One-Step Generative Super-Resolution
LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Pseudo-Interaction: a Hybrid-Tower Paradigm for Text-to-Video Retrieval
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene
CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
VIPerson: Flexibly Generating Virtual Identity for Person Re-Identification
Mamba-3VL: Taming State Space Model for 3D Vision Language Learning
Scaling 3D Compositional Models for Robust Classification and Pose Estimation
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation
REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment
FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation
Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation
Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation
Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
Aether: Geometric-Aware Unified World Modeling
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives
Sparfels: Fast Reconstruction from Sparse Unposed Imagery
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
VIGFace: Virtual Identity Generation for Privacy-Free Face Recognition
Class Token as Proxy: Optimal Transport-assisted Proxy Learning for Weakly Supervised Semantic Segmentation
FDPT: Federated Discrete Prompt Tuning for Black-Box Visual-Language Models
Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds
Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering
EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
Probabilistic Point Clouds from Single-Photon LiDARs for Robust 3D Inference
Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images
DONUT: A Decoder-Only Model for Trajectory Prediction
MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild
Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing
Tile-wise vs. Image-wise: Random-Tile Loss and Training Paradigm for Gaussian Splatting
LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Coordinates
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion
UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis
S$^3$E: Self-Supervised State Estimation for Radar-Inertial System
SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
DialNav: Multi-turn Dialog Navigation with a Remote Guide
F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
A Unified Framework to BRIDGE Complete and Incomplete Deep Multi-View Clustering under Non-IID Missing Patterns
Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy
HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos
From Reusing to Forecasting: Accelerating Diffusion Models with Taylor Seers
Emulating Self-attention with Convolution for Efficient Image Super-Resolution
Free-running vs Synchronous: Single-Photon Lidar for High-flux 3D Imaging
Integrating Task-Specific and Universal Adapters for Pre-Trained Model-Based Class-Incremental Learning
From Image to Video: An Empirical Study of Diffusion Representations
DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance
Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
Principles of Visual Tokens for Efficient Video Understanding
PersonaCraft: Personalized Full-body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion
MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation
External Knowledge Injection for CLIP-Based Class-Incremental Learning
GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
Stable Score Distillation
WildSAT: Learning Satellite Image Representations from Wildlife Observations
Amodal Depth Anything: Amodal Depth Estimation in the Wild
Spatio-Temporal Control for Masked Motion Synthesis
SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
Towards Foundational Models for Single-Chip Radar
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models
Knowledge Transfer from Interactions Learning
WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image
EAMamba - Efficient All-Around Vision State Space Model for Image Restoration
Event-based Visual Vibrometry
GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance
SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
OmniVTON: Training-Free Universal Virtual Try-On
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection
SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
Enhancing Image Restoration Transformer via Adaptive Translation Equivariance
RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
CMB-ML: A Cosmic Microwave Background Dataset for the Oldest Possible Computer Vision Task
Boosting Adversarial Transferability via Residual Perturbation Attack
How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes
FlowStyler: Artistic Video Stylization via Transformation Fields Transports
CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling
Aligning Moments in Time using Video Queries
Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams
LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders
Efficient Spiking Point Mamba for Point Cloud Analysis
RobAVA: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
JPEG Processing Neural Operator for Backward-Compatible Coding
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Differentially Private Fine-Tuning of Diffusion Models
Auto-Regressively Generating Multi-View Consistent Images
Steering Guidance for Personalized Text-to-Image Diffusion Models
Co-Painter: Fine-Grained Controllable Image Stylization via Implicit Decoupling and Adaptive Injection
ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints
FlexGen: Flexible Multi-View Generation from Text and Image Inputs
ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
Latest Object Memory Management for Temporally Consistent Video Instance Segmentation
RhythmGaussian: Repurposing Generalizable Gaussian Model For Remote Physiological Measurement
Scene Graph Guided Generation: Enable Accurate Relations Generation in Text-to-Image Models via Textural Rectification
FlowTok: Flowing Seamlessly Across Text and Image Tokens
VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
Learning 4D Embodied World Models
SIMView: Long-term Autoregressive Scene Generation with Surfel-Indexed Memory of Views
Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
Performing Defocus Deblurring by Modeling its Formation Process
From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
MEH: A Multi-Style Dataset and Toolkit for Advancing Egyptian Hieroglyph Recognition
Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models
MOBIUS: Big-to-Mobile Universal Image Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
SA-MAE: A Sensor-Agnostic Masked Autoencoder for Remote Sensing Image Representation Learning
Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
Federated Continuous Category Discovery and Learning
Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
MistSense: Versatile Online Detection of Procedural and Execution Mistakes
MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars
EVT: Efficient View Transformation for Multi-Modal 3D Object Detection
M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
GWM: Towards Scalable Gaussian World Models for Robotic Manipulation
Scene Coordinate Reconstruction Priors
NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
Adversarial Robustness of Discriminative Self-Supervised Learning in Vision
Skeleton Motion Words for Unsupervised Skeleton-based Temporal Action Segmentation
Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description
Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes
MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models
OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment
Vision-Language Neural Graph Featurization for Extracting Retinal Lesions
LaCoOT: Layer Collapse through Optimal Transport
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion
ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
Controllable Weather Simulation and Removal with Video Diffusion Models
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs
HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Hybrid Layout Control for Diffusion Transformer: Fewer Annotations, Superior Aesthetics
A Real-world Display Inverse Rendering Dataset
No More Sibling Rivalry: Debiasing Human-Object Interaction Detection
DeFSS: Image-to-Mask Denoising Learning for Few-shot Segmentation
DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering
Controlling Multimodal LLMs via Reward-guided Decoding
RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control
A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision
Anomaly Detection of Integrated Circuits Package Substrates Using the Large Vision Model SAIC: Dataset Construction, Methodology, and Application
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection
Diagnosing Pretrained Models for Out-of-distribution Detection
Sibai: A Few-Shot Meta-Classifier for Poisoning Detection in Federated Learning
ShortV: Freezing Visual Tokens in Ineffective Layers of Multimodal Large Language Models
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
PLMP - Point-Line Minimal Problems for Projective SfM
SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning
Deterministic Object Pose Confidence Region Estimation
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures
HFD-Teacher: High-Frequency Depth Distillation from Depth Foundation Models for Enhanced Depth Completion
MM-IFEngine: Towards Multimodal Instruction Following
ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
Neighboring Autoregressive Modeling for Efficient Visual Generation
ZipVL: Accelerating Vision-Language Models through Dynamic Token Sparsity
GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections
MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning
EditCLIP: Representation Learning for Image Editing
MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
GaussianReg: Rapid 2D/3D Registration for Emergency Surgery via Explicit 3D Modeling with Gaussian Primitives
DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
Synthetic Video Enhances Physical Fidelity in Video Synthesis
UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields
Adversarial Reconstruction Feedback for Robust Fine-grained Generalization
InfoBridge: Balanced Multimodal Integration through Conditional Dependency Modeling
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
PosedVideo365 - A Diverse Dataset with Accurate Camera Pose
FA: Forced prompt leArning of Vision-Language Models for Out-of-Distribution Detection
Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
Continual Personalization for Diffusion Models
Synergistic Prompting for Robust Visual Recognition with Missing Modalities
PLA: Prompt Learning Attack against Text-to-Image Generative Models
TARS: Traffic-Aware Radar Scene Flow Estimation
Mitigating Geometric Degradation in Fast DownSampling via FastAdapter for Point Cloud Segmentation
On Large Multimodal Models as Open-World Image Classifiers
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Learning Interpretable Queries for Explainable Image Classification with Information Pursuit
SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer
Unlearning the Noisy Correspondence Makes CLIP More Robust
Beyond the Limits: Overcoming Negative Correlation of Activation-Based Training-Free NAS
StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion
Hierarchy-Aware Pseudo Word Learning with Text Adaptation for Zero-Shot Composed Image Retrieval
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
I Am Big, You Are Little; I Am Right, You Are Wrong
MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation
Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold
One Encoder to Rule them All: Representation Learning for Model-free Visual Reinforcement Learning using Fourier Neural Operators
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
SAM4D: Segment Anything in Camera and LiDAR Streams
Decoupled Diffusion Sparks Adaptive Scene Generation
DLFR-Gen: Diffusion-based Video Generation with Dynamic Latent Frame Rate
Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation
TACO: Taming Diffusion for in-the-wild Video Amodal Completion
DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing
EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks
Background Invariance Testing According to Semantic Proximity
CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations
Evading Data Provenance in Deep Neural Networks
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion
HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis
DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
EA-Vit: Efficient Adaptation for Elastic Vision Transformer
CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Semi-supervised Deep Transfer for Regression without Domain Alignment
VideoSetBench: Identifying and Reasoning Similarities and Differences in Similar Videos
DreamRelation: Relation-Centric Video Customization
Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
CObL: Toward Zero-Shot Ordinal Layering without User Prompting
Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning
Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
GenHaze: Pioneering Controllable One-Step Realistic Haze Generation for Real-World Dehazing
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching
Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
Pretrained Reversible Generation as Unsupervised Visual Representation Learning
FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging
Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
Sequential Gaussian Avatars with Hierarchical Motion Context
MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP
ST-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Training-Free Industrial Defect Generation with Diffusion Models
Flow Stochastic Segmentation Networks
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Less is More: Improving Motion Diffusion Models with Sparse Keyframes
CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting
Variance-Based Pruning for Accelerating and Compressing Trained Networks
Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation
DiSCO-3D: Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF
UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale
T2Bs: Text-to-Character Blendshapes via Video Generation
Scaling Laws for Native Multimodal Models
Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction
CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Synchronization of Multiple Videos in-the-wild
Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification
Inverse Image-Based Rendering for Light Field Generation from Single Images
CompCap: Improving Multimodal Large Language Models with Composite Captions
AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
RnGCam: High-speed video from rolling & global shutter measurements
VisionMath: Vision-Form Mathematical Problem-Solving
Multi-view Gaze Target Estimation
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Shot by Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Super Resolved Imaging with Adaptive Optics
Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization
3D Test-time Adaptation via Graph Spectral Driven Point Shift
Context Guided Transformer Entropy Modeling for Video Compression
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation
AFFECT: Aligning Fisheye Feature Embeddings using Calibration Tokens for Monocular Depth Estimation
Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning
Bringing RNNs Back to Efficient Open-Ended Video Understanding
GEOPARD: Geometric Pretraining for Articulation Prediction in 3D Shapes
Blind Noisy Image Deblurring Using Residual Guidance Strategy
CVPT: Cross Visual Prompt Tuning
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding
Describe Anything: Detailed Localized Image and Video Captioning
ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning
Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
Bias in Gender Bias Benchmarks: How Confounding Features Distort Evaluation
Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset
KOEnsAttack: Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles
UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
Extrapolated Urban View Synthesis Benchmark
Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images
Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Unsupervised Visible-Infrared Person Re-identification under Unpaired Settings
Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
ConstStyle: Robust Domain Generalization with Unified Style Transformation
SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models
Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding
Polarimetric Neural Field with Unified Complex-Valued Wavefunction
Representation Shift: Unifying Token Compression with FlashAttention
Teeth Reconstruction and Performance Capture Using a Phone Camera
ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning
TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
Learning Separable Fine-Grained Representation via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition
Class-Wise Federated Averaging for Efficient Personalization
Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation
SummDiff: Generative Modeling of Video Summarization with Diffusion
Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation
Bi-Level Optimization for Self-Supervised AI-Generated Face Detection
PanSt3R: Multi-view consistent panoptic segmentation
4D Gaussian Splatting SLAM
Task Vector Quantization for Memory-Efficient Model Merging
Low-Light Image Enhancement using Event-Based Illumination Estimation
Edit360: 2D Image Edits to 3D Assets from Any Angle
Estimating 2D Camera Motion with Hybrid Motion Basis
Diffusion-based 3D Hand Motion Recovery with Intuitive Physics
CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy
Generative Zoo
Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
UIPro: Unleashing Superior Interaction Capability For GUI Agents
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
DADet: Safeguarding Image Conditional Diffusion Models against Adversarial and Backdoor Attacks via Diffusion Anomaly Detection
Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
On the Recovery of Cameras from Fundamental Matrices
Human-Object Interaction from Human-Level Instructions
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
IntrinsicControlNet: Cross-distribution Image Generation with Real and Unreal
ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
SynCity: Training-Free Generation of 3D Cities
Multispectral Demosaicing via Dual Cameras
PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
CoralSRT: Revisiting Coral Reef Semantic Segmentation by Feature Rectifying via Self-supervised Guidance
MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
MCOP: Multi-UAV Collaborative Occupancy Prediction
An OpenMind for 3D medical vision self-supervised learning
SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization
Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos
NaviDet: Efficient Input-level Backdoor Detection on Text-to-Image Synthesis via Neuron Activation Variation
ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
ArtFlow: Bridging Artworks Through Time With Flow
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Cooperative Pseudo Labeling for Unsupervised Federated Classification
Boosting Class Representation via Semantically Related Instances for Robust Long-Tailed Learning with Noisy Labels
Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
Learning an Implicit Physics Model for Image-based Fluid Simulation
Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration
Denoising Token Prediction in Masked Autoregressive Models
SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
Aligning Global Semantics and Local Textures in Generative Video Enhancement
$\pi$-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?
DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Consensus-Driven Active Model Selection
Teleportraits: Training-Free People Insertion into Any Scene
PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations
CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization
Voyaging into Unbounded Dynamic Scenes from a Single View
DyGS-SLAM: Real-Time Accurate Localization and Gaussian Reconstruction for Dynamic Scenes
VACE: All-in-One Video Creation and Editing
RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text
LACONIC: A 3D Layout Adapter for Controllable Image Creation
$\chi$: Symmetry Understanding of 3D Shapes via Chirality Disentanglement
Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing
UDC-VIX: A Real-World Video Dataset for Under-Display Cameras
InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior
Knowledge Distillation for Learned Image Compression
SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency
Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models
Social Debiasing for Fair Multi-modal LLMs
GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors
CopyrightShield: Enhancing Diffusion Model Security against Copyright Infringement Attacks
Sliced Wasserstein Bridge for Open-Vocabulary Video Instance Segmentation
MagicColor: Multi-instance Sketch Colorization
MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting
COVTrack: Continuous Open-Vocabulary Multi-Object Tracking via Adaptive Multi-Cue Fusion
TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal In-Context Learning In Text-to-Image Models
CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation
VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking
ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models
Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
Learnable Logit Adjustment for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch
VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
UnZipLoRA: Separating Content and Style from a Single Image
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
DriveX: Panoptic Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
Tree Skeletonization from 3D Point Clouds by Denoising Diffusion
COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition
JailbreakDiffBench: A Comprehensive Benchmark for Jailbreaking Diffusion Models
Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
Outlier-Aware Post-Training Quantization for Image Super-Resolution
ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
Dense Policy: Bidirectional Autoregressive Learning of Actions
SplatTalk: 3D VQA with Gaussian Splatting
Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models
Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction
Boosting Dynamic Prototyping via Dual-Knowledge Clustering for Semi-Supervised Lifelong Person Re-Identification
GloPER: Unsupervised Animal Pattern Extraction from Local Reconstruction
LookOut: Real-World Humanoid Egocentric Navigation
FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
SuperDec: 3D Scene Decomposition with Superquadrics Primitives
When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training
Unsupervised Histopathological Image Semantic Segmentation with Overlapping Patches Consistency Constraint
MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps
Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning
Harnessing Input-adaptive Inference for Efficient VLN
Dual-S3D: Hierarchical Dual-Path Selective SSM-CNN for High-Fidelity Implicit Reconstruction
Geometry Distributions
EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba
Removing Cost Volumes from Optical Flow Estimators
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video
Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
Dataset Distillation via Vision-Language Category Prototype
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
Simultaneous Motion And Noise Estimation with Event Cameras
Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition
Generating Physically Stable and Buildable LEGO Designs from Text
Is Visual in-Context Learning for Compositional Medical Tasks within Reach?
EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision
SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection
Towards Physically Plausible Video Generation via VLM Planning
DiMPLe - Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation
OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding
Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
Predict, Optimize, Distill: A Self-Improving Cycle for 4D Object Understanding
Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels
Improving Rectified Flow with Boundary Conditions
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
On the Generalization of Representation Uncertainty in Earth Observation
Adversarial Robust Memory-Based Continual Learner
CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting
FusionPhys: A Flexible Framework for Fusing Complementary Sensing Modalities in Remote Physiological Measurement
LEGO-Maker: A Semantic-Driven Algorithm for Text-to-3D Generation
AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation
RoboPearls: Editable Video Simulation for Robot Manipulation
NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
Fuzzy Contrastive Decoding to Alleviate Object Hallucination in Large Vision-Language Models
PlugMark: A Plug-in Zero-Watermarking Framework for Diffusion Models
Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack
Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models
Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Beyond RGB: Adaptive Parallel Processing for RAW Object Detection
Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Information Density Principle for MLLM Benchmarks
Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
MikuDance: Animating Character Art with Mixed Motion Dynamics
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models
NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Predictions
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
X-Dancer: Expressive Music to Human Dance Video Generation
Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs
HQCLIP: Leveraging Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments
HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models
Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image
PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
AllTracker: Efficient Dense Point Tracking at High Resolution
AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
Semantic versus Identity: A Divide-and-Conquer Approach towards Adjustable Medical Image De-Identification
ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization
FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image
Dynamic Multimodal Prototype Learning in Vision-Language Models
Looking in the mirror: A faithful counterfactual explanation method for interpreting deep image classification models
One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution
Task-Decoupled Bézier Surface Constraint for Uneven Low-Light Image Enhancement
RoMo: Robust Motion Segmentation Improves Structure from Motion
UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents
SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
Evidential Knowledge Distillation
SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
HAMSt3R: Human Aware Multi-view Stereo 3D Reconstruction
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Revisiting Point Cloud Completion: Are We Ready For The Real-World?
On the Provable Importance of Gradients for Language-Assisted Image Clustering
VMBench: A Benchmark for Perception-Aligned Video Motion Generation
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
ViLLa: Video Reasoning Segmentation with Large Language Model
CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition
Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing
Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model
DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting
AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation
VideoAuteur: Towards Long Narrative Video Generation - A case study in How-to-Cook Videos
Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective
QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing
Controllable 3D Outdoor Scene Generation via Scene Graphs
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Cross-Architecture Distillation Made Simple with Redundancy Suppression
Stylized-Face: A Million-level Stylized Face Dataset for Face Recognition
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer
Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
Scaling and Taming Adversarial Training with Synthetic Data
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Find Any Part in 3D
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Future-Aware Interaction Network For Motion Forecasting
Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
A Recurrence Prior for Object Insertion and Subject-Driven Generation
PVChat: Personalized Video Chat with One-Shot Learning
Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
Two Losses, One Goal: Aligning Conflict Gradients for Semi-supervised Semantic Segmentation
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
MaskSAM: Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation
SMGDiff: Soccer Motion Generation using diffusion probabilistic models
TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
NETracer: A Topology-Aware Iterative Tracing Approach for Tubular Structure Extraction
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
Feature Decomposition-Recomposition in Large Vision-Language Model for Few-Shot Class-Incremental Learning
Benchmarking Multimodal Large Language Models Against Image Corruptions
Towards a Unified Copernicus Foundation Model for Earth Vision
RadGPT: Constructing 3D Image-Text Tumor Datasets
Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection