ICCV 2025 Events with Videos
For Tutorials and Workshops, recording is at the organizer's discretion; many Tutorials and Workshops choose not to record.
Keynotes
Meetings
Orals
- Oral 1A: Multi-Modal Learning
- Oral 1B: Structure and Motion
- Oral 2A: View Synthesis and Scene Reconstruction
- Oral 2B: Efficient Learning
- Oral 3A: Foundation Models and Representation Learning
- Oral 3B: Human Modeling
- Oral 4A: Vision + Graphics
- Oral 4B: 3D Pose Understanding
- Oral 5A: Content Generation
- Oral 5B: Applications and Evaluation
- Oral 6A: Physical Scene Perception
- Oral 6B: Segmentation and Grouping
Posters
- Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization
- Scaling and Taming Adversarial Training with Synthetic Data
- Confound from All Sides, Distill with Resilience: Multi-Objective Adversarial Paths to Zero-Shot Robustness
- Token Activation Map to Visually Explain Multimodal LLMs
- LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding
- A Tiny Change, A Giant Leap: Long-Tailed Class-Incremental Learning via Geometric Prototype Alignment
- From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
- FREE-Merging: Fourier Transform for Efficient Model Merging
- RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
- Understanding Flatness in Generative Models: Its Role and Benefits
- Lark: Low-Rank Updates After Knowledge Localization for Few-shot Class-Incremental Learning
- Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
- PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening
- ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models
- FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift
- Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers
- Diagnosing Pretrained Models for Out-of-distribution Detection
- SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers
- VALLR: Visual ASR Language Model for Lip Reading
- MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
- On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
- Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
- DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
- AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
- MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models
- Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
- Adversarial Training for Probabilistic Robustness
- A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
- VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
- Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
- KOEnsAttack: Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles
- PEFTDiff: Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning
- VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
- Multi-View 3D Point Tracking
- Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection
- Spatial Preference Rewarding for MLLMs Spatial Understanding
- Gradient Extrapolation for Debiased Representation Learning
- Effective Training Data Synthesis for Improving MLLM Chart Understanding
- ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers
- Learnable Logit Adjustment for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch
- Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation
- Moderating the Generalization of Score-based Generative Model
- Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning
- Activation Subspaces for Out-of-Distribution Detection
- TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
- BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
- Hypergraph Clustering Network with Partial Attribute Imputation
- Boosting Adversarial Transferability via Residual Perturbation Attack
- Open-set Cross Modal Generalization via Multimodal Unified Representation
- FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization
- EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients
- What to Distill? Fast Knowledge Distillation with Adaptive Sampling
- AIRA: Activation-Informed Low-Rank Adaptation for Large Models
- IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
- Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
- PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection
- FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging
- Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
- Federated Representation Angle Learning
- Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning
- Learning to Inference Adaptively for Multimodal Large Language Models
- Diffusion Guided Adaptive Augmentation for Generalization in Visual Reinforcement Learning
- Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark
- Dataset Distillation via Vision-Language Category Prototype
- Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
- Evidential Knowledge Distillation
- SHIFT: Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models
- Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
- A Conditional Probability Framework for Compositional Zero-shot Learning
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
- SMP-Attack: Boosting the Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout
- Understanding Museum Exhibits using Vision-Language Reasoning
- DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
- Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning
- DOGR: Towards Versatile Visual Document Grounding and Referring
- CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective
- Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy
- Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
- Visual Intention Grounding for Egocentric Assistants
- Tensor-aggregated LoRA in Federated Fine-tuning
- Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning
- SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
- Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
- Seal Your Backdoor with Variational Defense
- Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning
- MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
- BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
- Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
- DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
- Semi-supervised Concept Bottleneck Models
- Robust Dataset Condensation using Supervised Contrastive Learning
- MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
- TRNAS: A Training-Free Robust Neural Architecture Search
- Knowledge Distillation with Refined Logits
- Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection
- DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
- SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation
- Cooperative Pseudo Labeling for Unsupervised Federated Classification
- RANKCLIP: Ranking-Consistent Language-Image Pretraining
- Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding
- Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
- INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
- LIRA: Reasoning Reconstruction via Multimodal Large Language Models
- Joint Asymmetric Loss for Learning with Noisy Labels
- Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration
- DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
- Enhancing Numerical Prediction of MLLMs with Soft Labeling
- FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
- Backdoor Defense via Enhanced Splitting and Trap Isolation
- Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations
- ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
- Controlling Multimodal LLMs via Reward-guided Decoding
- From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
- Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
- Mitigating Catastrophic Overfitting in Fast Adversarial Training via Label Information Elimination
- LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
- GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
- Removing Cost Volumes from Optical Flow Estimators
- Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning
- FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection
- Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
- Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
- Class-Wise Federated Averaging for Efficient Personalization
- Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
- CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training
- Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing
- FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization
- Boundary Probing for Input Privacy Protection When Using LMM Services
- Flexi-FSCIL: Adaptive Knowledge Retention for Breaking the Stability-Plasticity Dilemma in Few-Shot Class-Incremental Learning
- TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
- Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
- One Encoder to Rule them All: Representation Learning for Model-free Visual Reinforcement Learning using Fourier Neural Operators
- Boosting Adversarial Transferability via Negative Hessian Trace Regularization
- NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
- MMOne: Representing Multiple Modalities in One Scene
- A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
- COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition
- GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
- I Am Big, You Are Little; I Am Right, You Are Wrong
- Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
- SplatTalk: 3D VQA with Gaussian Splatting
- VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
- DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
- AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
- SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
- ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
- ConstStyle: Robust Domain Generalization with Unified Style Transformation
- Visual Modality Prompt for Adapting Vision-Language Object Detectors
- VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders
- G2D: Boosting Multimodal Learning with Gradient-Guided Distillation
- Is Visual in-Context Learning for Compositional Medical Tasks within Reach?
- Diversity-Enhanced Distribution Alignment for Dataset Distillation
- LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
- Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models
- Online Dense Point Tracking with Streaming Memory
- Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing
- SAC-GNC: SAmple Consensus for adaptive Graduated Non-Convexity
- GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene
- PHD: Personalized 3D Human Body Fitting with Point Diffusion
- Multi-view Gaze Target Estimation
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
- UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
- Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification
- DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
- CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
- AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
- Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
- PlaneRAS: Learning Planar Primitives for 3D Plane Recovery
- TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions
- 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
- Optical Model-Driven Sharpness Mapping for Autofocus in Small Depth-of-Field and Severe Defocus Scenarios
- CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy
- Aether: Geometric-Aware Unified World Modeling
- Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs
- Combinative Matching for Geometric Shape Assembly
- PBFG: A New Physically-Based Dataset and Removal of Lens Flares and Glares
- Beyond RGB: Adaptive Parallel Processing for RAW Object Detection
- Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation
- Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes
- UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
- Princeton365: A Diverse Dataset with Accurate Camera Pose
- Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection
- Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime
- Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
- MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
- ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
- Multispectral Demosaicing via Dual Cameras
- A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
- M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
- Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge
- MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
- Voyaging into Perpetual Dynamic Scenes from a Single View
- Variance-Based Pruning for Accelerating and Compressing Trained Networks
- DCHM: Depth-Consistent Human Modeling for Multiview Detection
- PLMP - Point-Line Minimal Problems for Projective SfM
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
- PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
- Find Any Part in 3D
- NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
- STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
- Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition
- INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
- ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
- Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
- GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
- When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
- Generative Zoo
- PanSt3R: Multi-view Consistent Panoptic Segmentation
- Harnessing Input-Adaptive Inference for Efficient VLN
- Quanta Neural Networks: From Photons to Perception
- MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
- BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
- PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
- DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation
- RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
- Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels
- PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement
- Future-Aware Interaction Network For Motion Forecasting
- Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
- CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
- MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation
- Training-Free Personalization via Retrieval and Reasoning on Fingerprints
- WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
- Weakly-Supervised Learning of Dense Functional Correspondences
- Unsupervised Identification of Protein Compositions and Conformations via Implicit Content-Transformation Disentanglement
- VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
- SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
- Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
- TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
- CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
- From Abyssal Darkness to Blinding Glare: A Benchmark on Extreme Exposure Correction in Real World
- Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection
- CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
- Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
- Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm
- PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
- Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization
- 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
- OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance
- LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
- VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
- AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
- MeasureXpert: Automatic Anthropometric Measurement Extraction from Two Unregistered, Partial, Posed, and Dressed Body Scans
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
- Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
- PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
- Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
- Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
- Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
- Where am I? Cross-View Geo-localization with Natural Language Descriptions
- EventUPS: Uncalibrated Photometric Stereo Using an Event Camera
- Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation
- Learning 3D Scene Analogies with Neural Contextual Scene Maps
- WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
- AgroBench: Vision-Language Model Benchmark in Agriculture
- VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting
- Zero-shot Inexact CAD Model Alignment from a Single Image
- C4D: 4D Made from 3D through Dual Correspondences
- SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
- RoMo: Robust Motion Segmentation Improves Structure from Motion
- EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
- AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
- A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba
- Selective Contrastive Learning for Weakly Supervised Affordance Grounding
- Test-Time Retrieval-Augmented Adaptation for Vision-Language Models
- MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
- EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision
- Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
- CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
- GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
- BlinkTrack: Feature Tracking over 80 FPS via Events and Images
- Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
- From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
- VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition
- ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
- HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
- Time-Aware Auto White Balance in Mobile Photography
- OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
- PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
- InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
- PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
- Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation
- Language Driven Occupancy Prediction
- Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
- Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
- Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
- EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks
- Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
- Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
- Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
- PseudoMapTrainer: Learning Online Mapping without HD Maps
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
- MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting
- OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection
- PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
- Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
- Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
- WildSAT: Learning Satellite Image Representations from Wildlife Observations
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
- UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
- High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
- On the Generalization of Representation Uncertainty in Earth Observation
- Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
- egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
- Met2Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
- MVGBench: a Comprehensive Benchmark for Multi-view Generation Models
- MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
- O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
- Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
- Is Tracking really more challenging in First Person Egocentric Vision?
- Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
- Task-Decoupled Bézier Surface Constraint for Uneven Low-Light Image Enhancement
- Robust Low-light Scene Restoration via Illumination Transition
- SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
- IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
- EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
- Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function
- Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
- CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
- Heavy Labels Out! Dataset Distillation with Label Space Lightening
- FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
- Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
- Open-Vocabulary Octree-Graph for 3D Scene Understanding
- GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
- Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery
- Background Invariance Testing According to Semantic Proximity
- Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection
- Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
- DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
- Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
- RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
- Balanced Sharpness-Aware Minimization for Imbalanced Regression
- ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
- AnimalClue: Recognizing Animals by their Traces
- ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
- DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
- Understanding Co-speech Gestures in-the-wild
- LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning
- Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
- Expressive Talking Human from Single-Image with Imperfect Priors
- Controllable Weather Synthesis and Removal with Video Diffusion Models
- ZFusion: Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior
- RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model
- CarGait: Cross-Attention based Re-ranking for Gait recognition
- Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
- EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
- FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling
- Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising
- FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching
- PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement
- FreeDance: Towards Harmonic Free-Number Group Dance Generation via a Unified Framework
- What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
- PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
- Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
- UDC-VIT: A Real-World Video Dataset for Under-Display Cameras
- KinMo: Kinematic-aware Human Motion Understanding and Generation
- Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
- Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
- VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction
- FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
- UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments
- UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
- LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
- Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
- MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
- Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors
- FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
- Context-Aware Academic Emotion Dataset and Benchmark
- Multi-modal Identity Extraction
- MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion
- Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
- GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting
- SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
- MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
- SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation
- Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
- Towards a Unified Copernicus Foundation Model for Earth Vision
- MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
- AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation
- OneGT: One-Shot Geometry-Texture Neural Rendering for Head Avatars
- Consistency Trajectory Matching for One-Step Generative Super-Resolution
- MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
- MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
- VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
- SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
- ChartCap: Mitigating Hallucination of Dense Chart Captioning
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
- LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association
- PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
- SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
- Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
- F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration
- AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
- Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
- Learning Hierarchical Line Buffer for Image Processing
- EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow
- G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
- Blind Video Super-Resolution based on Implicit Kernels
- CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
- DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
- Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation
- SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting
- DexVLG: Dexterous Vision-Language-Grasp Model at Scale
- LayerAnimate: Layer-level Control for Animation
- SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
- DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
- GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars
- GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
- GMMamba: Group Masking Mamba for Whole Slide Image Classification
- MOVE: Motion-Guided Few-Shot Video Object Segmentation
- DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
- Separation for Better Integration: Disentangling Edge and Motion in Event-based Deblurring
- PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
- DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
- Autoregressive Denoising Score Matching is a Good Video Anomaly Detector
- VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
- DIMO: Diverse 3D Motion Generation for Arbitrary Objects
- AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation
- SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
- GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule
- SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models
- Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
- Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition
- Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
- iManip: Skill-Incremental Learning for Robotic Manipulation
- Towards a Universal Image Degradation Model via Content-Degradation Disentanglement
- V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video
- Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection
- RoboPearls: Editable Video Simulation for Robot Manipulation
- Efficient Concertormer for Image Deblurring and Beyond
- Augmented Mass-Spring Model for Real-Time Dense Hair Simulation
- FlowStyler: Artistic Video Stylization via Transformation Fields Transports
- MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
- DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
- AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
- DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads
- DisenQ: Disentangling Q-Former for Activity-Biometrics
- Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior
- UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
- DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
- Identity Preserving 3D Head Stylization with Multiview Score Distillation
- Joint Self-Supervised Video Alignment and Action Segmentation
- CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection
- MistSense: Versatile Online Detection of Procedural and Execution Mistakes
- Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars
- Blind Noisy Image Deblurring Using Residual Guidance Strategy
- PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
- PersonaCraft: Personalized and Controllable Full-Body Multi-Human Scene Generation Using Occlusion-Aware 3D-Conditioned Diffusion
- Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
- 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
- Reverse Convolution and Its Applications to Image Restoration
- Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
- Exploiting Diffusion Prior for Task-driven Image Restoration
- Learning Efficient and Generalizable Human Representation with Human Gaussian Model
- Capturing head avatar with hand contacts from a monocular video
- Skeleton Motion Words for Unsupervised Skeleton-based Temporal Action Segmentation
- Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion
- DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
- Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning
- DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions
- Human-Object Interaction from Human-Level Instructions
- Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
- MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence
- Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
- Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
- The Source Image is the Best Attention for Infrared and Visible Image Fusion
- HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation
- Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation
- AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
- GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
- Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions
- Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
- GUAVA: Generalizable Upper Body 3D Gaussian Avatar
- Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition
- General Compression Framework for Efficient Transformer Object Tracking
- ModSkill: Physical Character Skill Modularization
- MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
- Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training
- InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
- IDFace: Face Template Protection for Efficient and Secure Identification
- Generic Event Boundary Detection via Denoising Diffusion
- Efficient Track Anything
- Q-Norm: Robust Representation Learning via Quality-Adaptive Normalization
- StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
- UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
- RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
- Sequential Gaussian Avatars with Hierarchical Motion Context
- EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
- DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
- Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
- SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition
- SHeaP: Self-supervised Head Geometry Predictor Learned via 2D Gaussians
- IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution
- DreamRelation: Relation-Centric Video Customization
- Learning Streaming Video Representation via Multitask Training
- Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
- Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
- FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads
- PrimHOI: Compositional Human-Object Interaction via Reusable Primitives
- NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
- TeRA: Rethinking Text-guided Realistic 3D Avatar Generation
- InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
- Face Retouching with Diffusion Data Generation and Spectral Restoration
- Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement
- Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
- Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- Text Embedding Knows How to Quantize Text-Guided Diffusion Models
- FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
- MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment
- TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
- DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
- LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
- CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
- Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
- From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
- Wasserstein Style Distribution Analysis and Transform for Stylized Image Generation
- Certifiably Optimal Anisotropic Rotation Averaging
- LACONIC: A 3D Layout Adapter for Controllable Image Creation
- SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting
- The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
- DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
- SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
- Streamlining Image Editing with Layered Diffusion Brushes
- FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
- Split-and-Combine: Enhancing Style Augmentation for Single Domain Generalization
- Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
- Golden Noise for Diffusion Models: A Learning Framework
- CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation
- Spectral Image Tokenizer
- Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
- Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
- ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
- Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
- PlugMark: A Plug-in Zero-Watermarking Framework for Diffusion Models
- GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
- Outlier-Aware Post-Training Quantization for Image Super-Resolution
- Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
- SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
- LUSD: Localized Update Score Distillation for Text-Guided Image Editing
- IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
- Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
- Anti-Tamper Protection for Unauthorized Individual Image Generation
- AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
- Training-free Geometric Image Editing on Diffusion Models
- DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
- TokensGen: Harnessing Condensed Tokens for Long Video Generation
- DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
- PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
- DLF: Extreme Image Compression with Dual-generative Latent Fusion
- FiVE-Bench: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models
- UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
- Spatial-Temporal Forgery Trace based Forgery Image Identification
- IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models
- LEGION: Learning to Ground and Explain for Synthetic Image Detection
- Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
- Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
- Rethink Sparse Signals for Pose-guided Text-to-image Generation
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
- Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
- DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
- DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
- Pretrained Reversible Generation as Unsupervised Visual Representation Learning
- Denoising Token Prediction in Masked Autoregressive Models
- Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
- Beyond Perspective: Neural 360-Degree Video Compression
- Progressive Artwork Outpainting via Latent Diffusion Models
- M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
- Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts
- DIVE: Taming DINO for Subject-Driven Video Editing
- Text2Outfit: Controllable Outfit Generation with Multimodal Language Models
- GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
- VACE: All-in-One Video Creation and Editing
- LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
- Accelerating Diffusion Transformer via Gradient-Optimized Cache
- DDB: Diffusion Driven Balancing to Address Spurious Correlations
- Preserve Anything: Controllable Image Synthesis with Object Preservation
- Addressing Text Embedding Leakage in Diffusion-based Image Editing
- DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
- Semantic Discrepancy-aware Detector for Image Forgery Identification
- Adaptive Caching for Faster Video Generation with Diffusion Transformers
- Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
- TryOn-Refiner: Conditional Rectified-flow-based TryOn Refiner for More Accurate Detail Reconstruction
- Generative Video Bi-flow
- CompleteMe: Reference-based Human Image Completion
- UnZipLoRA: Separating Content and Style from a Single Image
- MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
- IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
- Beyond Blur: A Fluid Perspective on Generative Diffusion Models
- Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
- Frequency-Guided Diffusion for Training-Free Text-Driven Image Translation
- StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance
- Edicho: Consistent Image Editing in the Wild
- Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer
- Versatile Transition Generation with Image-to-Video Diffusion
- Neighboring Autoregressive Modeling for Efficient Visual Generation
- Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
- LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
- Textured 3D Regenerative Morphing with 3D Diffusion Prior
- AnyPortal: Zero-Shot Consistent Video Background Replacement
- Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
- Hybrid Layout Control for Diffusion Transformer: Fewer Annotations, Superior Aesthetics
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
- Mobile Video Diffusion
- Learning Robust Image Watermarking with Lossless Cover Recovery
- SummDiff: Generative Modeling of Video Summarization with Diffusion
- SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
- Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
- LEGO-Maker: A Semantic-Driven Algorithm for Text-to-3D Generation
- Magic Insert: Style-Aware Drag-and-Drop
- PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
- SDMatte: Grafting Diffusion Models for Interactive Matting
- TCFG: Truncated Classifier-Free Guidance for Efficient and Scalable Text-to-Image Acceleration
- Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models
- Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
- On Large Multimodal Models as Open-World Image Classifiers
- DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models
- Zero-Shot Depth Aware Image Editing with Diffusion Models
- FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
- Region-Level Data Attribution for Text-to-Image Generative Models
- PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask
- Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection
- Scalable Image Tokenization with Index Backpropagation Quantization
- DiffIP: Representation Fingerprints for Robust IP Protection of Diffusion Models
- Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
- Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation
- Generative Adversarial Diffusion
- GReg: Geometry-Aware Region Refinement for Sign Language Video Generation
- Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
- Deterministic Object Pose Confidence Region Estimation
- Multi-turn Consistent Image Editing
- DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
- TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
- ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection
- LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
- VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation
- Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
- CAP: Evaluation of Persuasive and Creative Image Generation
- VSC: Visual Search Compositional Text-to-Image Diffusion Model
- ART: Adaptive Relation Tuning for Generalized Relation Prediction
- Memory-Efficient Generative Models via Product Quantization
- OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
- Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement
- FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process
- EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
- Instruction-based Image Editing with Planning, Reasoning, and Generation
- Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting
- Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models
- Make Me Happier: Evoking Emotions Through Image Diffusion Models
- DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models
- Continual Personalization for Diffusion Models
- Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
- Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation
- ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba
- B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
- COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
- ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection
- Adapt Foundational Segmentation Models with Heterogeneous Searching Space
- Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
- Robustifying Zero-Shot Vision Language Models by Subspaces Alignment
- Unified Open-World Segmentation with Multi-Modal Prompts
- Counting Stacked Objects
- Aligning Moments in Time using Video Queries
- Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection
- ViLLa: Video Reasoning Segmentation with Large Language Model
- Everything is a Video: Unifying Modalities through Next-Frame Prediction
- Superpowering Open-Vocabulary Object Detectors for X-ray Vision
- ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
- Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
- Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels
- CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
- Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
- Rectifying Magnitude Neglect in Linear Attention
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
- Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation
- Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction
- When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
- Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D
- WeaveSeg: Iterative Contrast-weaving and Spectral Feature-refining for Nuclei Instance Segmentation
- AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
- Factorized Learning for Temporally Grounded Video-Language Models
- Incremental Few-Shot Semantic Segmentation via Multi-Level Switchable Visual Prompts
- CountSE: Soft Exemplar Open-set Object Counting
- Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation
- MINERVA: Evaluating Complex Video Reasoning
- Fuzzy Contrastive Decoding to Alleviate Object Hallucination in Large Vision-Language Models
- CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
- D-Attn: Decomposed Attention for Large Vision-and-Language Model
- ProbMED: A Probabilistic Framework for Medical Multimodal Binding
- LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
- Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
- VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
- MEH: A Multi-Style Dataset and Toolkit for Advancing Egyptian Hieroglyph Recognition
- YOLOE: Real-Time Seeing Anything
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
- Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
- PVChat: Personalized Video Chat with One-Shot Learning
- Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
- Emulating Self-attention with Convolution for Efficient Image Super-Resolution
- Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
- C2MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
- MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning
- RA-BUSSeg: Relation-aware Semi-supervised Breast Ultrasound Image Segmentation via Adjacent Propagation and Cross-layer Alignment
- AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images
- Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
- Multi-modal Segment Anything Model for Camouflaged Scene Segmentation
- Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- Temperature in Cosine-based Softmax Loss
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Large-scale Pre-training for Grounded Video Caption Generation
- Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator
- NETracer: A Topology-Aware Iterative Tracing Approach for Tubular Structure Extraction
- When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection
- UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
- Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
- Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
- Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
- LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
- Auto-Vocabulary Semantic Segmentation
- Efficient Fine-Tuning of Large Models via Nested Low-Rank Adaptation
- Advancing Visual Large Language Model for Multi-granular Versatile Perception
- STDDNet: Harnessing Mamba for Video Polyp Segmentation via Spatial-aligned Temporal Modeling and Discriminative Dynamic Representation Learning
- Ensemble Foreground Management for Unsupervised Object Discovery
- Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
- Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
- Task Vector Quantization for Memory-Efficient Model Merging
- Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
- Vision-Language Neural Graph Featurization for Extracting Retinal Lesions
- Is CLIP ideal? No. Can we fix it? Yes!
- Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching
- Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking
- Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
- LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
- GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
- Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
- G2PDiffusion: Cross-species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
- Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
- SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
- MRGen: Segmentation Data Engine For Underrepresented MRI Modalities
- HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
- Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
- Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation
- Few-Shot Pattern Detection via Template Matching and Regression
- Music Grounding by Short Video
- Agreement aware and dissimilarity oriented GLOM
- Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in Medical Image Segmentation with Learnable Prior
- Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
- Zero-Shot Compositional Video Learning with Coding Rate Reduction
- Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration
- CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
- SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
- CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition
- MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
- MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
- RadGPT: Constructing 3D Image-Text Tumor Datasets
- TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
- LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
- Scheduling Weight Transitions for Quantization-Aware Training
- Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
- Growing a Twig to Accelerate Large Vision-Language Models
- SignRep: Enhancing Self-Supervised Sign Representations
- AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model
- Modeling Saliency Dataset Bias
- LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
- Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis
- CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization
- Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
- MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval
- MSQ: Memory-Efficient Bit Sparsification Quantization
- SIC: Similarity-Based Interpretable Image Classification with Neural Networks
- Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations
- Training-Free Industrial Defect Generation with Diffusion Models
- The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning
- How Can Objects Help Video-Language Understanding?
- Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement
- CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
- TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
- Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
- OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
- HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos
- ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
- ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers
- Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
- Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
- ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches
- CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
- MaskSAM: Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation
- Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation
- Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
- PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
- End-to-End Multi-Modal Diffusion Mamba
- HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
- Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation
- Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
- UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation
- ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
- Soft Local Completeness: Rethinking Completeness in XAI
- Visual Textualization for Image Prompted Object Detection
- ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis
- LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
- GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- MSA2: Multi-task Framework with Structure-aware and Style-adaptive Character Representation for Open-set Chinese Text Recognition
- Object-centric Video Question Answering with Visual Grounding and Referring
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
- FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
- Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints
- VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
- Splat-based 3D Scene Reconstruction with Extreme Motion-blur
- Inverse 3D Microscopy Rendering for Cell Shape Inference with Active Mesh
- GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
- MiDSummer: Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation
- Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction
- SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
- Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging
- PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
- Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
- Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating
- Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns
- 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
- MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching
- WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
- FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
- RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
- M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking
- CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
- MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
- Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
- CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
- GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors
- RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
- Correspondence-Free Fast and Robust Spherical Point Pattern Registration
- DONUT: A Decoder-Only Model for Trajectory Prediction
- SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation
- Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues
- GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion
- Global Regulation and Excitation via Attention Tuning for Stereo Matching
- DAA*: Deep Angular A Star for Image-based Path Planning
- Decoupled Diffusion Sparks Adaptive Scene Generation
- Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
- AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
- OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
- HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
- UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images
- SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
- EVT: Efficient View Transformation for Multi-Modal 3D Object Detection
- Robust Unfolding Network for HDR Imaging with Modulo Cameras
- CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds
- Scene Coordinate Reconstruction Priors
- You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
- EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
- Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
- ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
- Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis
- MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency
- WIPES: Wavelet-based Visual Primitives
- RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
- Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
- SP2T: Sparse Proxy Attention for Dual-stream Point Transformer
- SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
- χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement
- Inverse Image-Based Rendering for Light Field Generation from Single Images
- UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
- Lifting the Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling
- GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
- Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
- Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View
- Towards Safer and Understandable Driver Intention Prediction
- Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array
- Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds
- NeuFrameQ: Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation
- Monocular Semantic Scene Completion via Masked Recurrent Networks
- AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters
- Discontinuity-aware Normal Integration for Generic Central Camera Models
- Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes
- Towards Foundational Models for Single-Chip Radar
- BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis
- Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes
- PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
- Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
- ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
- Heatmap Regression without Soft-Argmax for Facial Landmark Detection
- NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement
- Controllable 3D Outdoor Scene Generation via Scene Graphs
- Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
- 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
- Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion
- Explaining Human Preferences via Metrics for Structured 3D Reconstruction
- Axis-level Symmetry Detection with Group-Equivariant Representation
- NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals
- Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions
- Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
- Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
- VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data
- IM360: Large-scale Indoor Mapping with 360 Cameras
- Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves
- HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network
- S³E: Self-Supervised State Estimation for Radar-Inertial System
- GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
- Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing
- Global-Aware Monocular Semantic Scene Completion with State Space Models
- EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
- FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging
- Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
- SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
- MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments
- Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
- Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation
- EYE3: Turn Anything into Naked-eye 3D
- QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
- RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
- LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
- ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
- MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
- Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors
- Stochastic Gradient Estimation for Higher-Order Differentiable Rendering
- Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs
- DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection
- Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction
- Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
- InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
- ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
- A Real-world Display Inverse Rendering Dataset
- From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning
- Constraint-Aware Feature Learning for Parametric Point Cloud
- Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping
- Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
- Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
- VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving
- AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering
- SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
- SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
- MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling
- Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
- Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
- Driving View Synthesis on Free-form Trajectories with Generative Prior
- All in One: Visual-Description-Guided Unified Point Cloud Segmentation
- CF3: Compact and Fast 3D Feature Fields
- SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
- ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
- Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
- RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather
- From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
- 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
- Spatially-Varying Autofocus
- MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
- Wide2Long: Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation
- Online Language Splatting
- UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields
- DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
- DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization
- EDM: Efficient Deep Feature Matching
- Tile-wise vs. Image-wise: Random-Tile Loss and Training Paradigm for Gaussian Splatting
- NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
- GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
- SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations
- Purge-Gate: Efficient Backpropagation-Free Test-Time Adaptation for Point Clouds via Token purging
- JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
- EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
- LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
- Debiasing Trace Guidance: Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection
- NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion
Receptions
Tutorials
Workshops
- Vision-based AI for Digital Health: From Pixels to Practice
- Affective & Behavior Analysis in-the-wild
- Authenticity & Provenance in the age of Generative AI
- Computer Vision for Fashion, Art, and Design: Bridging Creativity and Responsible AI
- Systematic Trust in AI Models: Ensuring Fairness, Reliability, Explainability, and Accountability in Machine Learning Frameworks
- The Eighth International Workshop on Computer Vision for Physiological Measurement (CVPM)
- Computer Vision for Developing Countries
- Workshop on Biomedical Image and Signal Computing for Unbiasedness, Interpretability, and Trustworthiness
- 2nd Beyond Euclidean Workshop: Hyperbolic and Hyperspherical Learning for Computer Vision
- Binocular Egocentric-360 Multi-modal Scene Understanding in the Wild
- Workshop on Safe and Trustworthy Multimodal AI Systems
- Multispectral Imaging for Robotics and Automation
- The 2nd AI for Visual Arts Workshop and Challenges
- Embedded Vision Workshop
- Artificial Social Intelligence Workshop
- Multimodal Continual Learning
- Transparent & Reflective objects In the wild Challenges
- 10th International Workshop on Recovering 6D Object Pose
- Multi-Modal Foundation Models for Cancer Detection and Prevention
- The 2nd Workshop on Efficient Computing under Limited Resources: Visual Computing
- Workshop on Graphic Design Understanding and Generation
- 6th Workshop on Continual Learning in Computer Vision
- Workshop on Advanced Perception for Autonomous Healthcare
- Workshop on Cultural Continuity of Artists: Leveraging Artistic Legacies for AI-Driven Cultural Heritage
- Computer Vision for Biometrics, Identity & Behavior Science
- 9th AI City Challenge
- Closing the Loop Between Vision and Language (Decade Mark)
- Computer Vision in Advertising and Marketing
- Robust and Interactable World Models in Computer Vision
- Generative AI for Biomedical Image Analysis: Opportunities, Challenges and Futures
- Workshop on Distillation of Foundation Models for Autonomous Driving
- Personalization in Generative AI Workshop
- Generating Digital Twins from Images and Videos
- The 12th IEEE International Workshop on Analysis and Modeling of Faces and Gestures
- Ego-Exo Sensing for Smart Mobility
- What is Next in Multimodal Foundation Models?
- Embodied Spatial Reasoning
- The 3rd workshop on Binary and Extreme Quantization for Computer Vision
- Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities
- 1st Workshop and Challenge on Category-Level Object Pose Estimation for Robotic Manipulation
- BioImage Computing
- Large Scale Cross Device Localization
- 1st Workshop on Long Multi-Scene Video Foundations: Generation, Understanding and Evaluation
- Responsible Imaging
- Biometrics for Art
- 2nd Workshop on Audio-Visual Generation and Learning
- PHAROS - Adaptation, Fairness, Explainability in AI Medical Imaging
- Human-Robot-Scene Interaction and Collaboration
- International Workshop on Observing and Understanding Hands in Action