ICCV 2025 Events with Videos
Posters
- IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
- VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
- VALLR: Visual ASR Language Model for Lip Reading
- Boosting Adversarial Transferability via Residual Perturbation Attack
- MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
- Open-set Cross Modal Generalization via Multimodal Unified Representation
- FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization
- Dataset Distillation via Vision-Language Category Prototype
- Adversarial Training for Probabilistic Robustness
- AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
- Backdoor Defense via Enhanced Splitting and Trap Isolation
- Diagnosing Pretrained Models for Out-of-distribution Detection
- ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models
- PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening
- LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
- Flexi-FSCIL: Adaptive Knowledge Retention for Breaking the Stability-Plasticity Dilemma in Few-Shot Class-Incremental Learning
- Confound from All Sides, Distill with Resilience: Multi-Objective Adversarial Paths to Zero-Shot Robustness
- Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
- FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift
- Visual Modality Prompt for Adapting Vision-Language Object Detectors
- Moderating the Generalization of Score-based Generative Model
- On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
- VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
- One Encoder to Rule them All: Representation Learning for Model-free Visual Reinforcement Learning using Fourier Neural Operators
- VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
- Scaling and Taming Adversarial Training with Synthetic Data
- ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers
- Mitigating Catastrophic Overfitting in Fast Adversarial Training via Label Information Elimination
- From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
- Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
- Federated Representation Angle Learning
- TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
- Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
- Learning to Inference Adaptively for Multimodal Large Language Models
- Diffusion Guided Adaptive Augmentation for Generalization in Visual Reinforcement Learning
- A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
- NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
- Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing
- SHIFT: Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models
- DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
- RANKCLIP: Ranking-Consistent Language-Image Pretraining
- Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
- Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
- TRNAS: A Training-Free Robust Neural Architecture Search
- ConstStyle: Robust Domain Generalization with Unified Style Transformation
- Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization
- Visual Intention Grounding for Egocentric Assistants
- MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
- Understanding Museum Exhibits using Vision-Language Reasoning
- Seal Your Backdoor with Variational Defense
- FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
- SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
- RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
- CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective
- G2D: Boosting Multimodal Learning with Gradient-Guided Distillation
- Removing Cost Volumes from Optical Flow Estimators
- Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning
- BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
- Controlling Multimodal LLMs via Reward-guided Decoding
- Boundary Probing for Input Privacy Protection When Using LMM Services
- TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
- Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
- EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients
- Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection
- I Am Big, You Are Little; I Am Right, You Are Wrong
- KOEnsAttack: Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles
- Token Activation Map to Visually Explain Multimodal LLMs
- DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
- Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
- Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
- Is Visual in-Context Learning for Compositional Medical Tasks within Reach?
- MMOne: Representing Multiple Modalities in One Scene
- Class-Wise Federated Averaging for Efficient Personalization
- DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
- Cooperative Pseudo Labeling for Unsupervised Federated Classification
- Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection
- Knowledge Distillation with Refined Logits
- Hypergraph Clustering Network with Partial Attribute Imputation
- Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers
- Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
- Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
- A Tiny Change, A Giant Leap: Long-Tailed Class-Incremental Learning via Geometric Prototype Alignment
- Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
- SplatTalk: 3D VQA with Gaussian Splatting
- A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
- GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
- LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding
- AIRA: Activation-Informed Low-Rank Adaptation for Large Models
- Effective Training Data Synthesis for Improving MLLM Chart Understanding
- Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning
- Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation
- FREE-Merging: Fourier Transform for Efficient Model Merging
- DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
- Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
- LIRA: Reasoning Reconstruction via Multimodal Large Language Models
- CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training
- PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection
- Learnable Logit Adjustment for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch
- Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark
- GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
- A Conditional Probability Framework for Compositional Zero-shot Learning
- Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
- SMP-Attack: Boosting the Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout
- Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning
- Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning
- FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging
- SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation
- DOGR: Towards Versatile Visual Document Grounding and Referring
- INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
- DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
- What to Distill? Fast Knowledge Distillation with Adaptive Sampling
- Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding
- Spatial Preference Rewarding for MLLMs Spatial Understanding
- Joint Asymmetric Loss for Learning with Noisy Labels
- MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models
- Robust Dataset Condensation using Supervised Contrastive Learning
- Understanding Flatness in Generative Models: Its Role and Benefits
- From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
- SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers
- DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
- BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
- Tensor-aggregated LoRA in Federated Fine-tuning
- Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification
- Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
- Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
- Diversity-Enhanced Distribution Alignment for Dataset Distillation
- COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition
- Lark: Low-Rank Updates After Knowledge Localization for Few-shot Class-Incremental Learning
- SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
- Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
- FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection
- AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
- Activation Subspaces for Out-of-Distribution Detection
- Evidential Knowledge Distillation
- Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning
- Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration
- Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
- Enhancing Numerical Prediction of MLLMs with Soft Labeling
- Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
- PEFTDiff: Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning
- Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning
- Gradient Extrapolation for Debiased Representation Learning
- VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
- PHD: Personalized 3D Human Body Fitting with Point Diffusion
- BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
- Princeton365: A Diverse Dataset with Accurate Camera Pose
- PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
- WildSAT: Learning Satellite Image Representations from Wildlife Observations
- DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation
- RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
- Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels
- Zero-shot Inexact CAD Model Alignment from a Single Image
- Find Any Part in 3D
- Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes
- Online Dense Point Tracking with Streaming Memory
- C4D: 4D Made from 3D through Dual Correspondences
- PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
- Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
- Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
- SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
- Combinative Matching for Geometric Shape Assembly
- CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
- MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation
- Training-Free Personalization via Retrieval and Reasoning on Fingerprints
- WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
- SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
- Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs
- EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
- Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function
- From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
- 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
- RoMo: Robust Motion Segmentation Improves Structure from Motion
- CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
- EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
- MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
- Open-Vocabulary Octree-Graph for 3D Scene Understanding
- TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
- Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
- VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition
- CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
- AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
- From Abyssal Darkness to Blinding Glare: A Benchmark on Extreme Exposure Correction in Real World
- Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
- RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
- ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
- HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
- Balanced Sharpness-Aware Minimization for Imbalanced Regression
- PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
- Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
- Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery
- DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
- Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
- Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
- CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
- Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
- GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
- Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing
- Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection
- Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
- FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
- Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm
- CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
- Heavy Labels Out! Dataset Distillation with Label Space Lightening
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
- PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
- Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
- SAC-GNC: SAmple Consensus for adaptive Graduated Non-Convexity
- Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization
- DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
- BlinkTrack: Feature Tracking over 80 FPS via Events and Images
- Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition
- GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene
- IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
- Multi-view Gaze Target Estimation
- Robust Low-light Scene Restoration via Illumination Transition
- Task-Decoupled Bézier Surface Constraint for Uneven Low-Light Image Enhancement
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
- PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement
- VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
- Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
- Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
- UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
- Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification
- Generative Zoo
- Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
- AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
- Is Tracking really more challenging in First Person Egocentric Vision?
- PlaneRAS: Learning Planar Primitives for 3D Plane Recovery
- O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
- MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
- MeasureXpert: Automatic Anthropometric Measurement Extraction from Two Unregistered, Partial, Posed, and Dressed Body Scans
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
- MVGBench: a Comprehensive Benchmark for Multi-view Generation Models
- High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
- On the Generalization of Representation Uncertainty in Earth Observation
- Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
- Met2Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
- egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
- TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions
- 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
- OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance
- VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
- Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
- Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
- INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
- Time-Aware Auto White Balance in Mobile Photography
- CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy
- PBFG: A New Physically-Based Dataset and Removal of Lens Flares and Glares
- GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
- Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
- UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
- WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
- EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision
- Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation
- Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
- Learning 3D Scene Analogies with Neural Contextual Scene Maps
- NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
- AgroBench: Vision-Language Model Benchmark in Agriculture
- Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
- EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks
- VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting
- Future-Aware Interaction Network For Motion Forecasting
- PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
- UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
- Aether: Geometric-Aware Unified World Modeling
- Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection
- Where am I? Cross-View Geo-localization with Natural Language Descriptions
- LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
- MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
- Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
- OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection
- Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
- Multispectral Demosaicing via Dual Cameras
- A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba
- Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
- Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models
- Selective Contrastive Learning for Weakly Supervised Affordance Grounding
- Beyond RGB: Adaptive Parallel Processing for RAW Object Detection
- MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
- PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
- Variance-Based Pruning for Accelerating and Compressing Trained Networks
- Test-Time Retrieval-Augmented Adaptation for Vision-Language Models
- MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
- PLMP - Point-Line Minimal Problems for Projective SfM
- DCHM: Depth-Consistent Human Modeling for Multiview Detection
- STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
- CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
- OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
- Weakly-Supervised Learning of Dense Functional Correspondences
- InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
- ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
- Voyaging into Perpetual Dynamic Scenes from a Single View
- Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation
- Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge
- Language Driven Occupancy Prediction
- Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation
- M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
- ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
- PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
- Optical Model-Driven Sharpness Mapping for Autofocus in Small Depth-of-Field and Severe Defocus Scenarios
- GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
- Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
- When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
- ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
- Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
- Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
- PseudoMapTrainer: Learning Online Mapping without HD Maps
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
- Quanta Neural Networks: From Photons to Perception
- PanSt3R: Multi-view Consistent Panoptic Segmentation
- Harnessing Input-Adaptive Inference for Efficient VLN
- Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime
- Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
- AnimalClue: Recognizing Animals by their Traces
- DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
- Reverse Convolution and Its Applications to Image Restoration
- MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
- Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
- DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
- PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
- Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
- InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
- Blind Noisy Image Deblurring Using Residual Guidance Strategy
- V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video
- DisenQ: Disentangling Q-Former for Activity-Biometrics
- Sequential Gaussian Avatars with Hierarchical Motion Context
- MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
- AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
- LayerAnimate: Layer-level Control for Animation
- Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
- Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation
- CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
- SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
- PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions
- MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
- AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation
- Towards a Unified Copernicus Foundation Model for Earth Vision
- SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation
- SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
- LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
- VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
- KinMo: Kinematic-aware Human Motion Understanding and Generation
- What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
- FreeDance: Towards Harmonic Free-Number Group Dance Generation via a Unified Framework
- RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model
- Q-Norm: Robust Representation Learning via Quality-Adaptive Normalization
- FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads
- Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition
- Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
- PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
- DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads
- Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior
- DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
- 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
- Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
- Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion
- Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning
- DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions
- Human-Object Interaction from Human-Level Instructions
- Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
- HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation
- Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation
- GUAVA: Generalizable Upper Body 3D Gaussian Avatar
- InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
- Generic Event Boundary Detection via Denoising Diffusion
- StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
- UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
- Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
- Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
- PrimHOI: Compositional Human-Object Interaction via Reusable Primitives
- NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
- Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement
- LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning
- Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
- UDC-VIT: A Real-World Video Dataset for Under-Display Cameras
- FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
- Context-Aware Academic Emotion Dataset and Benchmark
- MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
- Consistency Trajectory Matching for One-Step Generative Super-Resolution
- Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
- Autoregressive Denoising Score Matching is a Good Video Anomaly Detector
- DIMO: Diverse 3D Motion Generation for Arbitrary Objects
- Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition
- iManip: Skill-Incremental Learning for Robotic Manipulation
- Exploiting Diffusion Prior for Task-driven Image Restoration
- Learning Efficient and Generalizable Human Representation with Human Gaussian Model
- Capturing head avatar with hand contacts from a monocular video
- The Source Image is the Best Attention for Infrared and Visible Image Fusion
- Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
- EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
- IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution
- CarGait: Cross-Attention based Re-ranking for Gait recognition
- PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement
- VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
- Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
- Learning Hierarchical Line Buffer for Image Processing
- GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
- Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
- FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching
- MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion
- LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association
- Expressive Talking Human from Single-Image with Imperfect Priors
- MOVE: Motion-Guided Few-Shot Video Object Segmentation
- EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
- Joint Self-Supervised Video Alignment and Action Segmentation
- SHeaP: Self-supervised Head Geometry Predictor Learned via 2D Gaussians
- Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection
- Separation for Better Integration: Disentangling Edge and Motion in Event-based Deblurring
- GMMamba: Group Masking Mamba for Whole Slide Image Classification
- DexVLG: Dexterous Vision-Language-Grasp Model at Scale
- EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
- PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
- Understanding Co-speech Gestures in-the-wild
- DreamRelation: Relation-Centric Video Customization
- MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
- General Compression Framework for Efficient Transformer Object Tracking
- MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
- Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection
- G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
- SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
- Blind Video Super-Resolution based on Implicit Kernels
- SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting
- SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
- VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
- PersonaCraft: Personalized and Controllable Full-Body Multi-Human Scene Generation Using Occlusion-Aware 3D-Conditioned Diffusion
- Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training
- Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
- RoboPearls: Editable Video Simulation for Robot Manipulation
- UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments
- F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration
- ChartCap: Mitigating Hallucination of Dense Chart Captioning
- Face Retouching with Diffusion Data Generation and Spectral Restorement
- UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
- Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model
- Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
- GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
- Skeleton Motion Words for Unsupervised Skeleton-based Temporal Action Segmentation
- CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- Identity Preserving 3D Head Stylization with Multiview Score Distillation
- Towards a Universal Image Degradation Model via Content-Degradation Disentanglement
- Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
- GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule
- AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation
- DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
- GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars
- DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
- OneGT: One-Shot Geometry-Texture Neural Rendering for Head Avatars
- Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
- FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
- TeRA: Rethinking Text-guided Realistic 3D Avatar Generation
- Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
- DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
- IDFace: Face Template Protection for Efficient and Secure Identification
- AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
- MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence
- Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
- DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
- Efficient Concertormer for Image Deblurring and Beyond
- Augmented Mass-Spring Model for Real-Time Dense Hair Simulation
- DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
- AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
- MistSense: Versatile Online Detection of Procedural and Execution Mistakes
- Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars
- Efficient Track Anything
- ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
- Controllable Weather Synthesis and Removal with Video Diffusion Models
- FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling
- Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
- Multi-modal Identity Extraction
- Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
- GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting
- SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
- SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition
- Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors
- Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions
- FlowStyler: Artistic Video Stylization via Transformation Fields Transports
- ZFusion: Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior
- MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
- Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
- DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
- Learning Streaming Video Representation via Multitask Training
- SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
- Textured 3D Regenerative Morphing with 3D Diffusion Prior
- Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
- TCFG: Truncated Classifier-Free Guidance for Efficient and Scalable Text-to-Image Acceleration
- Versatile Transition Generation with Image-to-Video Diffusion
- AnyPortal: Zero-Shot Consistent Video Background Replacement
- UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
- TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
- FiVE-Bench: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models
- Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting
- Continual Personalization for Diffusion Models
- DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
- DLF: Extreme Image Compression with Dual-generative Latent Fusion
- Streamlining Image Editing with Layered Diffusion Brushes
- PlugMark: A Plug-in Zero-Watermarking Framework for Diffusion Models
- DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
- DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
- Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models
- PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
- Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
- LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
- DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models
- Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
- On Large Multimodal Models as Open-World Image Classifiers
- GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
- FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
- Text Embedding Knows How to Quantize Text-Guided Diffusion Models
- Multi-turn Consistent Image Editing
- Deterministic Object Pose Confidence Region Estimation
- DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models
- VACE: All-in-One Video Creation and Editing
- DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
- Zero-Shot Depth Aware Image Editing with Diffusion Models
- Hybrid Layout Control for Diffusion Transformer: Fewer Annotations, Superior Aesthetics
- Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- Certifiably Optimal Anisotropic Rotation Averaging
- Outlier-Aware Post-Training Quantization for Image Super-Resolution
- GReg: Geometry-Aware Region Refinement for Sign Language Video Generation
- SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
- LUSD: Localized Update Score Distillation for Text-Guided Image Editing
- LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
- SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting
- Make Me Happier: Evoking Emotions Through Image Diffusion Models
- Accelerating Diffusion Transformer via Gradient-Optimized Cache
- FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
- DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
- Semantic Discrepancy-aware Detector for Image Forgery Identification
- Wasserstein Style Distribution Analysis and Transform for Stylized Image Generation
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
- DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
- CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
- TokensGen: Harnessing Condensed Tokens for Long Video Generation
- Adaptive Caching for Faster Video Generation with Diffusion Transformers
- Region-Level Data Attribution for Text-to-Image Generative Models
- ART: Adaptive Relation Tuning for Generalized Relation Prediction
- PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask
- Generative Adversarial Diffusion
- Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation
- Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
- IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
- Beyond Blur: A Fluid Perspective on Generative Diffusion Models
- Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis
- Training-free Geometric Image Editing on Diffusion Models
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
- DDB: Diffusion Driven Balancing to Address Spurious Correlations
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
- VSC: Visual Search Compositional Text-to-Image Diffusion Model
- Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models
- From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
- Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
- Beyond Perspective: Neural 360-Degree Video Compression
- Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
- Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
- Mobile Video Diffusion
- AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
- Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection
- Anti-Tamper Protection for Unauthorized Individual Image Generation
- StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance
- Scalable Image Tokenization with Index Backpropagation Quantization
- DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
- DiffIP: Representation Fingerprints for Robust IP Protection of Diffusion Models
- TryOn-Refiner: Conditional Rectified-flow-based TryOn Refiner for More Accurate Detail Reconstruction
- Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- Learning Robust Image Watermarking with Lossless Cover Recovery
- FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process
- Frequency-Guided Diffusion for Training-Free Text-Driven Image Translation
- Neighboring Autoregressive Modeling for Efficient Visual Generation
- SummDiff: Generative Modeling of Video Summarization with Diffusion
- Generative Video Bi-flow
- Denoising Token Prediction in Masked Autoregressive Models
- SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
- Rethink Sparse Signals for Pose-guided Text-to-image Generation
- Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
- CompleteMe: Reference-based Human Image Completion
- UnZipLoRA: Separating Content and Style from a Single Image
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
- IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
- EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
- MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
- Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
- Split-and-Combine: Enhancing Style Augmentation for Single Domain Generalization
- LACONIC: A 3D Layout Adapter for Controllable Image Creation
- Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
- Progressive Artwork Outpainting via Latent Diffusion Models
- Edicho: Consistent Image Editing in the Wild
- DIVE: Taming DINO for Subject-Driven Video Editing
- Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
- FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
- Pretrained Reversible Generation as Unsupervised Visual Representation Learning
- CAP: Evaluation of Persuasive and Creative Image Generation
- Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
- Golden Noise for Diffusion Models: A Learning Framework
- Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
- Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
- VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation
- Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
- OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
- DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
- LEGO-Maker: A Semantic-Driven Algorithm for Text-to-3D Generation
- CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation
- LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
- Text2Outfit: Controllable Outfit Generation with Multimodal Language Models
- ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection
- Spectral Image Tokenizer
- Magic Insert: Style-Aware Drag-and-Drop
- Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts
- Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
- PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
- LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
- DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
- GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
- SDMatte: Grafting Diffusion Models for Interactive Matting
- Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
- LEGION: Learning to Ground and Explain for Synthetic Image Detection
- Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer
- IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models
- Preserve Anything: Controllable Image Synthesis with Object Preservation
- ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
- Instruction-based Image Editing with Planning, Reasoning, and Generation
- Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
- Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
- MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment
- Spatial-Temporal Forgery Trace based Forgery Image Identification
- Few-Shot Pattern Detection via Template Matching and Regression
- Ensemble Foreground Management for Unsupervised Object Discovery
- Training-Free Industrial Defect Generation with Diffusion Models
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
- Aligning Moments in Time using Video Queries
- Counting Stacked Objects
- Robustifying Zero-Shot Vision Language Models by Subspaces Alignment
- Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
- Adapt Foundational Segmentation Models with Heterogeneous Searching Space
- Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
- ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection
- Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
- Task Vector Quantization for Memory-Efficient Model Merging
- COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
- B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
- Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking
- CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
- ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba
- Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
- LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
- STDDNet: Harnessing Mamba for Video Polyp Segmentation via Spatial-aligned Temporal Modeling and Discriminative Dynamic Representation Learning
- GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
- LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
- ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis
- Visual Textualization for Image Prompted Object Detection
- Vision-Language Neural Graph Featurization for Extracting Retinal Lesions
- Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis
- ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
- Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
- UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation
- MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval
- Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
- SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
- MaskSAM: Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation
- ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers
- MRGen: Segmentation Data Engine For Underrepresented MRI Modalities
- HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
- HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos
- TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
- Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
- Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
- OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
- SIC: Similarity-Based Interpretable Image Classification with Neural Networks
- Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation
- ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
- Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
- Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
- CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
- Music Grounding by Short Video
- PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
- End-to-End Multi-Modal Diffusion Mamba
- Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
- CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
- Soft Local Completeness: Rethinking Completeness in XAI
- Agreement aware and dissimilarity oriented GLOM
- MSA2: Multi-task Framework with Structure-aware and Style-adaptive Character Representation for Open-set Chinese Text Recognition
- Zero-Shot Compositional Video Learning with Coding Rate Reduction
- VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
- CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition
- MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
- MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
- ViLLa: Video Reasoning Segmentation with Large Language Model
- Superpowering Open-Vocabulary Object Detectors for X-ray Vision
- TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
- LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
- Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching
- CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
- Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
- Rectifying Magnitude Neglect in Linear Attention
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
- Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation
- Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction
- Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
- WeaveSeg: Iterative Contrast-weaving and Spectral Feature-refining for Nuclei Instance Segmentation
- Growing a Twig to Accelerate Large Vision-Language Models
- CountSE: Soft Exemplar Open-set Object Counting
- SignRep: Enhancing Self-Supervised Sign Representations
- D-Attn: Decomposed Attention for Large Vision-and-Language Model
- ProbMED: A Probabilistic Framework for Medical Multimodal Binding
- LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
- Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
- MEH: A Multi-Style Dataset and Toolkit for Advancing Egyptian Hieroglyph Recognition
- HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
- Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
- Object-centric Video Question Answering with Visual Grounding and Referring
- AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model
- Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
- Modeling Saliency Dataset Bias
- Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
- RA-BUSSeg: Relation-aware Semi-supervised Breast Ultrasound Image Segmentation via Adjacent Propagation and Cross-layer Alignment
- AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images
- LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
- Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
- CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization
- Multi-modal Segment Anything Model for Camouflaged Scene Segmentation
- Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- Temperature in Cosine-based Softmax Loss
- Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
- MSQ: Memory-Efficient Bit Sparsification Quantization
- Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations
- NETracer: A Topology-Aware Iterative Tracing Approach for Tubular Structure Extraction
- The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning
- How Can Objects Help Video-Language Understanding?
- Efficient Fine-Tuning of Large Models via Nested Low-Rank Adaptation
- Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement
- Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
- Is CLIP ideal? No. Can we fix it? Yes!
- Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching
- Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation
- G2PDiffusion: Cross-species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
- Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
- Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training
- Unified Open-World Segmentation with Multi-Modal Prompts
- UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
- Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
- LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
- Everything is a Video: Unifying Modalities through Next-Frame Prediction
- ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
- Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in Medical Image Segmentation with Learnable Prior
- C2MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
- MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning
- Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
- PVChat: Personalized Video Chat with One-Shot Learning
- Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation
- CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
- MINERVA: Evaluating Complex Video Reasoning
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation
- SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
- YOLOE: Real-Time Seeing Anything
- Incremental Few-Shot Semantic Segmentation via Multi-Level Switchable Visual Prompts
- Factorized Learning for Temporally Grounded Video-Language Models
- AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
- Large-scale Pre-training for Grounded Video Caption Generation
- Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D
- When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
- Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator
- Emulating Self-attention with Convolution for Efficient Image Super-Resolution
- RadGPT: Constructing 3D Image-Text Tumor Datasets
- When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection
- Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels
- Auto-Vocabulary Semantic Segmentation
- Scheduling Weight Transitions for Quantization-Aware Training
- Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
- Advancing Visual Large Language Model for Multi-granular Versatile Perception
- FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
- 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
- Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping
- ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
- OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
- DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection
- Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
- HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network
- GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
- Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
- Purge-Gate: Efficient Backpropagation-Free Test-Time Adaptation for Point Clouds via Token Purging
- PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
- RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
- M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking
- GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
- Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
- IM360: Large-scale Indoor Mapping with 360 Cameras
- χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement
- Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors
- Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes
- Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- All in One: Visual-Description-Guided Unified Point Cloud Segmentation
- MiDSummer: Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation
- FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging
- Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion
- Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
- Axis-level Symmetry Detection with Group-Equivariant Representation
- Monocular Semantic Scene Completion via Masked Recurrent Networks
- Controllable 3D Outdoor Scene Generation via Scene Graphs
- RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters
- DAA*: Deep Angular A Star for Image-based Path Planning
- SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
- SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations
- InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
- PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
- DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
- Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
- Correspondence-Free Fast and Robust Spherical Point Pattern Registration
- Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
- Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging
- JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
- CF3: Compact and Fast 3D Feature Fields
- 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
- SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
- Decoupled Diffusion Sparks Adaptive Scene Generation
- Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array
- Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
- GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
- QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
- EYE3: Turn Anything into Naked-eye 3D
- From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
- Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation
- ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
- Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
- Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns
- FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
- SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
- RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
- Scene Coordinate Reconstruction Priors
- 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
- NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement
- UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields
- Spatially-Varying Autofocus
- EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
- AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion
- MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
- UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
- Global-Aware Monocular Semantic Scene Completion with State Space Models
- MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments
- Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing
- VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data
- Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction
- Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
- Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
- Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds
- MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling
- Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
- Explaining Human Preferences via Metrics for Structured 3D Reconstruction
- WIPES: Wavelet-based Visual Primitives
- SP2T: Sparse Proxy Attention for Dual-stream Point Transformer
- RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
- Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
- EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
- Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis
- GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
- CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds
- Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes
- Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
- AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
- MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency
- S$^3$E: Self-Supervised State Estimation for Radar-Inertial System
- DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization
- Towards Foundational Models for Single-Chip Radar
- BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis
- NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
- From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning
- CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
- WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
- Tile-wise vs. Image-wise: Random-Tile Loss and Training Paradigm for Gaussian Splatting
- SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
- Constraint-Aware Feature Learning for Parametric Point Cloud
- Discontinuity-aware Normal Integration for Generic Central Camera Models
- RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
- Inverse 3D Microscopy Rendering for Cell Shape Inference with Active Mesh
- Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints
- GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion
- MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
- Stochastic Gradient Estimation for Higher-Order Differentiable Rendering
- ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
- ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
- NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals
- Towards Safer and Understandable Driver Intention Prediction
- Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues
- HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
- Splat-based 3D Scene Reconstruction with Extreme Motion-blur
- VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
- Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View
- Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction
- Lifting the Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling
- MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching
- VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving
- Heatmap Regression without Soft-Argmax for Facial Landmark Detection
- SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
- SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
- You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
- CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
- LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
- Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves
- Robust Unfolding Network for HDR Imaging with Modulo Cameras
- EVT: Efficient View Transformation for Multi-Modal 3D Object Detection
- Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
- SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
- Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
- Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
- NeuFrameQ: Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation
- UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images
- Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
- Wide2Long: Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation
- Online Language Splatting
- Global Regulation and Excitation via Attention Tuning for Stereo Matching
- SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation
- EDM: Efficient Deep Feature Matching
- DONUT: A Decoder-Only Model for Trajectory Prediction
- MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
- GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors
- Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
- Driving View Synthesis on Free-form Trajectories with Generative Prior
- RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather
- EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
- AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering
- Debiasing Trace Guidance: Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection
- NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion