Poster
Wenhao Wang · Yi Yang
[ Exhibit Hall I ]
Abstract
Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce **TIP-I2V**, the first large-scale dataset of over 1.70 million unique user-provided **T**ext and **I**mage **P**rompts specifically for **I**mage-to-**V**ideo generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences from existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The dataset is anonymously available at https://huggingface.co/datasets/tipi2v/TIP-I2V.
Poster
Anand Kumar · Jiteng Mu · Nuno Vasconcelos
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns among artists regarding data privacy and copyright infringement. Gradually, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as Introspective Style attribution (IntroStyle) and is shown to perform superior to state-of-the-art models for style retrieval. We also introduce a synthetic Artistic Style Split (ArtSplit) dataset to isolate artistic style and evaluate fine-grained style attribution performance.
Poster
Pingchuan Ma · Xiaopei Yang · Ming Gui · Yusong Li · Felix Krause · Johannes Schusterbauer · Björn Ommer
[ Exhibit Hall I ]
Abstract
The human perception of style and content is inherently subjective and varies widely. Likewise, computer vision models learn diverse latent representations of these attributes. While generative models focus on stylization and content transfer, discriminative approaches aim to capture effective representations of style and content. However, explicitly defining these attributes remains inherently difficult. To address this, we propose a method that implicitly discovers style and content representations within a semantic-rich compact space, avoiding spatial token constraints. Leveraging flow matching, our framework effectively separates style and content without predefined definitions, offering a structured yet flexible representation that can be directly applied to any precomputed CLIP embeddings. To further facilitate this, we have curated a dataset of 510,000 samples (51 styles × 10,000 content samples) for training and evaluating our model. While our method provides a strong foundation for representation learning, it is also adaptable for controllable generation tasks. We demonstrate that our implicitly learned style and content representations generalize well to ImageNet-1k and WikiArt in a zero-shot fashion. We showcase promising visual results involving various styles and contents. We will release the code and the curated dataset.
Poster
Jinpei Guo · Zheng Chen · Wenbo Li · Yong Guo · YULUN ZHANG
[ Exhibit Hall I ]
Abstract
Diffusion models have demonstrated remarkable success in image restoration tasks. However, their multi-step denoising process introduces significant computational overhead, limiting their practical deployment. Furthermore, existing methods struggle to effectively remove severe JPEG artifacts, especially in highly compressed images. To address these challenges, we propose CODiff, a **c**ompression-aware **o**ne-step **diff**usion model for JPEG artifact removal. The core of CODiff is the compression-aware visual embedder (CaVE), which extracts and leverages JPEG compression priors to guide the diffusion model. We propose a dual learning strategy that combines explicit and implicit learning. Specifically, explicit learning enforces a quality prediction objective to differentiate low-quality images with different compression levels. Implicit learning employs a reconstruction objective that enhances the model's generalization. This dual learning allows for a deeper and more comprehensive understanding of JPEG compression. Experimental results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. The code and models will be released.
Poster
Zhenxiong Tan · Songhua Liu · Xingyi Yang · Qiaochu Xue · Xinchao Wang
[ Exhibit Hall I ]
Abstract
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only specific control tasks effectively, limiting their practical versatility. OminiControl addresses these limitations through three key innovations: (1) a minimal architectural design that leverages the DiT's own VAE encoder and transformer blocks, requiring just 0.1% additional parameters; (2) a unified sequence processing strategy that combines condition tokens with image tokens for flexible token interactions; and (3) a dynamic position encoding mechanism that adapts to both spatially-aligned and non-aligned control tasks. Our extensive experiments show that this streamlined approach not only matches but surpasses the performance of specialized methods across multiple conditioning tasks. To overcome data limitations in subject-driven generation, we also introduce Subjects200K, a large-scale dataset of identity-consistent image pairs synthesized using DiT models themselves. This work demonstrates that effective image control can be achieved without architectural complexity, opening new possibilities for efficient and versatile image generation systems.
Poster
Lijie Liu · Tianxiang Ma · Bingchuan Li · Zhuowei Chen · Jiawei Liu · Gen Li · SiYu Zhou · Qian HE · Xinglong Wu
[ Exhibit Hall I ]
Abstract
The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent videos following textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms other state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages.
Poster
Haoyang Xu · Tianhao Zhao · Sibei Yang · Yutian Lin
[ Exhibit Hall I ]
Abstract
Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation in these models is the incomplete display of objects, where fragments or missing parts can undermine the model's performance in downstream applications such as dataset synthesis and video generation using 2D prior-based models. In this study, we conduct an in-depth analysis of this issue and reveal that the primary culprit behind incomplete object generation is *RandomCrop*. This data augmentation method, widely used in training diffusion models, enhances generalization but disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values occurring at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.
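To make the proposed boundary penalty concrete, here is a minimal PyTorch sketch; the choice of activation map and the gradient-based latent update below are assumptions for illustration, not the paper's implementation.

```python
import torch

def boundary_penalty(act: torch.Tensor, border: int = 3) -> torch.Tensor:
    # Mean activation inside a frame of width `border` around the spatial edges.
    mask = torch.zeros_like(act)
    mask[..., :border, :] = 1.0
    mask[..., -border:, :] = 1.0
    mask[..., :, :border] = 1.0
    mask[..., :, -border:] = 1.0
    return (act * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: nudge a latent so a stand-in activation map has a low border response,
# as one would do only during the early denoising steps.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
activations = latent.abs()              # placeholder for real UNet feature maps
loss = boundary_penalty(activations)
loss.backward()
with torch.no_grad():
    latent -= 0.1 * latent.grad         # small corrective update to the latent
```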
Poster
Alejandro Pardo · Fabio Pizzati · Tong Zhang · Alexander Pondaven · Philip Torr · Juan Perez · Bernard Ghanem
[ Exhibit Hall I ]
Abstract
Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting impactful match-cuts is a challenging and resource-intensive process that requires deliberate artistic planning throughout the production pipeline. In this work, we introduce MatchDiffusion, a training-free method that uses text-to-video diffusion models to automatically generate match-cuts. As such, MatchDiffusion is the first method for match-cut generation. Our method leverages an inherent property of diffusion models, whereby the early denoising steps determine the broad appearance of the scene, while the latter steps add details. Motivated by this property, MatchDiffusion first performs "Joint Diffusion", by initializing generation for two prompts from a shared noise sample, and following a shared denoising path for the first denoising steps. This process results in the two videos sharing structural and motion characteristics. After Joint Diffusion, we then conduct "Disjoint Diffusion", allowing the videos' denoising paths to diverge and introduce their unique details. MatchDiffusion thus yields visually coherent videos that are amenable to match-cuts. We demonstrate the effectiveness of our method through user studies and metrics, showing its potential to democratize match-cut creation.
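A minimal PyTorch sketch of the joint/disjoint schedule described above; how the joint phase blends the two prompts (here, by averaging the two conditional updates on a single shared latent) and the `denoise_step` wrapper are assumptions for illustration.

```python
import torch

@torch.no_grad()
def joint_then_disjoint(denoise_step, x_T, cond_a, cond_b, num_steps=50, joint_steps=25):
    # x_T: shared initial noise; denoise_step(x, t, cond) is a hypothetical wrapper
    # around one reverse-diffusion step of a text-to-video model.
    x_a = x_b = x_T.clone()
    for i, t in enumerate(reversed(range(num_steps))):
        if i < joint_steps:
            # Joint Diffusion: one shared latent follows a blended denoising path,
            # fixing coarse structure and motion for both prompts.
            x_a = x_b = 0.5 * (denoise_step(x_a, t, cond_a) + denoise_step(x_a, t, cond_b))
        else:
            # Disjoint Diffusion: the paths diverge and add prompt-specific detail.
            x_a = denoise_step(x_a, t, cond_a)
            x_b = denoise_step(x_b, t, cond_b)
    return x_a, x_b

# Toy run with a stand-in denoiser (no real model involved).
step = lambda x, t, cond: 0.95 * x + 0.05 * cond
video_a, video_b = joint_then_disjoint(step, torch.randn(1, 8, 4, 32, 32),
                                        torch.zeros(1), torch.ones(1))
```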
Poster
Zhengyao Lyu · Chenyang Si · Tianlin Pan · Zhaoxi Chen · Kwan-Yee K. Wong · Yu Qiao · Ziwei Liu
[ Exhibit Hall I ]
Abstract
Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient **Dual-Expert Consistency Model (DCM)**, where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models will be made publicly available.
Poster
Yijing Lin · Mengqi Huang · Shuhan Zhuang · Zhendong Mao
[ Exhibit Hall I ]
Abstract
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task.
Poster
Jimin Dai · Jiexi Yan · Jian Yang · lei luo
[ Exhibit Hall I ]
Abstract
The Reflow operation aims to straighten the inference trajectories of the rectified flow during training by constructing deterministic couplings between noises and images, thereby improving the quality of generated images in single-step or few-step generation. However, we identify critical limitations in Reflow, particularly its inability to rapidly generate high-quality images due to a distribution gap between images in its constructed deterministic couplings and real images. To address these shortcomings, we propose a novel alternative called Straighten Viscous Rectified Flow via Noise Optimization (VRFNO), a joint training framework integrating an encoder and a neural velocity field. VRFNO introduces two key innovations: (1) a historical velocity term that enhances trajectory distinction, enabling the model to more accurately predict the velocity of the current trajectory, and (2) noise optimization through reparameterization to form optimized couplings with real images, which are then utilized for training, effectively mitigating errors caused by Reflow's limitations. Comprehensive experiments on synthetic data and real datasets with varying resolutions show that VRFNO significantly mitigates the limitations of Reflow, achieving state-of-the-art performance in both one-step and few-step generation tasks.
Poster
Yuekun Dai · Haitian Li · Shangchen Zhou · Chen Change Loy
[ Exhibit Hall I ]
Abstract
RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference alpha edge quality evaluation metric for assessing transparency edge quality. Experimental results show that our approach outperforms existing pipelines. Our code and benchmark will be publicly available.
Poster
Jianwei Fei · Yunshu Dai · Peipeng Yu · Zhe Kong · Jiantao Zhou · Zhihua Xia
[ Exhibit Hall I ]
Abstract
The commercialization of generative artificial intelligence (GenAI) has led to a multi-level ecosystem involving model developers, service providers, and consumers. Thus, ensuring traceability is crucial, as service providers may violate intellectual property rights (IPR), and consumers may generate harmful content. However, existing methods are limited to single-level attribution scenarios and cannot simultaneously trace across multiple levels. To this end, we introduce a scalable dual fingerprinting method for text-to-image (T2I) models, to achieve traceability of both service providers and consumers. Specifically, we propose 2-headed Fingerprint-Informed Low-Rank Adaptation (FI-LoRA), where each head is controlled by a binary fingerprint and capable of introducing the fingerprints into generated images. In practice, one FI-LoRA head is used by the developer to assign a unique fingerprint to each service provider, while the other is made available to service providers for embedding consumer-specific fingerprints during image generation. Our method does not merely embed two fingerprints within the generated image but instead allows independent control over them at the developer and business levels, enabling simultaneous traceability of businesses and consumers. Experiments show that our method applies to various image generation and editing tasks of multiple T2I models, and can achieve over 99.9% extraction accuracy for both fingerprints. Our …
Poster
Junyi Wu · Zhiteng Li · Zheng Hui · YULUN ZHANG · Linghe Kong · Xiaokang Yang
[ Exhibit Hall I ]
Abstract
Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and caching mechanisms, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72× on Open-Sora with minimal loss in generation quality. Extensive evaluations across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. We will release all code and models to facilitate further research.
Poster
Shivani Mall · Joao F. Henriques
[ Exhibit Hall I ]
Abstract
Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e., storing video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current …
Poster
jiale chen · Wei Wang · Chongyang Shi · Li Dong · Xiping Hu
[ Exhibit Hall I ]
Abstract
Watermarking as a traceable authentication technology has been widely applied in image copyright protection. However, most existing watermarking methods embed watermarks by adding irremovable perturbations to the cover image, causing permanent distortion. To address this issue, we propose a novel watermarking approach termed \textbf{C}over-\textbf{R}ecoverable Water\textbf{Mark} (CRMark). CRMark can losslessly recover the cover image and watermark in lossless channels and enables robust watermark extraction in lossy channels. CRMark leverages an integer Invertible Watermarking Network (iIWN) to achieve a lossless invertible mapping between the cover-image-watermark pair and the stego image. During the training phase, CRMark employs an encoder-noise-layer-decoder architecture to enhance its robustness against distortions. In the inference phase, CRMark first maps the cover-image-watermark pair into an overflowed stego image and a latent variable. Subsequently, the overflowed pixels and the latent variable are losslessly compressed into an auxiliary bitstream, which is then embedded into the clipped stego image using reversible data hiding. During extraction, in lossy channels, the noised stego image can directly undergo inverse mapping via iIWN to extract the watermark. In lossless channels, the latent variable and overflowed stego image are first recovered using reversible data hiding, followed by watermark extraction through iIWN. Extensive experimental results demonstrate that CRMark can …
Poster
Yangyang Xu · Bangzhen Liu · Wenqi Shao · Yong Du · Shengfeng He · Tingting Zhu
[ Exhibit Hall I ]
Abstract
Decoding stimulus images from fMRI signals has advanced with pre-trained generative models. However, existing methods struggle with cross-subject mappings due to cognitive variability and subject-specific differences. This challenge arises from sequential errors, where unidirectional mappings generate partially inaccurate representations that, when fed into diffusion models, accumulate errors and degrade reconstruction fidelity. To address this, we propose the Bidirectional Autoencoder Intertwining framework for accurate mind representation prediction. Our approach unifies multiple subjects through a Subject Bias Modulation Module while leveraging bidirectional mapping to better capture data distributions for precise representation prediction. To further enhance fidelity when decoding representations into stimulus images, we introduce a Semantic Refinement Module to improve semantic representations and a Visual Coherence Module to mitigate the effects of inaccurate visual representations. Integrated with ControlNet and Stable Diffusion, our method outperforms state-of-the-art approaches on benchmark datasets in both qualitative and quantitative evaluations. Moreover, our framework exhibits strong adaptability to new subjects with minimal training samples.
Poster
Jiancheng Zhao · Yifan Zhan · Qingtian Zhu · Mingze Ma · Muyao Niu · Zunian Wan · Xiang Ji · Yinqiang Zheng
[ Exhibit Hall I ]
Abstract
Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.
Poster
Yuhang Ma · Keqiang Sun · Xiaoshi Wu · Hongsheng Li
[ Exhibit Hall I ]
Abstract
Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3), which comprises: (1) HPDv3, the first full-spectrum human preference dataset integrating 1.7M text-image pairs and 1M annotated pairwise comparisons from state-of-the-art generative models and high-quality real-world images, and (2) a preference model leveraging VLM-based feature extraction and RankNet loss for fine-grained ranking. Furthermore, we propose Chain-of-Human-Preference (CoHP), a novel reasoning approach for iterative image refinement. CoHP improves image quality efficiently without requiring additional training data. By using HPSv3 as a reward model, CoHP ensures that the highest-quality image is selected at each iteration, progressively enhancing the output. Extensive experiments demonstrate that HPSv3 serves as a robust benchmark for full-spectrum image evaluation, and CoHP offers an efficient, human-aligned approach to enhancing image generation quality.
Poster
Kwanseok Kim · Jaehoon Hahm · Sumin Kim · Jinhwan Sul · Byung-Hak Kim · Joonseok Lee
[ Exhibit Hall I ]
Abstract
Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame score over multiple raters, ignoring the inherent subjectivity of what constitutes a "good" summary. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves the state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide deeper insight with novel metrics derived from an analysis of the knapsack step, an important final stage of summary generation that has been overlooked in evaluation.
Poster
Xueqing Deng · Linjie Yang · Qihang Yu · Chenglin Yang · Liang-Chieh (Jay) Chen
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) models have advanced rapidly with diffusion-based breakthroughs, yet their evaluation remains challenging. Human assessments are costly, and existing automated metrics lack accurate compositional understanding. To address these limitations, we introduce PSG-Bench, a novel benchmark featuring 5K text prompts designed to evaluate the capabilities of advanced T2I models. Additionally, we propose PSGEval, a scene graph-based evaluation metric that converts generated images into structured representations and applies graph matching techniques for accurate and scalable assessment. PSGEval is a detection-based evaluation metric that does not rely on QA generation. Our experimental results demonstrate that PSGEval aligns well with human evaluations, mitigating biases present in existing automated metrics. We further provide a detailed ranking and analysis of recent T2I models, offering a robust framework for future research in T2I evaluation.
Poster
Nisha Huang · Henglin Liu · Yizhou Lin · Kaer Huang · Chubin Chen · Jie Guo · Tong-Yee Lee · Xiu Li
[ Exhibit Hall I ]
Abstract
Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.
Poster
Yifei Zhang · Lei Chen
[ Exhibit Hall I ]
Abstract
Driven by large-scale model iterations, the inference speed and generalization ability of 3D model generation have improved significantly. However, the quality of existing methods still falls short of enabling direct use without post-processing. Common issues include insufficient texture clarity, loss of semantic information, lack of fine-grained detail, and the generation of redundant artifacts. Moreover, current approaches focus solely on producing static structures, where individual components remain non-movable, without considering functional applications in the generation process. To address these limitations, we draw inspiration from LEGO-like modular construction and decompose complex models into semantically functional components. We propose LEGO-Maker, a novel framework that reformulates the text-to-3D task into a three-stage process: target image generation, functional semantic decomposition, and multi-task 3D generation with structured fusion. Leveraging a reorganized high-quality 3D dataset, we train a Diffusion model and a semantic segmentation model tailored for 3D generation tasks. Additionally, we design a motion-driven mechanism to introduce action sequences for functionally interactive modules after model fusion. Experimental results demonstrate that, compared to existing methods, our approach significantly enhances semantic understanding, model detail quality, and text consistency while showcasing direct applicability across various scenarios.
Poster
Peng Cai · liqiang liqiang · Kaicheng Yang · guodong guodong · lijia lijia · zhounan zhounan · Xiang An · Ninghua Yang · Jiankang Deng
[ Exhibit Hall I ]
Abstract
Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce **For**eground-**Cen**tric **Net**work (**ForCenNet**) to eliminate geometric distortions in document images. Specifically, we initially propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art results on four real-world benchmarks: DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. Our training code and pre-trained models will be released to facilitate future research.
Poster
Shoubin Yu · Difan Liu · Ziqiao Ma · Yicong Hong · Yang Zhou · Hao Tan · Joyce Chai · Mohit Bansal
[ Exhibit Hall I ]
Abstract
Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models …
Poster
Songlin Yang · Yushi LAN · Honghua Chen · Xingang Pan
[ Exhibit Hall I ]
Abstract
Textured 3D morphing creates smooth and plausible interpolation sequences between two 3D objects, focusing on transitions in both shape and texture. This is important for creative applications like visual effects in filmmaking. Previous methods rely on establishing point-to-point correspondences and determining smooth deformation trajectories, which inherently restrict them to shape-only morphing on untextured, topologically aligned datasets. This restriction leads to labor-intensive preprocessing and poor generalization. To overcome these challenges, we propose a method for 3D regenerative morphing using a 3D diffusion prior. Unlike previous methods that depend on explicit correspondences and deformations, our method eliminates the additional need for obtaining correspondence and uses the 3D diffusion prior to generate morphing. Specifically, we first introduce a 3D diffusion model and interpolate the source and target information at three levels: initial noise, model parameters, and condition features. We then explore an Attention Fusion strategy to generate smoother morphing sequences. To further improve the plausibility of semantic interpolation and the generated 3D surfaces, we propose two strategies: (a) Token Reordering, where we match approximate tokens based on semantic analysis to guide implicit correspondences in the denoising process of the diffusion model, and (b) Low-Frequency Enhancement, where we enhance low-frequency signals in the tokens …
Poster
Chao Zhou · Tianyi Wei · Nenghai Yu
[ Exhibit Hall I ]
Abstract
Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose **Self-Adaptive Attention Scaling (SaaS)**, a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods.
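For intuition, a minimal PyTorch sketch of how adjacent-timestep cross-attention consistency could drive per-sub-instruction scaling; the specific rule below (boosting sub-instructions whose attention maps are inconsistent across timesteps) is an assumption for illustration, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def saas_scales(attn_t, attn_prev, base=1.0, strength=2.0):
    # attn_t, attn_prev: [num_sub_instructions, H*W] cross-attention maps at adjacent
    # timesteps, aggregated over heads/tokens of each sub-instruction.
    sim = F.cosine_similarity(attn_t, attn_prev, dim=-1)      # high = consistent attention
    return base + strength * (1.0 - sim).clamp(min=0.0)       # boost inconsistent ones

# Usage: multiply each sub-instruction's attention activations by its scale.
attn_t = torch.rand(3, 64 * 64)
attn_prev = torch.rand(3, 64 * 64)
scaled = attn_t * saas_scales(attn_t, attn_prev).unsqueeze(-1)
```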
Poster
Shengfang ZHAI · Jiajun Li · Yue Liu · Huanran Chen · Zhihua Tian · Wenjie Qu · Qingni Shen · Ruoxi Jia · Yinpeng Dong · Jiaheng Zhang
[ Exhibit Hall I ]
Abstract
In recent years, text-to-image (T2I) diffusion models have garnered significant attention for their ability to generate high-quality images reflecting text prompts. However, their growing popularity has also led to the emergence of backdoor threats, posing substantial risks. Currently, effective defense strategies against such threats are lacking due to the diversity of backdoor targets in T2I synthesis. In this paper, we propose NaviDet, the first general input-level backdoor detection framework for identifying backdoor inputs across various backdoor targets. Our approach is based on the new observation that trigger tokens tend to induce significant neuron activation variation in the early stage of the diffusion generation process, a phenomenon we term Early-step Activation Variation. Leveraging this insight, NaviDet detects malicious samples by analyzing neuron activation variations caused by input tokens. Through extensive experiments, we demonstrate the effectiveness and efficiency of our method against various T2I backdoor attacks, surpassing existing baselines with significantly lower computational overhead. Furthermore, we rigorously demonstrate that our method remains effective against potential adaptive attacks.
Poster
Yi Liu · Shengqian Li · Zuzeng Lin · Feng Wang · Si Liu
[ Exhibit Hall I ]
Abstract
Current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation enabling iterative refinement across scales and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
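Since the softmax relaxation is concrete, here is a minimal PyTorch illustration of the idea; the temperature value and any straight-through refinements used in practice are assumptions or omissions on our part.

```python
import torch
import torch.nn.functional as F

def softmax_relaxed_quantize(z, codebook, tau=1.0):
    # z: [N, D] encoder outputs; codebook: [K, D] codebook entries.
    # Instead of a hard nearest-codeword lookup, mix codewords with softmax weights
    # over negative distances so gradients flow from the decoder back through the
    # quantizer; tau -> 0 approaches hard vector quantization.
    d = torch.cdist(z, codebook)            # [N, K] pairwise distances
    w = F.softmax(-d / tau, dim=-1)         # soft assignment probabilities
    return w @ codebook                     # [N, D] differentiable "quantized" codes

z = torch.randn(8, 16, requires_grad=True)
codebook = torch.randn(512, 16)
zq = softmax_relaxed_quantize(z, codebook, tau=0.5)
zq.sum().backward()                         # gradient reaches z, unlike argmin-based VQ
```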
Poster
yinhan Zhang · Yue Ma · Bingyuan Wang · Qifeng Chen · Zeyu Wang
[ Exhibit Hall I ]
Abstract
We present MagicColor, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. Artists are required to repeat this process to color each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance pair data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose the self-play training strategy to solve the lack of training data. Then, an instance guider is introduced to provide the color of each instance. To achieve accurate color matching, we present fine-grained color matching with an edge loss to enhance visual quality. Equipped with the proposed modules, MagicColor automatically transforms sketches into vividly colored animations that remain consistent with the multi-reference characters. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. Our model could even automate the colorization process, such that users …
Poster
Shengqi Dang · Yi He · Long Ling · Ziqing Qian · Nanxuan Zhao · Nan Cao
[ Exhibit Hall I ]
Abstract
Recent research shows that emotions can enhance users' cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this paper, we introduce the task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, a general emotional image generation model that generates images based on free text prompts and Valence-Arousal (V-A) values. It leverages a novel emotion-embedding mapping network to fuse V-A values into textual features, enabling the capture of emotions in alignment with intended input prompts. A novel loss function is also proposed to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.
Poster
Longfei Huang · Yu Liang · Hao Zhang · Jinwei Chen · Wei Dong · Lunde Chen · Wanyu Liu · Bo Li · Peng-Tao Jiang
[ Exhibit Hall I ]
Abstract
Recent interactive matting methods have demonstrated satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of the pre-trained U-Net within diffusion models and transform the text-driven interaction mechanism into a visual prompt-driven interaction mechanism to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of objects into U-Net, enhancing SDMatte's sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism and a visual prompt-driven interaction mechanism that enable the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Code will be made publicly available.
Poster
Kumara Kahatapitiya · Haozhe Liu · Sen He · Ding Liu · Menglin Jia · Chenyang Zhang · Michael Ryoo · Tian Xie
[ Exhibit Hall I ]
Abstract
Generating temporally consistent, high-fidelity videos can be computationally expensive, especially over longer temporal spans. Recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a method to accelerate video DiTs, termed Adaptive Caching (AdaCache), motivated by the fact that 'not all videos are created equal': some videos require fewer denoising steps than others to attain reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.
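A toy PyTorch sketch of the underlying caching idea; the adaptive, per-video schedule and the motion-regularization term are omitted, and the relative-change criterion below is an assumption rather than AdaCache's actual metric.

```python
import torch

class CachedResidualBlock(torch.nn.Module):
    # Cache the residual branch f(x) of a transformer block and reuse it at later
    # denoising steps whenever the block input has changed little, so the full layer
    # x + f(x) becomes nearly free on "easy" steps.
    def __init__(self, fn, threshold=0.05):
        super().__init__()
        self.fn = fn                      # residual branch, e.g. attention or MLP
        self.threshold = threshold
        self.prev_x = None
        self.prev_residual = None

    def forward(self, x):
        if self.prev_x is not None:
            rel_change = (x - self.prev_x).norm() / (self.prev_x.norm() + 1e-8)
            if rel_change < self.threshold:
                return x + self.prev_residual   # reuse cached residual, skip the block
        r = self.fn(x)
        self.prev_x, self.prev_residual = x.detach(), r.detach()
        return x + r

layer = CachedResidualBlock(torch.nn.Sequential(torch.nn.LayerNorm(64), torch.nn.Linear(64, 64)))
x = torch.randn(2, 16, 64)
y1 = layer(x)            # computes and caches the residual
y2 = layer(x * 1.001)    # tiny input change -> cached residual is reused
```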
Poster
Gaoyang Zhang · Bingtao Fu · Qingnan Fan · Qi Zhang · Runxing Liu · Hong Gu · Huaqi Zhang · Xinguo Liu
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) diffusion models excel at generating photorealistic images, but commonly struggle to render accurate spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances spatial understanding of any T2I diffusion model. CoMPaSS solves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module, effectively compensating for the shortcomings of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet- and MMDiT-based architectures demonstrate the effectiveness of CoMPaSS by setting new state-of-the-art results with substantial relative gains across well-known benchmarks on spatial relationship generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%).
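To make the flavor of constraint-based data curation concrete, here is a small Python check in the spirit of SCOP; the specific margin and overlap thresholds are assumptions for illustration, not the paper's constraint set.

```python
def satisfies_left_of(box_a, box_b, margin=0.05, max_iou=0.1):
    # Boxes are (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    # Accept an "A left of B" caption only if A's center is clearly left of B's
    # and the two boxes barely overlap, rejecting spatially ambiguous pairs.
    ax = (box_a[0] + box_a[2]) / 2
    bx = (box_b[0] + box_b[2]) / 2
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box_a) + area(box_b) - inter + 1e-8)
    return (bx - ax) > margin and iou < max_iou

print(satisfies_left_of((0.05, 0.2, 0.35, 0.8), (0.55, 0.25, 0.9, 0.75)))  # True
```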
Poster
Guangben Lu · Yuzhen N/A · Zhimin Sun · Ran Yi · Yifan Qi · Yizhe Tang · Tianyi Wang · Lizhuang Ma · FangYuan Zou
[ Exhibit Hall I ]
Abstract
Foreground-conditioned inpainting aims to seamlessly fill the background region of an image by utilizing the provided foreground subject and a text description. While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. Firstly, we design a Self-Consistent Adapter that integrates the foreground subject features into the layout-related self-attention layer, which helps to alleviate conflicts between the text and subject features by ensuring that the model can effectively consider the foreground subject's characteristics while processing the overall image layout. Secondly, we design a Decoupled Image Feature Extraction method that employs distinct architectures to extract semantic and spatial features separately, significantly improving subject feature extraction and ensuring high-quality preservation of the subject's shape. Thirdly, to ensure precise utilization of the extracted features and to focus attention on the subject region, we introduce a Shared Positional Embedding Anchor, greatly improving the model's understanding …
Poster
Qingyan Bai · Hao Ouyang · Yinghao Xu · Qiuyu Wang · Ceyuan Yang · Ka Leong Cheng · Yujun Shen · Qifeng Chen
[ Exhibit Hall I ]
Abstract
As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.
Poster
YuanFu Yang · Hsiu-Hui Hsiao
[ Exhibit Hall I ]
Abstract
This paper presents the Implicit Knowledge Distillation Diffusion Transformer (IKDDiT), a groundbreaking model tailored for photolithography overlay map generation in semiconductor manufacturing. IKDDiT effectively addresses the challenges of open-vocabulary overlay map generation by integrating pre-trained image-text encoders, diffusion models, and masked transformers. Utilizing advanced text-to-image diffusion and image-text discriminative models, it generates high-fidelity overlay maps across multiple photolithography layers, significantly mitigating overlay misregistration errors and minimizing productivity losses caused by wafer rework. Key innovations include an implicit knowledge distillation framework that refines inter-image alignment by decoupling discriminative and generative tasks via an implicit discriminator, as well as a gated cross-attention mechanism to enhance generative performance. Experimental results demonstrate that IKDDiT achieves an optimal trade-off between efficiency and accuracy, providing a scalable, robust solution poised to advance overlay map generation in semiconductor processes.
Poster
Worameth Chinchuthakun · Tossaporn Saengja · Nontawat Tritrong · Pitchaporn Rewatbowornwong · Pramook Khungurn · Supasorn Suwajanakorn
[ Exhibit Hall I ]
Abstract
While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering-normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, improving successful edits while preserving the background. Users also preferred our method over state-of-the-art techniques across three metrics, and by 58-64% overall.
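A minimal PyTorch sketch of how the two modifications could be applied to a score-distillation gradient; the clipping percentile and unit-norm rescaling are assumed forms, not the paper's exact operations.

```python
import torch

def filter_normalize_grad(grad, attn_mask, clip_pct=0.95, eps=1e-8):
    # (1) Attention-based spatial regularization: keep the update inside
    #     attention-highlighted regions.
    # (2) Filtering-normalization: clip rare outlier magnitudes and rescale so the
    #     step size is comparable across inputs.
    g = grad * attn_mask
    thresh = torch.quantile(g.abs().flatten(), clip_pct).item()
    g = g.clamp(-thresh, thresh)
    return g / (g.norm() + eps)

latent = torch.randn(1, 4, 64, 64)
raw_grad = 10.0 * torch.randn_like(latent)         # stand-in score-distillation gradient
mask = (torch.rand_like(latent) > 0.5).float()     # stand-in cross-attention mask
latent = latent - 0.5 * filter_normalize_grad(raw_grad, mask)
```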
Poster
Maitreya Patel · Song Wen · Dimitris Metaxas · Yezhou Yang
[ Exhibit Hall I ]
Abstract
Despite recent advances in Rectified Flow Models (RFMs), unlocking their full potential for controlled generation tasks—such as inverse problems and image editing—remains a significant hurdle. Although RFMs and Diffusion Models (DMs) represent state-of-the-art approaches in generative modeling, their reliance on computationally demanding backpropagation through ODE solvers and inversion strategies often undermines efficiency and precision. In this paper, we present `FlowChef`, a novel training-, inversion-, and gradient-free inference-time steering strategy for RFMs that deterministically guides the denoising process. We first develop a theoretical and empirical understanding of the vector-field dynamics of RFMs in efficiently guiding the denoising trajectory. Specifically, leveraging the straightness and smooth Jacobian properties, we derive the mathematical relationship between gradients of rectified flow ODEs. We extend our theoretical findings to solve linear inverse problems, image editing, classifier guidance, and many more tasks. We perform extensive evaluations and show that `FlowChef` significantly exceeds baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. Remarkably, for the first time, it scales effortlessly to billion-parameter models such as Flux. We release code and demos at: https://anonymous.4open.science/r/FlowChef/
Poster
Xiaohui Li · Yihao Liu · Shuo Cao · Chen Ziyan · SHAOBIN ZHUANG · Xiangyu Chen · Yinan He · Yi Wang · Yu Qiao
[ Exhibit Hall I ]
Abstract
Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos--precisely where diffusion models' generative capabilities are most needed. We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that addressing the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.
Poster
Le Zhuo · Liangbing Zhao · Sayak Paul · Yue Liao · Renrui Zhang · Yi Xin · Peng Gao · Mohamed Elhoseiny · Hongsheng Li
[ Exhibit Hall I ]
Abstract
Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and, most notably, (3) reflection-level scaling, which explicitly models actionable reflections to iteratively assess and correct previously generated images. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 800K triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently fine-tune the state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks. All code, checkpoints, and datasets will be released soon.
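A schematic Python sketch of the reflection-level scaling loop, written with hypothetical `generate`, `reflect`, `refine`, and `score` callables; the real pipeline relies on the fine-tuned FLUX.1-dev model and the GenRef data, so everything below is a stand-in.

```python
def reflect_and_refine(prompt, generate, reflect, refine, score, rounds=3):
    # Generate, critique, refine conditioned on the critique, and keep the best
    # candidate under a reward/score model (e.g. a human-preference scorer).
    best = generate(prompt)
    best_score = score(prompt, best)
    current = best
    for _ in range(rounds):
        feedback = reflect(prompt, current)          # textual reflection on what to fix
        current = refine(prompt, current, feedback)  # re-generate conditioned on the critique
        s = score(prompt, current)
        if s > best_score:
            best, best_score = current, s
    return best

# Toy run with stand-in components (no real models involved).
img = reflect_and_refine(
    "a cat riding a bicycle",
    generate=lambda p: {"prompt": p, "quality": 0.3},
    reflect=lambda p, im: "sharpen the wheels",
    refine=lambda p, im, fb: {**im, "quality": im["quality"] + 0.2},
    score=lambda p, im: im["quality"],
)
```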
Poster
Teng Zhou · Xiaoyu Zhang · Yongchuan Tang
[ Exhibit Hall I ]
Abstract
Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. Most existing methods fall within the joint diffusion paradigm, but their complex and heuristic crop-connection designs often limit their ability to achieve multilevel coherence. By deconstructing this challenge into its core components, we find it naturally aligns with next-token prediction, leading us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation, lacking the capability to produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that utilizes token redirection to overcome the size limitations of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance in coherence (47.50\%), fidelity (28.16\%), and aesthetics (15\%). Additionally, PanoLlama supports applications that other PIG methods cannot achieve, including mask-free layout control and multi-scale, multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset with 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research.
Poster
Fitim Abdullahu · Helmut Grabner
[ Exhibit Hall I ]
Abstract
Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention -- the definition of (visual) interestingness -- is essential. The rise of Large Multimodal Models (LMMs) trained on large-scale visual and textual data has demonstrated impressive capabilities. We explore the extent to which these models capture the concept of visual interestingness and examine the alignment between human assessments and the predictions of GPT-4o, a leading LMM, through comparative analysis. Our studies reveal partial alignment between humans and GPT-4o; moreover, GPT-4o captures the concept better than state-of-the-art methods. This allows for the effective labeling of image pairs according to their (common) interestingness, which we use as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.
Poster
Eunseo Koh · SeungHoo Hong · Tae-Young Kim · Jae-Pil Heo · Simon Woo
[ Exhibit Hall I ]
Abstract
Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt the delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enable more precise suppression in personalized T2I models by optimizing the delta vector, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods, in terms of both quantitative and qualitative …
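As a rough illustration of operating on the prompt embedding, the sketch below subtracts a concept direction from selected token embeddings. The function and the commented delta estimate are hypothetical; the paper's zero-shot construction and its SSDV cross-attention integration are not reproduced here.

```python
import torch

def suppress_with_delta(text_emb, delta, strength=1.0, token_idx=None):
    """Minimal sketch of suppressing entangled content in text-embedding space.

    `text_emb` is a (seq_len, dim) prompt embedding and `delta` a direction
    associated with the unwanted concept. The update rule is illustrative only.
    """
    emb = text_emb.clone()
    if token_idx is None:
        emb = emb - strength * delta                    # shift the whole prompt embedding
    else:
        emb[token_idx] = emb[token_idx] - strength * delta  # shift selected tokens only
    return emb

# One plausible zero-shot delta (hypothetical `embed` helper):
# delta = embed("Charlie Chaplin with a mustache") - embed("Charlie Chaplin")
```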
Poster
Junhyuk So · Juncheol Shin · Hyunho Kook · Eunhyeok Park
[ Exhibit Hall I ]
Abstract
Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7× while preserving image quality—all without requiring any additional training.
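The following is a hedged sketch of a grouped acceptance test: a draft token is accepted if it is visually close to any high-probability token of the target model. The codebook lookup, the top-k grouping, and the thresholds are illustrative assumptions, not GSD's actual acceptance rule.

```python
import torch

def grouped_accept(draft_token, target_logits, codebook, tau=0.1, top_k=32):
    """Toy grouped acceptance for speculative decoding of image tokens.

    `target_logits` are the target model's logits for the current position and
    `codebook` maps token ids to embedding vectors. Instead of accepting only
    the single most likely target token, the draft is accepted if its embedding
    lies within distance `tau` of any of the top-k plausible target tokens.
    """
    probs = torch.softmax(target_logits, dim=-1)
    topk_ids = probs.topk(top_k).indices               # plausible target tokens
    draft_vec = codebook[draft_token]
    dists = torch.cdist(draft_vec[None, :], codebook[topk_ids]).squeeze(0)
    return bool((dists < tau).any())                   # accept if visually close to any of them
```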
Poster
Bhishma Dedhia · David Bourgin · Krishna Kumar Singh · Yuheng Li · Yan Kang · Zhan Xu · Niraj Jha · Yuchen Liu
[ Exhibit Hall I ]
Abstract
Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy inputs of local chunks, and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40\% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality …
Poster
Aniket Roy · Shubhankar Borse · Shreya Kadambi · Debasmit Das · Shweta Mahajan · Risheek Garrepalli · Hyojin Park · Ankita Nayak · Rama Chellappa · Munawar Hayat · Fatih Porikli
[ Exhibit Hall I ]
Abstract
We tackle the challenge of jointly personalizing content and style from a few examples. A promising approach is to train separate Low-Rank Adapters (LoRA) and merge them effectively, preserving both content and style. Existing methods, such as ZipLoRA, treat content and style as independent entities, merging them by learning masks in LoRA's output dimensions. However, content and style are intertwined, not independent. To address this, we propose DuoLoRA—a content-style personalization framework featuring three key components: (1) rank-dimension mask learning, (2) effective merging via layer priors, and (3) Constyle loss, which leverages cycle-consistency in the merging process. First, we introduce ZipRank, which performs content-style merging within the rank dimension, offering adaptive rank flexibility and significantly reducing the number of learnable parameters. Additionally, we incorporate SDXL layer priors to apply implicit rank constraints informed by each layer’s content-style bias and adaptive merger initialization, enhancing the integration of content and style. To further refine the merging process, we introduce Constyle loss, which leverages the cycle consistency between content and style. Our experimental results demonstrate that DuoLoRA outperforms state-of-the-art content-style merging methods across multiple benchmarks.
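To contrast rank-dimension merging with masking in the output dimension, here is a hedged sketch of gating two LoRA updates per rank component before summing them. The gating form, shapes, and names are assumptions for illustration, not the exact ZipRank formulation.

```python
import torch

def merge_loras_rankwise(A_c, B_c, A_s, B_s, m_c, m_s):
    """Toy rank-dimension merge of a content LoRA and a style LoRA.

    Each LoRA contributes a low-rank weight update B @ A, with A of shape
    (r, d_in) and B of shape (d_out, r). Learnable per-rank gates m_c and m_s
    (vectors of length r) scale the rank components before the two updates are
    summed into a single delta for the frozen base weight.
    """
    delta_content = B_c @ torch.diag(m_c) @ A_c   # rank-wise gated content update
    delta_style = B_s @ torch.diag(m_s) @ A_s     # rank-wise gated style update
    return delta_content + delta_style            # add this to the base layer's weight
```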
Poster
Dae-Young Song · Jung-Jae Yu · Donghyeon Cho
[ Exhibit Hall I ]
Abstract
Latent diffusion models have demonstrated superior performance over traditional methods in generating highly detailed and aesthetically pleasing images, which makes them widely used for various image generation and editing tasks, including outpainting. However, most LDM-based outpainting methods impose constraints on resolution and aspect ratio, often leading to the loss of local details and blurring. One way to address these issues is progressive outpainting, where the image is extended outward incrementally. However, naive progressive outpainting suffers from two key challenges: (1) difficulty in effectively capturing global context, making it hard to maintain the original context, and (2) a tendency to generate unnatural patterns. These challenges are particularly pronounced in art, where artists pre-design the composition before painting. As a result, existing methods often introduce visual inconsistencies that distract the viewer and diminish the intended artistic emphasis. To address these limitations, we propose two types of composition planning modules that enhance progressive outpainting by leveraging global structural guidance. These modules guide a pre-trained stable diffusion model to consider the overall composition, enabling realistic and contextually appropriate artwork completion without labor-intensive user prompts. Through experiments on diverse artwork images, we show the effectiveness of our proposed method both quantitatively and qualitatively.
Poster
Han Fang · Kejiang Chen · Zehua Ma · Jiajun Deng · Yicong Li · Weiming Zhang · Ee-Chien Chang
[ Exhibit Hall I ]
Abstract
Robustness is critical for generative image watermarking and is typically achieved by injecting distortion-invariant watermark features. The leading paradigm, \emph{i.e.}, the inversion-based framework, excels against non-geometric distortions but struggles with geometric ones. Due to the complexity of geometric distortions, finding universally geometric-invariant features is challenging, and it is not clear whether such an invariant representation exists. To address this, we propose SynTag, a \textbf{syn}chronization \textbf{tag} injection-based method that enhances geometric robustness in inversion-based schemes. Instead of seeking invariant representations, we embed a sensitive template feature alongside the watermarking features. This template evolves with geometric distortions, allowing us to reconstruct the distortion trajectory for correction before extraction. Focusing on latent diffusion models, we fine-tune the VAE decoder to inject the invisible SynTag feature, pairing it with a prediction network for extraction and correction. Additionally, we introduce a dither compensation mechanism to further improve correction accuracy. SynTag is highly compatible with existing inversion-based methods. Extensive experiments demonstrate a significant boost in geometric distortion robustness while maintaining resilience against non-geometric distortions.
Poster
Hongjae Lee · Myungjun Son · Dongjea Kang · Seung-Won Jung
[ Exhibit Hall I ]
Abstract
Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
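To illustrate what prompt- and timestep-aware mixed-precision selection can look like in practice, here is a toy bit-width policy. The complexity score, sensitivity map, weights, and thresholds are all illustrative assumptions and not QLIP's actual selection mechanism.

```python
def select_bitwidths(prompt_score, layer_sensitivities, t, T, bits=(4, 6, 8)):
    """Toy policy assigning a bit width to each layer at timestep t.

    `prompt_score` in [0, 1] is an assumed measure of prompt complexity and
    `layer_sensitivities` maps layer names to quantization sensitivity in [0, 1].
    Earlier (noisier) timesteps and more sensitive layers receive more precision.
    """
    plan = {}
    time_weight = t / T                      # larger for early, high-noise steps
    for name, sens in layer_sensitivities.items():
        score = 0.5 * sens + 0.3 * prompt_score + 0.2 * time_weight
        if score > 0.66:
            plan[name] = bits[2]             # keep the most sensitive layers at 8 bits
        elif score > 0.33:
            plan[name] = bits[1]
        else:
            plan[name] = bits[0]             # aggressive 4-bit for robust layers
    return plan
```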
Poster
Guibao SHEN · Luozhou Wang · Jiantao Lin · Wenhang Ge · CHAOZHE ZHANG · Xin Tao · Di ZHANG · Pengfei Wan · Guangyong Chen · Yijun Li · Ying-Cong Chen
[ Exhibit Hall I ]
Abstract
Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves upon the fully connected, transformer-based text representation. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset, MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.
Poster
Rongyao Fang · Chengqi Duan · Kun Wang · Hao Li · Linjiang Huang · Hao Tian · Xingyu Zeng · Rui Zhao · Jifeng Dai · Hongsheng Li · Xihui Liu
[ Exhibit Hall I ]
Abstract
Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models for visual content generation. However, existing approaches face a trade-off between generation diversity and controllability, struggling to meet the varying granularity demands of different image generation tasks within a unified MLLM framework. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. PUMA achieves this by unifying multi-granular visual features as both inputs and outputs of MLLMs, thus effectively meeting the distinct granularity needs for diverse generation and precise manipulation within a single framework. Following multimodal pretraining and instruction tuning, PUMA demonstrates remarkable capabilities in a wide range of multimodal tasks, including image understanding, diverse text-to-image generation, editing, inpainting, colorization, and conditional generation. This work marks a significant stride towards realizing truly unified MLLMs capable of seamlessly adapting to the diverse granularity demands and task requirements inherent in various visual tasks. The code and model will be released upon acceptance.
Poster
Sagi Polaczek · Yuval Alaluf · Elad Richardson · Yael Vinker · Daniel Cohen-Or
[ Exhibit Hall I ]
Abstract
Vector graphics are essential in design, providing artists with a versatile medium for creating resolution-independent and highly editable visual content. Recent advancements in vision-language and diffusion models have fueled interest in text-to-vector graphics generation. However, existing approaches often suffer from over-parameterized outputs or treat the layered structure — a core feature of vector graphics — as a secondary goal, diminishing their practical use. Recognizing the importance of layered SVG representations, we propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small MLP network, optimized using Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. We additionally demonstrate that utilizing a neural representation provides an added benefit of inference-time control, enabling users to dynamically adapt the generated SVG based on user-provided inputs, all with a single learned representation. Through extensive qualitative and quantitative evaluations, we demonstrate that NeuralSVG outperforms existing methods in generating structured and flexible SVG.
Poster
Khaled Abud · Sergey Lavrushkin · Alexey Kirillov · Dmitriy Vatolin
[ Exhibit Hall I ]
Abstract
Diffusion-based models have recently revolutionized image generation, achieving unprecedented levels of fidelity. However, consistent generation of high-quality images remains challenging partly due to the lack of conditioning mechanisms for perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. We show that diffusion models can learn complex qualitative relationships from both IQA models’ outputs and internal activations. First, we experiment with gradient-based guidance to optimize image quality directly and show this method has limited generalizability. To address this, we introduce IQA-Adapter, a novel framework that conditions generation on target quality levels by learning the implicit relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter can shift the distribution of generated images towards a higher-quality subdomain, and, inversely, it can be used as a degradation model, generating progressively more distorted images when provided with a lower-quality signal. Under high-quality condition, IQA-Adapter achieves up to a 10\% improvement across multiple objective metrics, as confirmed by a user preference study, while preserving generative diversity and content. Furthermore, we extend IQA-Adapter to a reference-based conditioning scenario, utilizing the rich activation space of IQA models to transfer highly specific, …
Poster
Jiajun Luo · Lizhuo Luo · Jianru Xu · Jiajun Song · Rongwei Lu · Chen Tang · Zhi Wang
[ Exhibit Hall I ]
Abstract
Mixture-of-Experts-based (MoE-based) diffusion models demonstrate remarkable scalability in high-fidelity image generation, yet their reliance on expert parallelism introduces critical communication bottlenecks. State-of-the-art methods alleviate such overhead in parallel diffusion inference through computation-communication overlapping, termed displaced parallelism. However, we identify that these techniques inherently induce severe *staleness*, i.e., the utilization of outdated activations from previous timesteps, which significantly degrades quality, especially in expert-parallel scenarios. We tackle this fundamental tension and propose DICE, a staleness-centric optimization framework with a three-fold approach: (1) Interweaved Parallelism introduces staggered pipelines, effectively halving step-level staleness for free; (2) Selective Synchronization operates at the layer level and protects critical layers that are vulnerable to stale activations; and (3) Conditional Communication, a token-level, training-free method, dynamically adjusts communication frequency based on token importance. Together, these strategies effectively reduce staleness, achieving a 1.26x speedup with minimal quality degradation. Empirical results establish DICE as an effective and scalable solution. Our code is publicly available at https://anonymous.4open.science/r/DICE-FF04
Poster
Jiwon Kim · Pureum Kim · SeonHwa Kim · Soobin Park · Eunju Cha · Kyong Hwan Jin
[ Exhibit Hall I ]
Abstract
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refine the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation, even for class-invariant structure-appearance fusion such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations.
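The sketch below shows one hedged way a dual feedback update on an intermediate latent could be written: an appearance-driven correction followed by a generation-intent correction, repeated a few times. The feature extractors, losses, and step sizes are hypothetical stand-ins, not the DRF formulation itself.

```python
import torch
import torch.nn.functional as F

def dual_recursive_feedback(z_t, appearance_target, structure_target,
                            encode_appearance, encode_structure,
                            lr_app=0.1, lr_gen=0.1, iters=2):
    """Toy dual feedback refinement of an intermediate latent z_t.

    `encode_appearance` and `encode_structure` are assumed differentiable
    feature extractors; each loop applies an appearance feedback step and then
    a generation feedback step, both as simple gradient nudges at inference time.
    """
    z = z_t.clone()
    for _ in range(iters):
        z = z.detach().requires_grad_(True)
        app_loss = F.mse_loss(encode_appearance(z), appearance_target)
        g_app, = torch.autograd.grad(app_loss, z)
        z = z - lr_app * g_app                      # appearance feedback

        z = z.detach().requires_grad_(True)
        gen_loss = F.mse_loss(encode_structure(z), structure_target)
        g_gen, = torch.autograd.grad(gen_loss, z)
        z = z - lr_gen * g_gen                      # generation (intent) feedback
    return z.detach()
```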
Poster
Zelin Li · Ruohan Zong · Yifan Liu · Ruichen Yao · Yaokun Liu · Yang Zhang · Dong Wang
[ Exhibit Hall I ]
Abstract
With the advancement of personalized image generation technologies, concerns about forgery attacks that infringe on portrait rights and privacy are growing. To address these concerns, protection perturbation algorithms have been developed to disrupt forgery generation. However, these protection algorithms become ineffective when forgery attackers apply purification techniques to bypass the protection. To address this issue, we present a novel approach, $\textbf{Anti-Tamper Perturbation (ATP)}$. ATP introduces a tamper-proofing mechanism within the perturbation. It consists of $\textit{protection}$ and $\textit{authorization}$ perturbations, where the protection perturbation defends against forgery attacks, while the authorization perturbation detects purification-based tampering. Both protection and authorization perturbations are applied in the frequency domain under the guidance of a mask, ensuring that the protection perturbation does not disrupt the authorization perturbation. This design also enables the authorization perturbation to be distributed across all image pixels, preserving its sensitivity to purification-based tampering. ATP demonstrates its effectiveness in defending against forgery attacks across various attack settings through extensive experiments, providing a robust solution for protecting individuals' portrait rights and privacy.
Poster
Yu-Chien Liao · Jr-Jen Chen · Chi-Pin Huang · Ci-Siang Lin · Meng-Lin Wu · Yu-Chiang Frank Wang
[ Exhibit Hall I ]
Abstract
Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of $\textbf{C}$oncept $\textbf{N}$euron $\textbf{S}$election, a simple yet effective approach to perform personalization in a continual learning scheme. $\textbf{CNS}$ uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, $\textbf{CNS}$ finetunes concept neurons in an incremental manner and jointly preserves knowledge learned from previous concepts. Evaluations on real-world datasets demonstrate that $\textbf{CNS}$ achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single- and multi-concept personalization tasks. $\textbf{CNS}$ also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.
Poster
Yanbing Zhang · Zhe Wang · Qin Zhou · Mengping Yang
[ Exhibit Hall I ]
Abstract
In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT's capabilities through three key innovations: 1) We introduce a pivotal attention-sharing mechanism that captures the subject's layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT's dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments show that our method successfully unlocks DiT's zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework …
Poster
Zhongyu Yang · Jun Chen · Dannong Xu · Junjie Fei · Xiaoqian Shen · Liangbing Zhao · Chun-Mei Feng · Mohamed Elhoseiny
[ Exhibit Hall I ]
Abstract
Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8\%-29\% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. We show some of our generated examples in \url{https://anonymous.4open.science/r/WikiAutoGen-C3C4}
Poster
Haoxuan Wang · Yuzhang Shang · Rui Xie · Junyi Wu · Junchi Yan · Yan Yan
[ Exhibit Hall I ]
Abstract
The practical deployment of diffusion models is still hindered by the high memory and computational overhead. Although quantization paves a way for model compression and acceleration, existing methods face challenges in achieving low-bit quantization efficiently. In this paper, we identify imbalanced activation distributions as a primary source of quantization difficulty, and propose to adjust these distributions through weight finetuning to be more quantization-friendly. We provide both theoretical and empirical evidence supporting finetuning as a practical and reliable solution. Building on this approach, we further distinguish two critical types of quantized layers: those responsible for retaining essential temporal information and those particularly sensitive to bit-width reduction. By selectively finetuning these layers under both local and global supervision, we mitigate performance degradation while enhancing quantization efficiency. Our method demonstrates its efficacy across three high-resolution image generation tasks, obtaining state-of-the-art performance across multiple bit-width settings.
Poster
Feihong Yan · qingyan wei · Jiayi Tang · Jiajun Li · Yulin Wang · Xuming Hu · Huiqi Li · Linfeng Zhang
[ Exhibit Hall I ]
Abstract
Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy. Token Redundancy indicates that a large portion of tokens have very similar representations in adjacent decoding steps, which allows us to first cache them in previous steps and then reuse them in later steps. Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83× acceleration with almost no drop in generation quality. Our code has been included in the supplementary material and will be released on GitHub.
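To illustrate how the two redundancies could be exploited, here is a hedged caching sketch: tokens whose representations barely drifted since the last step are marked reusable, and the classifier-free-guidance difference is reused when it is stable. The thresholds and reuse rules are illustrative assumptions, not LazyMAR's exact criteria.

```python
import torch

class LazyCache:
    """Toy cache exploiting token and condition redundancy across decoding steps."""

    def __init__(self, tok_tol=1e-2, cond_tol=1e-2):
        self.prev_tokens, self.prev_cfg_delta = None, None
        self.tok_tol, self.cond_tol = tok_tol, cond_tol

    def reuse_mask(self, tokens):
        """Return a boolean mask of tokens (N, d) whose features may be reused."""
        if self.prev_tokens is None or self.prev_tokens.shape != tokens.shape:
            self.prev_tokens = tokens.clone()
            return torch.zeros(tokens.shape[0], dtype=torch.bool)
        drift = (tokens - self.prev_tokens).norm(dim=-1) / (tokens.norm(dim=-1) + 1e-8)
        self.prev_tokens = tokens.clone()
        return drift < self.tok_tol            # True = safe to reuse cached features

    def cfg_delta(self, cond_out, uncond_out):
        """Reuse the conditional/unconditional difference when it is nearly unchanged."""
        delta = cond_out - uncond_out
        if (self.prev_cfg_delta is not None
                and (delta - self.prev_cfg_delta).abs().mean() < self.cond_tol):
            delta = self.prev_cfg_delta
        self.prev_cfg_delta = delta.detach()
        return delta
```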
Poster
XIN Hu · Ke Qin · Guiduo Duan · Ming Li · Yuan-Fang Li · Tao He
[ Exhibit Hall I ]
Abstract
Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the ability of the denoising diffusion model's inversion process to preserve the spatial structure of input images, we propose SPADE (SPatial-Aware Denoising-nEtwork), a novel framework for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed-set and open-set scenarios, particularly excelling in spatial relationship prediction. The code is available at: https://anonymous.4open.science/r/SPADE-105F.
Poster
ZHANG YINGWEN · Meng Wang · Xihua Sheng · Peilin CHEN · Junru Li · Li Zhang · Shiqi Wang
[ Exhibit Hall I ]
Abstract
Lossy image compression networks aim to minimize the latent entropy of images while adhering to specific distortion constraints. However, optimizing the neural network can be challenging due to the need to learn quantized latent representations. In this paper, our key finding is that minimizing the latent entropy is, to some extent, equivalent to maximizing the conditional source entropy, an insight that is deeply rooted in information-theoretic equalities. Building on this insight, we propose a novel structural regularization method for the neural image compression task by incorporating the negative conditional source entropy into the training objective, such that both the optimization efficacy and the model's generalization ability can be promoted. The proposed information-theoretic regularizer is interpretable, plug-and-play, and imposes no inference overhead. Extensive experiments demonstrate its superiority in regularizing the models and further squeezing bits from the latent representation across various compression structures and unseen domains.
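As a rough sketch of how such a regularizer could be attached to a standard rate-distortion objective, the function below adds a negative conditional-source-entropy term, so minimizing the loss pushes that entropy up. The Gaussian plug-in estimate on the reconstruction residual (used as a stand-in for conditioning on the latent) and the weights are illustrative assumptions, not the paper's estimator.

```python
import math
import torch

def regularized_rd_loss(rate_bits, x, x_hat, lam=0.01, mu=1e-4):
    """Toy rate-distortion objective with an entropy-based regularizer.

    Loss = rate + lam * distortion - mu * H_est, where H_est crudely
    approximates the conditional source entropy via the residual variance.
    """
    distortion = torch.mean((x - x_hat) ** 2)
    residual_var = torch.var(x - x_hat) + 1e-9
    cond_entropy_est = 0.5 * torch.log(2 * math.pi * math.e * residual_var)  # nats (toy)
    return rate_bits + lam * distortion - mu * cond_entropy_est
```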
Poster
Runze Zhang · Guoguang Du · Xiaochuan Li · Qi Jia · Liang Jin · Lu Liu · Jingjing Wang · Cong Xu · Zhenhua Guo · Yaqian Zhao · Xiaoli Gong · Rengang Li · Baoyu Fan
[ Exhibit Hall I ]
Abstract
Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or their basic combination, such as appending a description of a camera movement after a prompt without constraining the outcomes of this movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, and the long-term impact of prior content on subsequent generation. Our research encompasses dataset construction through to the development of the model. Initially, we constructed a DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with an average caption of 206 words, detailing various camera movements and plot developments. Following this, we developed and trained the DropletVideo model, which excels in preserving spatio-temporal coherence during video generation. The DropletVideo …
Poster
Ce Wang · Zhenyu Hu · Wanjie Sun · Zhenzhong Chen
[ Exhibit Hall I ]
Abstract
Image rescaling aims to learn the optimal low-resolution (LR) image that can be accurately reconstructed to its original high-resolution (HR) counterpart, providing an efficient image processing and storage method for ultra-high-definition media. However, extreme downscaling factors pose significant challenges to the upscaling process due to its highly ill-posed nature, causing existing image rescaling methods to struggle in generating semantically correct structures and perceptually friendly textures. In this work, we propose a novel framework called Timestep-Aware Diffusion Model (TADM) for extreme image rescaling, which performs rescaling operations in the latent space of a pre-trained autoencoder and effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model. Specifically, TADM adopts a pseudo-invertible module to establish the bidirectional mapping between the latent features of the HR image and the target-sized LR image. Then, the rescaled latent features are enhanced by a pre-trained diffusion model to generate more faithful details. Considering the spatially non-uniform degradation caused by the rescaling operation, we propose a novel timestep alignment strategy, which can adaptively allocate the generative capacity of the diffusion model based on the quality of the reconstructed latent features. Extensive experiments demonstrate the superiority of TADM over previous methods in both quantitative …
Poster
Aoxiong Yin · Kai Shen · Yichong Leng · Xu Tan · Xinyu Zhou · Juncheng Li · Siliang Tang
[ Exhibit Hall I ]
Abstract
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://anonymoust2v.github.io/ .
Poster
Zhen Zhang · Zhen Zhang · Qianlong Dang · Zhize Wu · LiChuan Gu
[ Exhibit Hall I ]
Abstract
Single domain generalization aims to learn a model with good generalization ability from a single source domain. Recent advances in this field have focused on increasing the diversity of the training data through style (e.g., color and texture) augmentation. However, most existing methods apply uniform perturbations to the entire image, failing to simulate complex images with multiple distinct stylistic regions. To address this, we propose a ``Split-And-Combine" (SAC) strategy to enhance style diversity. Specifically, SAC first performs patch-aware augmentation, which splits an image into multiple patches and applies style augmentation independently to each patch, enabling distinct color variations across regions. Then, SAC combines these patches to reconstruct a complete image and applies adaptive random convolutions, which utilize a deformable convolution layer with random and Gaussian filters to enhance texture diversity while preserving object integrity. Notably, SAC leverages entropy as a risk assessment criterion to adaptively determine whether a sample should undergo augmentation within the iterative process of random convolutions, preventing excessive augmentation. Furthermore, SAC introduces an energy-based distribution discrepancy score to quantify out-of-distribution likelihood, systematically expanding the augmented data's distribution. SAC can serve as a plug-and-play component to improve the performance of recent methods. Extensive experiments on four datasets demonstrate …
Poster
Wonwoong Cho · Yan-Ying Chen · Matthew Klenk · David I. Inouye · Yanxia Zhang
[ Exhibit Hall I ]
Abstract
Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high-quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the **Attribute (Att) Adapter**, a novel plug-and-play module designed to enable fine-grained, multi-attribute control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross-attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce a Conditional Variational Autoencoder (CVAE) into the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
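The module below is a hedged sketch of decoupled cross-attention with an extra key/value branch for attribute tokens, added on top of the usual text branch. The dimensions, the additive fusion, and the single-head attention are simplifying assumptions for illustration, not the Att-Adapter's actual design.

```python
import torch
import torch.nn as nn

class DecoupledAttrCrossAttention(nn.Module):
    """Toy decoupled cross-attention mixing text tokens with attribute tokens."""

    def __init__(self, dim, ctx_dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv_text = nn.Linear(ctx_dim, 2 * dim)
        self.kv_attr = nn.Linear(ctx_dim, 2 * dim)   # extra branch for attribute tokens
        self.out = nn.Linear(dim, dim)

    def attend(self, q, kv):
        k, v = kv.chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x, text_tokens, attr_tokens, scale=1.0):
        q = self.q(x)
        h = self.attend(q, self.kv_text(text_tokens))            # standard text conditioning
        h = h + scale * self.attend(q, self.kv_attr(attr_tokens))  # additive attribute conditioning
        return self.out(h)
```

The attribute tokens would typically be produced from numeric attribute values by a small encoder (e.g., an MLP or, as in the abstract, a CVAE); that component is omitted here.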
Poster
Jiale Cheng · Ruiliang Lyu · Xiaotao Gu · Xiao Liu · Jiazheng Xu · Yida Lu · Jiayan Teng · Zhuoyi Yang · Yuxiao Dong · Jie Tang · Hongning Wang · Minlie Huang
[ Exhibit Hall I ]
Abstract
Video generation models have made remarkable progress in recent years, demonstrating outstanding performance in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to traditional prompt optimization methods, such …
Poster
Bimsara Pathiraja · Maitreya Patel · Shivam Singh · Yezhou Yang · Chitta Baral
[ Exhibit Hall I ]
Abstract
Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce **`RefEdit-Bench`**, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce **`RefEdit`** -- an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our **`RefEdit`**, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of data points. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We will release our code, data, and checkpoints.
Poster
Shufan Li · Konstantinos Kallidromitis · Akash Gokul · Arsh Koneru · Yusuke Kato · Kazuki Kozuka · Aditya Grover
[ Exhibit Hall I ]
Abstract
The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using …
Poster
Wen Qian
[ Exhibit Hall I ]
Abstract
Diffusion techniques have significantly advanced the development of virtual try-on. However, these methods often struggle to preserve intricate details such as patterns, text, and faces. To tackle this challenge, we introduce a plug-and-play module named "TryOn-Refiner", which can refine detail artifacts in any try-on result in only $1\sim10$ steps. Unlike previous diffusion-based refinement modules, TryOn-Refiner employs a conditional rectified-flow-based mechanism to better leverage prior information from coarse try-on results. Specifically, TryOn-Refiner transforms the traditional refinement framework from a noise-to-image paradigm into a flow mapping framework that directly maps coarse images to refined images, essentially avoiding the introduction of uncertainty in the refinement process. Moreover, we propose a training data construction pipeline, which can efficiently generate paired training data and includes a data smoothing strategy to overcome blocking artifacts. Extensive experimental results demonstrate that TryOn-Refiner consistently improves performance with only a few inference steps across all evaluated try-on methods.
Poster
Jianhong Bai · Menghan Xia · Xiao Fu · Xintao Wang · Lianrui Mu · Jinwen Cao · Zuozhu Liu · Haoji Hu · Xiang Bai · Pengfei Wan · Di ZHANG
[ Exhibit Hall I ]
Abstract
Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through an elegant yet powerful video conditioning mechanism—an aspect often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset will be publicly available.
Poster
Xianglong He · Zi-Xin Zou · Chia Hao Chen · Yuan-Chen Guo · Ding Liang · Chun Yuan · Wanli Ouyang · Yanpei Cao · Yangguang Li
[ Exhibit Hall I ]
Abstract
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024^3 directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation …
Poster
Aniket Rege · Zinnia Nie · Unmesh Raskar · Mahesh Ramesh · Zhuoran Yu · Aditya Kusupati · Yong Jae Lee · Ramya Vinayak
[ Exhibit Hall I ]
Abstract
Popular text-to-image (T2I) models are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to text-to-image systems as a proxy for human judgments. Our CuRe dataset has a novel categorical hierarchy that enables benchmarking T2I systems in this manner, with 32 cultural subcategories across six broad cultural axes (food, art, fashion, architecture, celebrations, and people), built from the crowdsourced Wikimedia knowledge graph. Unlike existing benchmarks, which suffer from ``generative entanglement'' due to overlapping training and evaluation data, CuRe enables fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP2, AIMv2 and DINOv2), image-text models (CLIP, SigLIP) and state-of-the-art text-to-image systems including Stable Diffusion 3.5 Large and Flux.1. Code and the benchmark dataset are available at: \textbf{hidden for double blind}
Poster
Yu Cheng · Fajie Yuan
[ Exhibit Hall I ]
Abstract
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose \textbf{LeanVAE}, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE’s superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50× fewer FLOPs and 44× faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code will be made publicly available.
Poster
Felix Krause · Timy Phan · Ming Gui · Stefan A. Baumann · Vincent Tao Hu · Björn Ommer
[ Exhibit Hall I ]
Abstract
Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference, and personalization were quickly adopted by the community. However, training these models in the first place still remains very costly. While several recent approaches—including masking, distillation, and architectural modifications—have been proposed to improve training efficiency, each of these methods comes with its own tradeoffs: some achieve enhanced performance at the expense of increased computational cost. In contrast, this work aims to improve training efficiency as well as generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to common transformer-based models; it can also be applied to state-space models, and it achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces the computational cost and simultaneously boosts model performance on the standard benchmark ImageNet-256 in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to the best benchmark performance of DiT …
Poster
Zeyi Sun · Tong Wu · Pan Zhang · Yuhang Zang · Xiaoyi Dong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang
[ Exhibit Hall I ]
Abstract
Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D data with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates filtered multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering data and rewriting inaccurate captions. Leveraging this pipeline, we have generated large scale synthetic multi-view images with dense descriptive captions. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and view consistency.
Poster
Shiyu Qin · Jinpeng Wang · Yimin Zhou · Bin Chen · Tianci Luo · Baoyi An · Tao Dai · Shu-Tao Xia · Yaowei Wang
[ Exhibit Hall I ]
Abstract
Learned image compression (LIC) demonstrates superior rate-distortion (RD) performance compared to traditional methods. The recent method MambaVC introduces Mamba, a variant of state space models, into this field, aiming to establish a new paradigm beyond convolutional neural networks and transformers. However, this approach relies on predefined four-directional scanning, which prioritizes spatial proximity over content and semantic relationships, resulting in suboptimal redundancy elimination. Additionally, it focuses solely on nonlinear transformations, neglecting entropy model improvements crucial for accurate probability estimation in entropy coding. To address these limitations, we propose Cassic, a novel framework based on a content-adaptive visual state space model, through two key innovations. First, we design a content-adaptive selective scan based on weighted activation maps and bit allocation maps, subsequently developing a content-adaptive visual state space block. Second, we present a Mamba-based channel-wise auto-regressive entropy model to fully leverage inter-slice bit allocation consistency for enhanced probability estimation. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across three datasets while maintaining faster processing speeds than the existing MambaVC approach.
Poster
Xuan Ju · Weicai Ye · Quande Liu · Qiulin Wang · Xintao Wang · Pengfei Wan · Di ZHANG · Kun Gai · Qiang Xu
[ Exhibit Hall I ]
Abstract
Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they suffer from three key limitations: **branch conflicts** between independently trained adapters, **parameter redundancy** leading to increased computational cost, and **suboptimal performance** compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions—including text, camera, identities, and depth—via full-attention mechanisms. By directly fusing multimodal conditions into a unified sequence representation, FullDiT significantly reduces parameter overhead, avoids conflicts common in adapter-based methods, and shows scalability and emergent ability. We further introduce FullBench, a new benchmark designed specifically for multi-condition video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of unified full-attention in complex multimodal video tasks.
Poster
Rishubh Parihar · Sachidanand VS · Venkatesh Babu Radhakrishnan
[ Exhibit Hall I ]
Abstract
Diffusion models have transformed image editing but struggle with precise depth-aware control, such as placing objects at a specified depth. Layered representations offer fine-grained control by decomposing an image into separate editable layers. However, existing methods simplistically represent a scene via a set of background and transparent foreground layers while ignoring the scene geometry - limiting their effectiveness for depth-aware editing. We propose \textbf{D}epth-\textbf{G}uided \textbf{L}ayer \textbf{D}ecomposition - a layering method that decomposes an image into foreground and background layers based on a \textbf{user-specified depth value}, enabling precise depth-aware edits. We further propose \textbf{F}eature \textbf{G}uided \textbf{L}ayer \textbf{C}ompositing - a zero-shot approach for realistic layer compositing by leveraging generative priors from pretrained diffusion models. Specifically, we guide the internal U-Net features to progressively fuse individual layers into a composite latent at each denoising step. This preserves the structure of individual layers while generating realistic outputs with appropriate color and lighting adjustments, without the need for post-hoc harmonization models. We demonstrate our method on two key depth-aware editing tasks: \textbf{1)} scene compositing by blending the foreground of one scene with the background of another at a specified depth; and \textbf{2)} object insertion at a user-defined depth. Our zero-shot approach achieves precise depth ordering …
Poster
Jaeseok Jeong · Junho Kim · Youngjung Uh · Gayoung Lee · Yunjey Choi
[ Exhibit Hall I ]
Abstract
In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and 2) propose negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs a negative score by intentionally simulating content-leakage scenarios, swapping the queries instead of the keys and values of self-attention layers from the visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as a visual style prompt. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references and ensuring that the resulting images match the text prompts.
Poster
Srikumar Sastry · Aayush Dhakal · Eric Xing · Subash Khanal · Nathan Jacobs
[ Exhibit Hall I ]
Abstract
Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models will be open-sourced.
Poster
Sucheng Ren · Qihang Yu · Ju He · Xiaohui Shen · Alan Yuille · Liang-Chieh (Jay) Chen
[ Exhibit Hall I ]
Abstract
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, …
Poster
Zijun Zhou · Yingying Deng · Xiangyu He · Weiming Dong · Fan Tang
[ Exhibit Hall I ]
Abstract
Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, which often struggles with ambiguous user intent, complex transformations, or the need for progressive refinements. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the layer-wise roles of transformers, we introduce an adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.
Poster
Xinbo Wang · Wenju Xu · Qing Zhang · Wei-Shi Zheng
[ Exhibit Hall I ]
Abstract
This paper presents a portrait style transfer method that generalizes well to various domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skin, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transformation to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from the AdaIN-Wavelet transformation, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model will be made publicly available.
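As a rough illustration of the frequency blend described above (a minimal sketch under assumed shapes, not the released implementation; the semantic warping, AdaIN statistics, and style adapter are omitted), the following combines the low-frequency wavelet band of a warped reference latent with the high-frequency bands of the input latent using PyWavelets:

```python
# A minimal sketch: keep the low-frequency band of the warped reference and the
# high-frequency bands of the input, then invert the wavelet transform.
import numpy as np
import pywt

def wavelet_blend(input_latent, warped_ref_latent, wavelet="haar"):
    """Both arguments: 2D arrays (one latent channel); apply per channel in practice."""
    _, (h_in, v_in, d_in) = pywt.dwt2(input_latent, wavelet)   # high-freq of the input
    low_ref, _ = pywt.dwt2(warped_ref_latent, wavelet)         # low-freq of the reference
    return pywt.idwt2((low_ref, (h_in, v_in, d_in)), wavelet)

# usage with illustrative 64x64 latent channels
x = np.random.randn(64, 64)
ref = np.random.randn(64, 64)
blended = wavelet_blend(x, ref)
```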
Poster
Zhu Xu · Ting Lei · Zhimin Li · Guan Wang · Qingchao Chen · Yuxin Peng · Yang Liu
[ Exhibit Hall I ]
Abstract
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced In-domain Knowledge Transferring (TIKT) method, which leverages in-domain knowledge to enhance detection in relation-aware dynamic scenarios. TIKT is built on two key components: (1) In-domain knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas, making the attention maps relation-aware. Then we propose an Inter-frame Attention Augmentation strategy that exploits neighboring frames and optical flow information to enhance these attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware in-domain knowledge for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention …
Poster
Yuran Dong · Mang Ye
[ Exhibit Hall I ]
Abstract
To advance real-world fashion image editing, we analyze existing two-stage pipelines—mask generation followed by diffusion-based editing—which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence→Stabilization→Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.
Poster
Baoyue Hu · Yang Wei · Junhao Xiao · Wendong Huang · Xiuli Bi · Bin Xiao
[ Exhibit Hall I ]
Abstract
To defend against personalized generation, a new form of infringement that is more concealed and destructive, existing copyright protection methods add adversarial perturbations to images. However, these methods focus solely on countering illegal personalization, neglecting the requirement for legitimate personalization. Moreover, none of these methods are capable of directly verifying and tracing the copyright from adversarial examples. In response to these limitations, we propose a traceable and authorizable copyright protection method that embeds the copyright watermark into images through a series of invertible compound coupling modules. We introduce a novel information exchange mechanism for invertible neural networks and design a contrastive learning-based optimization strategy tailored to address personalized infringement issues. Our method effectively mitigates the malicious use of unauthorized personalized generation models by inducing watermark-like artifacts and obscuring privacy details in generated images. Additionally, it facilitates copyright traceability and supports authorized legitimate personalization, thereby offering broader practical applicability. Experimental results demonstrate that our method can almost losslessly restore the original image and extract the copyright watermark, while achieving FID scores exceeding 300 and causing visually noticeable artifacts in unauthorized personalized images. Furthermore, it exhibits consistent robustness against adversarial purification and text prompt modifications.
Poster
Yuanhui Huang · Weiliang Chen · Wenzhao Zheng · Yueqi Duan · Jie Zhou · Jiwen Lu
[ Exhibit Hall I ]
Abstract
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters.
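The paper's nested spectral tokenizer is learned; purely as a toy illustration of the coarse-to-fine spectral ordering it describes, the sketch below splits an image's 2D spectrum into nested radial bands, so earlier "tokens" carry lower frequencies and later ones add finer detail (the band count and radius normalization are my own assumptions):

```python
# A toy sketch of ordering image content from low to high frequency.
import numpy as np

def spectral_bands(image, n_bands=8):
    """image: (H, W) array -> list of n_bands spectra, low to high frequency."""
    spec = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)   # 0 at DC, largest at corners
    edges = np.linspace(0, radius.max() + 1e-6, n_bands + 1)
    return [spec * ((radius >= lo) & (radius < hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def reconstruct(bands, upto):
    """Coarse-to-fine reconstruction from the first `upto` spectral bands."""
    return np.fft.ifft2(np.fft.ifftshift(sum(bands[:upto]))).real

# usage: each added band refines the previous, lower-frequency reconstruction
img = np.random.rand(64, 64)
bands = spectral_bands(img)
coarse, full = reconstruct(bands, 2), reconstruct(bands, 8)
```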
Poster
Jiacheng Liu · Chang Zou · Yuanhuiyi Lyu · Junjie Chen · Linfeng Zhang
[ Exhibit Hall I ]
Abstract
Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To address this, feature caching has been proposed to accelerate diffusion models by caching the features in previous timesteps and reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in the errors introduced by feature caching and significantly harming generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features at future timesteps with a Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves $3.41$ lower FID compared with the previous SOTA at $4.53\times$ acceleration. Our code is provided in the supplementary materials and will be made publicly available on GitHub.
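As a minimal sketch of the idea (not the authors' code; the cache layout, derivative order, and step size are illustrative assumptions), features cached at recent timesteps can be extrapolated to a future timestep by plugging finite-difference estimates of the derivatives into a Taylor expansion:

```python
# Approximate derivatives of a cached feature trajectory with backward finite
# differences and extrapolate the feature one step ahead, so the expensive
# block computation can be skipped at that step.
import torch

def taylor_forecast(cached_feats, step=1.0, order=2):
    """cached_feats: list of features at the most recent timesteps (oldest first).
    Returns the extrapolated feature `step` timesteps ahead of the last entry."""
    assert len(cached_feats) >= order + 1, "need order+1 cached features"
    f = list(cached_feats[-(order + 1):])
    pred = f[-1].clone()
    diffs = f
    factorial = 1.0
    for k in range(1, order + 1):
        # k-th order backward differences approximate the k-th derivative (unit spacing)
        diffs = [b - a for a, b in zip(diffs[:-1], diffs[1:])]
        factorial *= k
        pred = pred + (step ** k / factorial) * diffs[-1]
    return pred

# usage: forecast the next-step feature from three cached ones
feats = [torch.randn(1, 256, 1152) for _ in range(3)]
future = taylor_forecast(feats, step=1.0, order=2)
```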
Poster
Jiahao Zhu · Zixuan Chen · Guangcong Wang · Xiaohua Xie · Yi Zhou
[ Exhibit Hall I ]
Abstract
Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present \textbf{SegmentDreamer}, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, \textbf{SCTD} partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on the distillation error. Additionally, we propose a distillation pipeline for swifter and more stable generation. Extensive experiments demonstrate that our \textbf{SegmentDreamer} outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).
Poster
Ho Kei Cheng · Alex Schwing
[ Exhibit Hall I ]
Abstract
Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference, as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to subpar performance. To bridge this gap, we propose conditional optimal transport (C$^2$OT) that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions on 8gaussians$\to$moons, CIFAR-10, ImageNet-32$\times$32, and ImageNet-256$\times$256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code will be made available.
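One plausible reading of the conditional weighting term, shown as a minimal sketch (the penalty form, its weight, and the assumption that each noise sample carries the condition of the data sample it was originally drawn with are mine, not the paper's exact formulation):

```python
# Minibatch OT pairing with an extra condition term in the cost matrix: noise and
# data samples are matched by Hungarian assignment on squared distance plus a
# penalty that discourages pairing samples whose (discrete) conditions differ.
import numpy as np
from scipy.optimize import linear_sum_assignment

def conditional_ot_pairing(x0, x1, c0, c1, cond_weight=10.0):
    """x0: (B, D) prior/noise samples, x1: (B, D) data samples,
    c0, c1: (B,) discrete condition labels attached to each sample.
    cond_weight is a hypothetical scalar controlling the condition penalty."""
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)     # transport cost
    cost = cost + cond_weight * (c0[:, None] != c1[None, :])    # condition term
    rows, cols = linear_sum_assignment(cost)                    # OT assignment
    return x0[rows], x1[cols]

# usage inside a flow-matching training step (shapes are illustrative)
x0 = np.random.randn(64, 32)
x1 = np.random.randn(64, 32)
c = np.random.randint(0, 10, size=64)
x0_paired, x1_paired = conditional_ot_pairing(x0, x1, c, c)
```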
Poster
Fei Peng · Junqiang Wu · Yan Li · Tingting Gao · Di ZHANG · Huiyuan Fu
[ Exhibit Hall I ]
Abstract
Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and models will be made publicly available.
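As a minimal sketch of concatenated cross-attention (my own illustration; the projection layers, token formats, and the two-stage training are omitted), layout tokens can simply be appended to the text tokens so a single cross-attention call attends to both conditions:

```python
# Layout tokens are concatenated to text tokens along the sequence axis,
# expanding the conditioning sequence that the image tokens attend to.
import torch
import torch.nn as nn

class ConcatCrossAttention(nn.Module):
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_tokens, text_tokens, layout_tokens):
        # (B, L_txt + L_layout, D): a single conditioning sequence
        context = torch.cat([text_tokens, layout_tokens], dim=1)
        out, _ = self.attn(query=image_tokens, key=context, value=context)
        return out

# usage with illustrative shapes
cca = ConcatCrossAttention(dim=512)
img = torch.randn(2, 1024, 512)   # latent image tokens
txt = torch.randn(2, 77, 512)     # text embeddings
lay = torch.randn(2, 16, 512)     # layout / box embeddings
fused = cca(img, txt, lay)
```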
Poster
Wenjie Xuan · Jing Zhang · Juhua Liu · Bo Du · Dacheng Tao
[ Exhibit Hall I ]
Abstract
Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable text-to-image generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Code will be released.
Poster
Sunghyun Park · Seokeon Choi · Hyoungwoo Park · Sungrack Yun
[ Exhibit Hall I ]
Abstract
Personalizing text-to-image diffusion models is crucial for adapting the pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward a well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in the weak model through weight interpolation between the pre-trained and fine-tuned models during inference. Unlike existing guidance methods, which rely solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
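A minimal sketch of the weak-model construction and a CFG-style combination (the interpolation granularity, guidance formula, and function signatures are illustrative assumptions, not the authors' exact procedure):

```python
# Build a "weak" model by interpolating pre-trained and fine-tuned weights,
# then guide the fine-tuned, conditional prediction away from the weak,
# null-prompt prediction.
import copy
import torch

@torch.no_grad()
def make_weak_model(pretrained, finetuned, alpha=0.5):
    """alpha controls the extent of unlearning: 0 -> fine-tuned, 1 -> pre-trained."""
    weak = copy.deepcopy(finetuned)
    sd_pre, sd_ft = pretrained.state_dict(), finetuned.state_dict()
    weak.load_state_dict({k: alpha * sd_pre[k] + (1 - alpha) * sd_ft[k] for k in sd_ft})
    return weak

def guided_eps(finetuned, weak, x_t, t, cond_emb, null_emb, scale=5.0):
    # finetuned/weak are assumed denoisers callable as model(x_t, t, emb)
    eps_cond = finetuned(x_t, t, cond_emb)
    eps_weak = weak(x_t, t, null_emb)
    return eps_weak + scale * (eps_cond - eps_weak)

# usage of the interpolation with toy stand-ins for the denoiser weights
net = torch.nn.Linear(8, 8)
pre, ft = copy.deepcopy(net), net
weak = make_weak_model(pre, ft, alpha=0.3)
```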
Poster
Roi Benita · Michael Finkelson · Tavi Halperin · Gleb Sterkin · Yossi Adi
[ Exhibit Hall I ]
Abstract
Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advancements in automatic video-to-audio models have increased the demand for greater user control in the process. One possible approach is to incorporate text to guide audio generation. While supported by existing methods, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, besides its superior quality in terms of semantic alignment and audio-visual synchronization, the proposed method enables high textual controllability, as demonstrated in subjective and objective evaluations.
Poster
Ju-Hyeon Nam · Dong-Hyun Moon · Sang-Chul Lee
[ Exhibit Hall I ]
Abstract
Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map—a curvature metric indicating the difficulty of forgery localization—which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.
Poster
Jimyeong Kim · Jungwon Park · Yeji Song · Nojun Kwak · Wonjong Rhee
[ Exhibit Hall I ]
Abstract
Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage a mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.
Poster
YINWEI WU · Xianpan Zhou · bing ma · Xuefeng Su · Kai Ma · Xinchao Wang
[ Exhibit Hall I ]
Abstract
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the feature generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. To address this Instance Feature Generation (IFG) task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models’ abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
Poster
Qian Wang · Aleksandar Cvejic · Abdelrahman Eldesokey · Peter Wonka
[ Exhibit Hall I ]
Abstract
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
Poster
Nataniel Ruiz · Yuanzhen Li · Neal Wadhwa · Yael Pritch · Michael Rubinstein · David Jacobs · Shlomi Fruchter
[ Exhibit Hall I ]
Abstract
We present Magic Insert, a method to drag-and-drop subjects from a user-provided image into a target image of a different style in a plausible manner while matching the style of the target image. This work formalizes our version of the problem of style-aware drag-and-drop and proposes to tackle it by decomposing it into two sub-problems: style-aware personalization and realistic object insertion in stylized images. For style-aware personalization, we cast our method as a weight-and-text-embedding finetuning method with inference-time module-targeted style injection. For subject insertion, we propose Bootstrapped Domain Adaption (BDA) to adapt a domain-specific photorealistic object insertion model to the domain of diverse artistic styles. Overall, the method significantly outperforms traditional and state-of-the-art approaches that struggle with quality, subject fidelity and harmonious stylization. Finally, we present a new dataset, SubjectPlop, to facilitate evaluation and future progress in this area.
Poster
yifei xia · Suhan Ling · Fangcheng Fu · Yujie Wang · Huixia Li · Xuefeng Xiao · Bin CUI
[ Exhibit Hall I ]
Abstract
Generating high-quality long videos with Diffusion Transformers (DiTs) faces significant latency due to computationally intensive attention mechanisms. For instance, generating an 8s 720p video (110K tokens) with HunyuanVideo requires around 600 PFLOPs, with attention computations consuming about 500 PFLOPs. To tackle this, we propose **AdaSpa**, the first **Dynamic Pattern** and **Online Precise Search** sparse attention method for DiTs. First, AdaSpa uses a blockified pattern to efficiently represent the hierarchical sparsity inherent in DiTs, significantly reducing attention complexity while preserving video fidelity. This is motivated by our observation that DiTs' sparsity exhibits hierarchical and blockified structures across modalities. Second, AdaSpa introduces Fused LSE-Cached Search with Head-Adaptive Block Sparse Attention for efficient online precise search and computation. This approach leverages the invariance of sparse patterns and LSE across denoising steps, allowing precise real-time identification of sparse patterns with minimal overhead. AdaSpa is an **adaptive, plug-and-play solution** that seamlessly integrates into existing DiT models without additional training or data profiling. Extensive experiments validate that AdaSpa significantly accelerates video generation from 1.59$\times$ to 2.04$\times$ while maintaining video quality, demonstrating strong effectiveness.
Poster
Rohit Gandikota · Zongze Wu · Richard Zhang · David Bau · Eli Shechtman · Nicholas Kolkin
[ Exhibit Hall I ]
Abstract
We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of the model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines.
Poster
Yi Huang · Wei Xiong · He Zhang · Chaoqi Chen · Jianzhuang Liu · Mingfu Yan · Shifeng Chen
[ Exhibit Hall I ]
Abstract
Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment still remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject’s identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing.
Poster
Bin Fu · Zixuan Wang · Kainan Yan · Shitian Zhao · Qi Qin · Jie Wen · Junjun He · Peng Gao
[ Exhibit Hall I ]
Abstract
Few-shot font generation (FFG) aims to create new font images by imitating the style from a limited set of reference images, while maintaining the content from the source images. Although this task has achieved significant progress, most existing methods still suffer from the incorrect generation of complicated character structure and detailed font style. To address the above issues, in this paper, we regard font generation as a font transfer process from the source font to the target font, and construct a video generation framework to model this process. Moreover, a test-time condition alignment mechanism is further developed to enhance the consistency between the generated samples and the provided condition samples. Specifically, we first construct a diffusion-based image-to-image font generation framework for the few-shot font generation task. This framework is expanded into an image-to-video font generation framework by integrating temporal components and frame-index information, enabling the production of high-quality font videos that transition from the source font to the target font. Based on this framework, we develop a noise inversion mechanism in the generative process to perform content and style alignment between the generated samples and the provided condition samples, enhancing style consistency and structural accuracy. The experimental results show that …
Poster
Jeongho Kim · Hoiyeong Jin · Sunghyun Park · Jaegul Choo
[ Exhibit Hall I ]
Abstract
Recent virtual try-on approaches have advanced by fine-tuning pre-trained text-to-image diffusion models to leverage their powerful generative ability; however, the use of text prompts in virtual try-on remains underexplored. This paper tackles a text-editable virtual try-on task that modifies the clothing based on the provided clothing image while editing the wearing style (e.g., tucking style, fit) according to the text descriptions. In the text-editable virtual try-on, three key aspects exist: (i) designing rich text descriptions for paired person-clothing data to train the model, (ii) addressing the conflicts where textual information of the existing person's clothing interferes with the generation of the new clothing, and (iii) adaptively adjusting the inpainting mask aligned with the text descriptions, ensuring proper editing areas while preserving the original person's appearance irrelevant to the new clothing. To address these aspects, we propose PromptDresser, a text-editable virtual try-on model that leverages large multimodal model (LMM) assistance to enable high-quality and versatile manipulation based on generative text prompts. Our approach utilizes LMMs via in-context learning to generate detailed text descriptions for person and clothing images independently, including pose details and editing attributes, with minimal human cost. Moreover, to ensure the editing areas, we adjust the inpainting mask depending on …
Poster
Fengyuan Shi · Zhuoyan Luo · Yixiao Ge · Yujiu Yang · Ying Shan · Limin Wang
[ Exhibit Hall I ]
Abstract
Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on reconstruction and the application of autoregressive visual generation.
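A minimal sketch of index backpropagation through a one-hot code assignment (dot-product logits and shapes are my assumptions; the released implementation may differ): the hard one-hot selection is used in the forward pass, while gradients flow through the softmax to every codebook entry and the encoder via a straight-through estimator.

```python
import torch
import torch.nn.functional as F

def ibq_quantize(z, codebook):
    """z: (B, D) encoder features, codebook: (K, D) embedding table."""
    logits = z @ codebook.t()                          # (B, K) similarity logits
    soft = F.softmax(logits, dim=-1)                   # differentiable w.r.t. all codes
    hard = F.one_hot(soft.argmax(dim=-1), codebook.shape[0]).type_as(soft)
    one_hot = hard + soft - soft.detach()              # straight-through one-hot
    z_q = one_hot @ codebook                           # quantized feature
    return z_q, one_hot.argmax(dim=-1)

# usage with illustrative sizes
codebook = torch.nn.Parameter(torch.randn(1024, 256))
z = torch.randn(8, 256, requires_grad=True)
z_q, indices = ibq_quantize(z, codebook)
z_q.sum().backward()                                   # gradients reach the codebook and z
```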
Poster
JiaKui Hu · Zhengjian Yao · Lujia Jin · Hangzhou He · Yanye Lu
[ Exhibit Hall I ]
Abstract
Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
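As a rough, single-head 1D illustration of the slide-indexing strategy (window size and scaling are assumptions; the adaptive key-value selection and global branch are omitted), sliding-window attention keeps each query's receptive field at a fixed relative position, so responses translate with the input up to border padding:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=7):
    """q, k, v: (B, L, D); window: odd window length."""
    B, L, D = q.shape
    pad = window // 2
    k_pad = F.pad(k, (0, 0, pad, pad))                 # pad along the sequence axis
    v_pad = F.pad(v, (0, 0, pad, pad))
    k_win = k_pad.unfold(1, window, 1)                 # (B, L, D, window)
    v_win = v_pad.unfold(1, window, 1)
    attn = torch.einsum("bld,bldw->blw", q, k_win) / D ** 0.5
    attn = attn.softmax(dim=-1)
    return torch.einsum("blw,bldw->bld", attn, v_win)

# usage with illustrative shapes
q = k = v = torch.randn(2, 128, 64)
out = sliding_window_attention(q, k, v)
```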
Poster
Mengyu Wang · Henghui Ding · Jianing Peng · Yao Zhao · Yunpeng Chen · Yunchao Wei
[ Exhibit Hall I ]
Abstract
In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for the text-to-image DiT model. Its ability to maintain fine-grained consistency, combined with the larger capacity of the latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios.
Poster
Jiahao Wang · Ning Kang · Lewei Yao · Mengzhao Chen · Chengyue Wu · Songyang Zhang · Shuchen Xue · Yong Liu · Taiqiang Wu · Xihui Liu · Kaipeng Zhang · Shifeng Zhang · Wenqi Shao · Zhenguo Li · Ping Luo
[ Exhibit Hall I ]
Abstract
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, given its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies. Our core contributions include 5 practical guidelines: 1) Applying depth-wise convolution within simple linear attention is sufficient for image generation. 2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency. 3) Inheriting weights from a fully converged, pre-trained DiT. 4) Loading all parameters except those related to linear attention. 5) Hybrid knowledge distillation: using a pre-trained teacher DiT to help the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), which serves as a safe and efficient alternative baseline for DiT with pure linear attention. In class-conditional 256×256 and 512×512 ImageNet generation, LiT can be quickly adapted from DiT using only 20% and 33% of DiT’s training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize …
Poster
Jiayi Guo · Chuanhao Yan · Xingqian Xu · Yulin Wang · Kai Wang · Gao Huang · Humphrey Shi
[ Exhibit Hall I ]
Abstract
Ensuring precise alignments between diffusion-generated images and input prompts is a long-term challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be open-sourced.
Poster
Paschalis Giakoumoglou · Dimitrios Karageorgiou · Symeon Papadopoulos · Panagiotis Petrantonakis
[ Exhibit Hall I ]
Abstract
Recent advancements in generative AI have made text-guided image inpainting (adding, removing, or altering image regions using textual prompts) widely accessible. However, generating semantically correct photorealistic imagery typically requires carefully crafted prompts and iterative refinement by evaluating the realism of the generated content, tasks commonly performed by humans. To automate the generative process, we propose Semantically Aligned and Uncertainty Guided AI Image Inpainting (SAGI), a model-agnostic pipeline that samples prompts from a distribution closely aligned with human perception and evaluates the generated content, discarding content that deviates from such a distribution, which we approximate using pretrained Large Language Models and Vision-Language Models. By applying this pipeline on multiple state-of-the-art inpainting models, we create the SAGI Dataset (SAGI-D), currently the largest and most diverse dataset of AI-generated inpaintings, comprising over 95k inpainted images and a human-evaluated subset. Our experiments show that semantic alignment significantly improves image quality and aesthetics, while uncertainty guidance effectively identifies realistic manipulations: human ability to identify inpainted images from real ones drops from 74\% to 35\% in terms of accuracy after applying our pipeline. Moreover, using SAGI-D for training several image forensic approaches increases in-domain detection performance on average by 37.4\% and out-of-domain generalization by …
Poster
Ao Ma · Jiasong Feng · Ke Cao · Jing Wang · Yun Wang · Quanwei Zhang · Zhanjie Zhang
[ Exhibit Hall I ]
Abstract
Storytelling tasks that involve generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject’s position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Toggable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion …
Poster
Zhenyu Yan · Jian Wang · Aoqiang Wang · Yuhan Li · Wenxiang Shang · Zhu Hangcheng
[ Exhibit Hall I ]
Abstract
In image editing tasks, high-quality text editing capabilities can significantly reduce both human and material resource costs. Existing methods, however, face significant limitations in terms of stroke accuracy for complex text and controllability of generated text styles. To address these challenges, we propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions, while ensuring proper layout and controllable text style. Our approach incorporates adaptive standard letter spacing as guidance during training and employs adaptive mask boosting to prevent the leakage of text position and size information. By leveraging an attention mechanism to compute the intermediate layer bounding box regression loss for each character, our method enables the learning of text layout across diverse contexts. Additionally, we enhance text rendering accuracy and fidelity by injecting high-resolution standard font information and applying perceptual loss within the text editing region. Through a novel style injection technique, we achieve controllable style transfer for the injected text. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method.
Poster
Alakh Desai · Nuno Vasconcelos
[ Exhibit Hall I ]
Abstract
Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well-known to vary substantially regarding both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that can maximize the odds of the positive prompt. In this work, we explore relations between negative prompting and classifier-free guidance (*CFG*) to develop a sampling procedure, *Adaptive Negative Sampling Without External Resources* (*ANSWER*), that accounts for both positive and negative conditions from a single prompt. This leverages the internal understanding of negation by the diffusion model to increase the odds of generating images faithful to the prompt. *ANSWER* is a training-free technique, applicable to any model that supports *CFG*, and allows for negative grounding of image concepts without explicit negative prompts, which are lossy and incomplete. Experiments show that adding *ANSWER* to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more often than the other methods.
Poster
Donald Shenaj · Ondrej Bohdal · Mete Ozay · Pietro Zanuttigh · Umberto Michieli
[ Exhibit Hall I ]
Abstract
Recent advancements in image generation models have enabled personalized image creation with both user-defined subjects (content) and styles. Prior works achieved personalization by merging corresponding low-rank adapters (LoRAs) through optimization-based methods, which are computationally demanding and unsuitable for real-time use on resource-constrained devices like smartphones. To address this, we introduce LoRA.rar, a method that not only improves image quality but also achieves a remarkable speedup of over $4000\times$ in the merging process. We collect a dataset of style and subject LoRAs and pre-train a hypernetwork on a diverse set of content-style LoRA pairs, learning an efficient merging strategy that generalizes to new, unseen content-style pairs, enabling fast, high-quality personalization. Moreover, we identify limitations in existing evaluation metrics for content-style quality and propose a new protocol using multimodal large language models (MLLMs) for more accurate assessment. Our method significantly outperforms the current state of the art in both content and style fidelity, as validated by MLLM assessments and human evaluations.
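A heavily simplified sketch of hypernetwork-based LoRA merging (the descriptor, network architecture, and per-layer coefficient scheme are my own assumptions, not the paper's design): a small network predicts mixing coefficients for a content/style LoRA pair in one forward pass, replacing per-pair optimization.

```python
import torch
import torch.nn as nn

class MergeHypernet(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    @staticmethod
    def summarize(A, B, feat_dim=64):
        # crude per-layer descriptor of a LoRA (top singular values of its update), assumed
        s = torch.linalg.svdvals(B @ A)[:feat_dim]
        return torch.nn.functional.pad(s, (0, feat_dim - s.numel()))

    def forward(self, content_lora, style_lora):
        merged = {}
        for name in content_lora:
            (Ac, Bc), (As, Bs) = content_lora[name], style_lora[name]
            desc = torch.cat([self.summarize(Ac, Bc), self.summarize(As, Bs)])
            wc, ws = self.net(desc).softmax(dim=-1)
            merged[name] = wc * (Bc @ Ac) + ws * (Bs @ As)   # merged weight delta
        return merged

# usage with one illustrative layer (rank 8, width 320); add the delta to the frozen base weight
hyper = MergeHypernet()
content = {"attn.to_q": (torch.randn(8, 320), torch.randn(320, 8))}
style = {"attn.to_q": (torch.randn(8, 320), torch.randn(320, 8))}
delta = hyper(content, style)["attn.to_q"]
```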
Poster
Andy Regensky · Marc Windsheimer · Fabian Brand · Andre Kaup
[ Exhibit Hall I ]
Abstract
Neural video codecs (NVCs) have seen fast-paced advancement in recent years and already perform close to state-of-the-art traditional video codecs like H.266/VVC. However, NVC investigations have so far focused on improving performance for classical perspective video leaving the increasingly important 360-degree video format unexplored. In this paper, we address this issue and present how existing NVCs can be optimized for 360-degree video while also improving performance on perspective video. As no suitable datasets for neural 360-degree video compression exist, we publish a large-scale 360-degree video dataset consisting of more than 6000 user generated 9-frame sequences with resolutions ranging from 0.5K to 8K. We propose a novel method for training data augmentation exploiting the spherical characteristics of 360-degree video that shows to be crucial for achieving maximum compression performance. An additional positional feature encoding further supports the NVC in dynamic bitrate allocation notably improving the performance for both 360-degree and perspective video. Overall, we achieve rate savings of almost 8% for 360-degree video and more than 3% for perspective video with minimal complexity overhead. The dataset is available at: {link will be provided upon acceptance}. Source code and pre-trained model weights are available at: {link will be provided upon acceptance}.
Poster
Chuanwei Huang · Zexi Jia · Hongyan Fei · Yeshuang Zhu · Zhiqiang Yuan · Ying Deng · Jiapei Zhang · Xiaoyue Duan · Jinchao Zhang · Jie Zhou
[ Exhibit Hall I ]
Abstract
With the rapid advancement of generative models, we can now create highly realistic images. This represents a significant technical breakthrough but also introduces new challenges for copyright protection. Previous methods for detecting copyright infringement in AI-generated images mainly depend on global similarity. However, real-world infringement often occurs only on certain attributes rather than being a global infringement. To address these challenges, we propose a novel Multi-aspect Copyright Infringement Detection (MCID) task, which encompasses various types of infringement, including content, style, structure, and intellectual property infringement. We further develop the Hybrid Infringement Detection Model (HIDM) to address the MCID task. By combining feature-based methods with VLMs, it enables the detection of various infringement types and provides interpretable results. To ensure the MCID task meets actual legal requirements, we construct a Large-Scale Copyright Dataset (LSCD) with clear author copyright ownership. Based on LSCD, we provide a benchmark annotated by legal experts for performance evaluation. Experimental results show that HIDM effectively detects various types of image copyright infringement and offers a more interpretable and superior solution compared to previous methods.
Poster
Yuanhao Zhai · Yen-Liang Lin · Minxu Peng · Larry Davis · Ashwin Chandramouli · Junsong Yuan · David Doermann
[ Exhibit Hall I ]
Abstract
Existing outfit recommendation frameworks mainly focus on outfit compatibility prediction and complementary item retrieval. However, the outfit items are predicted by the pre-trained model and cannot be controlled by the text prompt. We present a text-driven outfit generation framework, Text2Outfit, which generates outfits controlled by the text prompt. Our framework supports two forms of outfit recommendation: 1) text-to-outfit generation, which retrieves the outfits given the prompt, where the prompt includes the specification of the entire outfit (e.g., occasion or season) and the individual outfit items (e.g., product feature), and 2) seed-to-outfit generation, which additionally uses a seed item (image or item descriptions) as input and retrieves items to build outfits. We train a large language model (LLM) framework to predict a set of embeddings to retrieve outfit items. We devise an attention masking mechanism in the LLM to handle the alignment between the outfit text descriptions in the prompt and the image tokens from different categories. We conducted experiments on the Polyvore dataset and evaluated outfit retrieval performance from two perspectives: 1) feature matching for outfit items and 2) outfit compatibility. The results show that our approach achieves significantly better performance than the baseline approaches for text to …
Poster
Hailing Wang · Jianglin Lu · Yitian Zhang · Yun Fu
[ Exhibit Hall I ]
Abstract
Quantization techniques, including quantization-aware training (QAT) and post-training quantization (PTQ), have become essential for inference acceleration of image super-resolution (SR) networks. Compared to QAT, PTQ has garnered significant attention as it eliminates the need for ground truth and model retraining. However, existing PTQ methods for SR often fail to achieve satisfactory performance as they overlook the impact of outliers in activation. Our empirical analysis reveals that these prevalent activation outliers are strongly correlated with image color information, and directly removing them leads to significant performance degradation. Motivated by this, we propose a dual-region quantization strategy that partitions activations into an outlier region and a dense region, applying uniform quantization to each region independently to better balance bit-width allocation. Furthermore, we observe that different network layers exhibit varying sensitivities to quantization, leading to different levels of performance degradation. To address this, we introduce sensitivity-aware finetuning that encourages the model to focus more on highly sensitive layers, further enhancing quantization performance. Extensive experiments demonstrate that our method outperforms existing PTQ approaches across various SR networks and datasets, while achieving performance comparable to QAT methods in most scenarios with at least a 75 $\times$ speedup.
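A minimal sketch of the dual-region idea described above: activations are split into an outlier region and a dense region by magnitude, and each region receives its own uniform quantizer. The percentile threshold and NumPy implementation are illustrative assumptions, not the paper's calibration procedure.

```python
import numpy as np

def dual_region_quantize(x: np.ndarray, bits: int = 8, outlier_pct: float = 99.0):
    """Toy dual-region uniform quantization: split activations into a dense
    region and an outlier region by magnitude, and quantize each region with
    its own uniform step size. The percentile threshold is an illustrative
    choice, not the paper's criterion."""
    thresh = np.percentile(np.abs(x), outlier_pct)
    outlier_mask = np.abs(x) > thresh
    levels = 2 ** bits - 1

    def uniform_q(v):
        if v.size == 0:
            return v
        lo, hi = v.min(), v.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        return np.round((v - lo) / scale) * scale + lo

    out = np.empty_like(x, dtype=np.float64)
    out[~outlier_mask] = uniform_q(x[~outlier_mask])
    out[outlier_mask] = uniform_q(x[outlier_mask])
    return out

# Synthetic activations with a small fraction of large-magnitude outliers.
acts = np.random.randn(4, 64, 32, 32) * np.random.choice(
    [1, 20], size=(4, 64, 32, 32), p=[0.99, 0.01])
q = dual_region_quantize(acts)
print("mean abs error:", np.abs(acts - q).mean())
```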
Poster
Junsong Chen · Shuchen Xue · Yuyang Zhao · Jincheng YU · Sayak Paul · Junyu Chen · Han Cai · Enze Xie · Song Han
[ Exhibit Hall I ]
Abstract
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: $\textbf{(1)}$ We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. $\textbf{(2)}$ SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. $\textbf{(3)}$ We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in just 1 step, outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024$\times$1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and …
Poster
Kasra Arabi · R. Teal Witter · Chinmay Hegde · Niv Cohen
[ Exhibit Hall I ]
Abstract
Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving …
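A small sketch of how a key pattern could be inferred from a semantic embedding with random-hyperplane locality-sensitive hashing, as the abstract describes: nearby embeddings map to (mostly) identical bits, so the key can be re-derived from the generated image at verification time. The embedding dimension, bit count, and hashing scheme are assumptions for illustration.

```python
import numpy as np

def semantic_key(embedding: np.ndarray, n_bits: int = 64, seed: int = 0) -> np.ndarray:
    """Derive a binary key pattern from a semantic image embedding via
    random-hyperplane locality-sensitive hashing."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, embedding.shape[-1]))
    return (hyperplanes @ embedding > 0).astype(np.uint8)

emb = np.random.default_rng(1).standard_normal(512)           # e.g. a CLIP-like embedding
noisy = emb + 0.05 * np.random.default_rng(2).standard_normal(512)
k1, k2 = semantic_key(emb), semantic_key(noisy)
print("matching bits:", int((k1 == k2).sum()), "/ 64")
```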
Poster
Giuseppe Cartella · Vittorio Cuculo · Alessandro D'Amelio · Marcella Cornia · Giuseppe Boccignone · Rita Cucchiara
[ Exhibit Hall I ]
Abstract
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models will be made publicly available.
Poster
Lorenzo Baraldi · Davide Bucciarelli · Federico Betti · Marcella Cornia · Lorenzo Baraldi · Nicu Sebe · Rita Cucchiara
[ Exhibit Hall I ]
Abstract
Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment. We will publicly release our source code, models, and data.
Poster
Haoxuan Li · Ziya Erkoç · Lei Li · Daniele Sirigatti · Vladislav Rosov · Angela Dai · Matthias Nießner
[ Exhibit Hall I ]
Abstract
We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-designed triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into ‘deletion’ of regions of a mesh, followed by ‘addition’ of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
Poster
Kwanyoung Kim · Byeongsu Sim
[ Exhibit Hall I ]
Abstract
Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. They also rely on heuristic approaches that require identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using the softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.
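One plausible reading of the inference-time extrapolation between softmax attention and a sparse counterpart is sketched below, with top-k masking standing in for the sparse operator and a free extrapolation scale; the paper's exact sparse transform and combination rule may differ.

```python
import torch

def extrapolated_cross_attention(q, k, v, top_k=16, lam=1.0):
    """Illustrative inference-time extrapolation between dense softmax attention
    and a sparse (top-k re-normalized) counterpart:
        A = A_sparse + lam * (A_sparse - A_dense)
    q: (B, H, N_q, d), k/v: (B, H, N_k, d). All choices here (top-k sparsity,
    the extrapolation form, lam) are assumptions for illustration."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5       # (B, H, Nq, Nk)
    dense = scores.softmax(dim=-1)

    k_eff = min(top_k, scores.shape[-1])
    topk = scores.topk(k_eff, dim=-1)
    masked = torch.full_like(scores, float("-inf")).scatter(-1, topk.indices, topk.values)
    sparse = masked.softmax(dim=-1)

    attn = sparse + lam * (sparse - dense)
    return attn @ v

q = torch.randn(1, 8, 256, 64)   # image-token queries
k = torch.randn(1, 8, 77, 64)    # text-token keys
v = torch.randn(1, 8, 77, 64)
print(extrapolated_cross_attention(q, k, v).shape)  # torch.Size([1, 8, 256, 64])
```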
Poster
Zongyu Lin · Wei Liu · Chen Chen · Jiasen Lu · Wenze Hu · Tsu-Jui Fu · Jesse Allardice · Zhengfeng Lai · Liangchen Song · Bowen Zhang · cha chen · Yiran Fei · Lezhi Li · Yizhou Sun · Kai-Wei Chang · Yinfei Yang
[ Exhibit Hall I ]
Abstract
We present a simple and scalable text- and image-conditioned video generation method. Our approach, named STIV, integrates a variable number of image conditions into a Diffusion Transformer (DiT) through frame replacement. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, as well as long video generation through autoregressive rollouts. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, and multi-view generation. With comprehensive ablation studies on T2I, T2V, TI2V, and long video generation, STIV demonstrates strong performance despite its simple design. An 8.7B model with \(512^2\) resolution achieves 83.1 on VBench T2V, surpassing leading open and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at \(512^2\) resolution. Combining all of these, we scale our model up to 540p with over 200 frames. By providing a transparent recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress in video generation.
Poster
Zonglin Lyu · Chen Chen
[ Exhibit Hall I ]
Abstract
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves \textbf{20}\% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having \textbf{3}$\times$ fewer parameters. Such a parameter reduction results in \textbf{2.3}$\times$ speed up. By incorporating optical flow guidance, our method requires \textbf{9000}$\times$ less training data and achieves over \textbf{20}$\times$ fewer parameters than video-based …
Poster
Yingjian Chen · Lei Zhang · Yakun Niu
[ Exhibit Hall I ]
Abstract
The rise of generative models has raised concerns about image authenticity online, highlighting the urgent need for a detector that is (1) highly generalizable, capable of handling unseen forgery techniques, and (2) data-efficient, achieving optimal performance with minimal training data, enabling it to counter newly emerging forgery techniques effectively. To achieve this, we propose $\textbf{\textit{ForgeLens}}$, a data-efficient, feature-guided framework that incorporates two lightweight designs to enable a frozen network to focus on forgery-specific features. First, we introduce the Weight-Shared Guidance Module (WSGM), which guides the extraction of forgery-specific features during training. Second, a forgery-aware feature integrator, FAFormer, is used to effectively integrate forgery information across multi-stage features. ForgeLens addresses a key limitation of previous frozen network-based methods, where general-purpose features extracted from large datasets often contain excessive forgery-irrelevant information. As a result, it achieves strong generalization and reaches optimal performance with minimal training data. Experimental results on 19 generative models, including both GANs and diffusion models, demonstrate improvements of 13.61\% in Avg.Acc and 8.69\% in Avg.AP over the base model. Notably, ForgeLens outperforms existing forgery detection methods, achieving state-of-the-art performance with just 1\% of the training data.
Poster
Daniel Winter · Asaf Shul · Matan Cohen · Dana Berman · Yael Pritch · Alex Rav-Acha · Yedid Hoshen
[ Exhibit Hall I ]
Abstract
This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using a single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Differently from many other multi-reference methods, ObjectMate does not require slow test-time tuning.
Poster
Yanran Zhang · Bingyao Yu · Yu Zheng · Wenzhao Zheng · Yueqi Duan · Lei Chen · Jie Zhou · Jiwen Lu
[ Exhibit Hall I ]
Abstract
The emergence of visual autoregressive (AR) models has revolutionized image generation, while presenting new challenges for synthetic image detection. Unlike previous GAN or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error ($\bf{D^3QE}$) for autoregressive-generated image detection, exploiting the distinctive patterns and the codebook frequency distribution bias that differ between real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features with quantization error latents. To evaluate our method, we construct a comprehensive dataset covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of $\bf{D^3QE}$ across different AR models, while maintaining robustness under various real-world perturbations.
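The quantization-error signal exploited here can be illustrated with a small sketch: continuous latents are matched to their nearest codebook entries, and both the residual and the codebook usage histogram are returned as candidate detection features. Shapes and the feature choice are assumptions, not the paper's architecture.

```python
import torch

def vq_quantization_error(latents: torch.Tensor, codebook: torch.Tensor):
    """Quantization error of continuous latents against a VQ codebook.

    latents:  (N, D) continuous features (e.g. from an image encoder)
    codebook: (K, D) learned code vectors of a visual AR model
    Returns per-token residuals and the codebook usage histogram; a detector
    in the spirit of the abstract could feed such statistics to a transformer."""
    d = torch.cdist(latents, codebook)          # (N, K) pairwise distances
    idx = d.argmin(dim=1)                       # nearest code per token
    residual = latents - codebook[idx]          # quantization error latents
    hist = torch.bincount(idx, minlength=codebook.shape[0]).float()
    hist = hist / hist.sum()                    # codebook frequency statistics
    return residual, hist

latents = torch.randn(256, 8)
codebook = torch.randn(1024, 8)
res, freq = vq_quantization_error(latents, codebook)
print(res.shape, freq.shape)  # torch.Size([256, 8]) torch.Size([1024])
```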
Poster
Huanpeng Chu · Wei Wu · Guanyu Feng · Yutao Zhang
[ Exhibit Hall I ]
Abstract
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DiT models. We systematically analyze the model's sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
Poster
Mahir Atmis · LEVENT KARACAN · Mehmet SARIGÜL
[ Exhibit Hall I ]
Abstract
Specular highlights, though valuable for human perception, are often undesirable in computer vision and graphics tasks as they can obscure surface details and affect analysis. Existing methods rely on multi-stage pipelines or multi-label datasets, making training difficult. In this study, we propose a one-step diffusion-based model for specular highlight removal, leveraging a pre-trained diffusion-based image generation model with an adaptation mechanism to enhance efficiency and adaptability. To further improve the adaptation process, we introduce ProbLoRA, a novel modification of Low-Rank Adaptation (LoRA), designed to adapt the diffusion model for highlight removal effectively. Our approach surpasses existing methods, achieving state-of-the-art performance in both quantitative metrics and visual quality. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of our method, highlighting its robustness and generalization capabilities.
Poster
Gopika Sudhakaran · Hikaru Shindo · Patrick Schramowski · Simone Schaub-Meyer · Kristian Kersting · Stefan Roth
[ Exhibit Hall I ]
Abstract
Visual relation detection (VRD) is the challenging task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it relies on handcrafted prompts and struggles with novel or complex relationships. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction-tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the detected relationships for segmenting complex scenes.
Poster
Zerui Tao · Yuhta Takida · Naoki Murata · Qibin Zhao · Yuki Mitsufuji
[ Exhibit Hall I ]
Abstract
Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and the desired fine-tuning weights prevents simultaneously achieving ultra parameter efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra parameter efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform-plus-residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable …
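A minimal sketch of the transform-plus-residual scheme on a single linear layer, with a plain low-rank residual standing in for the paper's tensor decompositions; shapes, initialization, and the side on which the transform is applied are assumptions.

```python
import torch
import torch.nn as nn

class TransformPlusResidualLinear(nn.Module):
    """Sketch of a 'transform + residual' adaptation: a frozen pre-trained
    weight W0 is composed with a learnable dense transform T (initialized to
    identity) plus a compact residual (here plain low-rank):
        W_adapted = W0 @ T + B @ A
    """
    def __init__(self, w0: torch.Tensor, rank: int = 4):
        super().__init__()
        out_f, in_f = w0.shape
        self.register_buffer("w0", w0)                      # frozen weight
        self.T = nn.Parameter(torch.eye(in_f))              # dense transform
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))     # residual starts at 0

    def forward(self, x):
        w = self.w0 @ self.T + self.B @ self.A
        return x @ w.t()

layer = TransformPlusResidualLinear(torch.randn(128, 64), rank=4)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 128])
```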
Poster
Jingyi Pan · Dan Xu · Qiong Luo
[ Exhibit Hall I ]
Abstract
Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view; 2) Appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) Geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. Our model and code will be made publicly available upon acceptance.
Poster
Saemi Moon · Minjong Lee · Sangdon Park · Dongwoo Kim
[ Exhibit Hall I ]
Abstract
As text-to-image diffusion models gain widespread commercial applications, there are increasing concerns about unethical or harmful use, including the unauthorized generation of copyrighted or sensitive content. Concept unlearning has emerged as a promising solution to these challenges by removing undesired and harmful information from the pre-trained model. However, previous evaluations primarily focus on whether target concepts are removed while preserving image quality, neglecting broader impacts such as unintended side effects. In this work, we propose the Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency. Our benchmark covers 33 target concepts, with 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria. By releasing our evaluation code and dataset, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.
Poster
Qing Lin · Jingfeng Zhang · YEW-SOON ONG · Mengmi Zhang
[ Exhibit Hall I ]
Abstract
Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. First, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce a new evaluation metric to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, model, and dataset will be made public.
Poster
Zehuan Huang · Yuan-Chen Guo · Haoran Wang · Ran Yi · Lizhuang Ma · Yanpei Cao · Lu Sheng
[ Exhibit Hall I ]
Abstract
Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to high computational costs and degradation in image quality due to scarce high-quality 3D data. This paper introduces MV-Adapter, an efficient and versatile adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. To efficiently model 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and a parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models while learning novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation and opens up new possibilities due to its efficiency, adaptability, and versatility.
Poster
Alessandro Conti · Massimiliano Mancini · Enrico Fini · Yiming Wang · Paolo Rota · Elisa Ricci
[ Exhibit Hall I ]
Abstract
Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them. Our evaluation suite will be made openly available, serving as a resource for future research.
Poster
Hanling Zhang · Rundong Su · Zhihang Yuan · Pengtao Chen · Mingzhu Shen · Yibo Fan · Shengen Yan · Guohao Dai · Yu Wang
[ Exhibit Hall I ]
Abstract
Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT’s attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68\% reduction in attention FLOPs on 2K image generation without compromising visual fidelity.
Poster
Rui Xie · Rui Xie · Yuzhang Shang · Hanling Zhang · Siyuan Wang · Shengen Yan · Guohao Dai · Yu Wang
[ Exhibit Hall I ]
Abstract
Diffusion Transformer (DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose DLFR-Gen, a training-free approach for Dynamic Latent Frame Rate Generation in Diffusion Transformers. DLFR-Gen adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that DLFR-Gen can achieve a speedup of up to 3 times for video generation with minimal quality degradation.
Poster
Ibtihel Amara · Ahmed Imtiaz Humayun · Ivana Kajic · Zarana Parekh · Natalie Harris · Sarah Young · Chirag Nagpal · Najoung Kim · Junfeng He · Cristina Vasconcelos · Deepak Ramachandran · Golnoosh Farnadi · Katherine Heller · Mohammad Havaei · Negar Rostamzadeh
[ Exhibit Hall I ]
Abstract
Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To enable a more comprehensive evaluation of concept erasure, we introduce EraseBench, a multidimensional framework designed to rigorously assess text-to-image models post-erasure. It encompasses over 100 diverse concepts, carefully curated seeded prompts to ensure reproducible image generation, and dedicated evaluation prompts for model-based assessment. Paired with a robust suite of evaluation metrics, our framework provides a holistic and in-depth analysis of concept erasure’s effectiveness and its long-term impact on model behaviour. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.
Poster
Revant Teotia · Candace Ross · Karen Ullrich · Sumit Chopra · Adriana Romero-Soriano · Melissa Hall · Matthew Muckley
[ Exhibit Hall I ]
Abstract
Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity (“Does” the model generate images with expected attributes?) and generalization capacity (“Can” the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding …
Poster
Anthony Bisulco · Rahul Ramesh · Randall Balestriero · Pratik Chaudhari
[ Exhibit Hall I ]
Abstract
Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning across factors such as masking ratio, patch size, number of encoder and decoder layers, as researchers use these methods for different applications. While prior theoretical work has analyzed MAEs through the lens of attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and the performance on downstream tasks is relatively unexplored. In this work, we investigate the perspective that "MAEs learn spatial correlations in the input image". We analytically derive the features learnt by a linear MAE and show that masking ratio and patch size can be used to select between features capturing short- and long-range spatial correlations. Extending this analysis to nonlinear MAEs, we show that learned representations in MAEs adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.
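The linear-MAE argument can be made concrete with a standard least-squares identity; the notation below is ours and only sketches the flavor of such an analysis, not the paper's exact derivation.

```latex
% Linear MAE sketch (notation is ours, not the paper's). Split each image x
% into visible patches x_V and masked patches x_M (assumed zero-mean) and fit
% a linear decoder W by least squares:
\begin{equation}
  \min_{W}\; \mathbb{E}\,\bigl\lVert x_M - W x_V \bigr\rVert_2^2
  \quad\Longrightarrow\quad
  W^{\star} = \Sigma_{MV}\,\Sigma_{VV}^{-1},
\end{equation}
% so the learned map is fully determined by the cross-covariance between masked
% and visible pixels. Larger masking ratios and patch sizes change which entries
% of \Sigma_{MV} dominate, shifting the solution toward longer-range spatial
% correlations, which is the trade-off the abstract refers to.
```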
Poster
Sunung Mun · Jinhwan Nam · Sunghyun Cho · Jungseul Ok
[ Exhibit Hall I ]
Abstract
Text-guided image editing with diffusion models enables flexible modifications, but editing multiple objects remains challenging due to unintended attribute interference, where edits affect non-target regions or mix attributes within the target areas. We identify the End-of-Sequence (EOS) token embeddings as a key factor in this issue, introducing global semantics that disrupt intended modifications. To address this, we propose Attribute-LEakage-free Editing (ALE-Edit), an approach that is both effective, by properly addressing EOS-induced interference, and efficient, as it requires no additional fine-tuning. ALE-Edit consists of: (1) Object-Restricted Embedding (ORE) to localize attributes, (2) Region-Guided Blending for Cross-Attention Masking (RGB-CAM) to align attention with target regions, and (3) Background Blending (BB) to preserve structural consistency. Additionally, we introduce ALE-Bench, a benchmark to quantify target-external and target-internal interference. Experiments show that ALE-Edit reduces unintended changes while maintaining high-quality edits, outperforming existing tuning-free methods. Our approach provides a scalable and computationally efficient solution for multi-object image editing.
Poster
Xiaolong Jin · Zixuan Weng · Hanxi Guo · Chenlong Yin · Siyuan Cheng · Guangyu Shen · Xiangyu Zhang
[ Exhibit Hall I ]
Abstract
Diffusion models are widely used in real-world applications, but ensuring their safety remains a major challenge. Despite many efforts to enhance the security of diffusion models, jailbreak and adversarial attacks can still bypass these defenses, generating harmful content. However, the lack of standardized evaluation makes it difficult to assess the robustness of diffusion model systems. To address this, we introduce JailbreakDiffBench, a comprehensive benchmark for systematically evaluating the safety of diffusion models against various attacks and under different defenses. Our benchmark includes a high-quality, human-annotated prompt and image dataset covering diverse attack scenarios. It consists of two key components: (1) an evaluation protocol to measure the effectiveness of moderation mechanisms and (2) an attack assessment module to benchmark adversarial jailbreak strategies. Through extensive experiments, we analyze existing filters and reveal critical weaknesses in current safety measures. JailbreakDiffBench is designed to support both text-to-image and text-to-video models, ensuring extensibility and reproducibility. The code is available at https://anonymous.4open.science/r/jailbreakdiffbench/
Poster
Tongkai Shi · Lianyu Hu · Fanhua Shang · Liqing Gao · Wei Feng
[ Exhibit Hall I ]
Abstract
Sign Language Video Generation (SLVG) aims to transform sign language sequences into natural and fluent sign language videos. Existing SLVG methods lack geometric modeling of human anatomical structures, leading to anatomically implausible and temporally inconsistent generation. To address these challenges, we propose a novel SLVG framework: Geometry-Aware Region Refinement (GReg). GReg uses 3D geometric information (such as normal maps and gradient maps) from the SMPL-X model to ensure anatomical and temporal consistency. To fully leverage the 3D geometric priors, we propose two novel methods: 1) Regional Prior Generation, which uses regional expert networks to generate target-structured regions as generation priors; 2) Gradient-Enhanced Refinement, which guides the refinement of detailed structures in key regions using gradient features. Furthermore, we enhance visual realism in key regions through adversarial training on both these regions and their gradient maps. Experimental results demonstrate that GReg achieves state-of-the-art performance with superior structural accuracy and temporal consistency.
Poster
Yaqing Ding · Viktor Kocur · VACLAV VAVRA · Zuzana Berger Haladova · jian Yang · Torsten Sattler · Zuzana Kukelova
[ Exhibit Hall I ]
Abstract
Recent advances in monocular depth estimation methods (MDE) and their improved accuracy open new possibilities for their applications. In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. In particular, we are interested in answering the question whether using MDEs improves results over traditional point-based methods. We propose a novel framework for estimating the relative pose of two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale or even both unknown scale and shift parameters, our solvers jointly estimate the scale or both the scale and shift parameters along with the relative pose. We derive efficient solvers considering different types of depths for three camera configurations: (1) calibrated cameras, (2) cameras with an unknown shared focal length, and (3) cameras with unknown different focal lengths. Our new solvers outperform state-of-the-art depth-aware solvers in terms of speed and accuracy. In extensive real experiments on multiple datasets and with various MDEs, we discuss which depth-aware solvers are preferable in which situation. The code will be made publicly available.
Poster
Chengtang Yao · Lidong Yu · Zhidan Liu · Jiaxi Zeng · Yuwei Wu · Yunde Jia
[ Exhibit Hall I ]
Abstract
The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from vision foundation models (VFMs) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Second, when the monocular feature is used in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the …
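A small sketch of a binary local ordering map computed from a monocular depth map: each pixel is compared against its neighbors, yielding a relative representation that is invariant to the unknown scale and shift of the prediction. The window size and encoding are illustrative assumptions.

```python
import numpy as np

def binary_local_ordering(depth: np.ndarray, radius: int = 1) -> np.ndarray:
    """Convert a (relative, affine-ambiguous) monocular depth map into a binary
    local ordering map: for each pixel and each neighbor offset within `radius`,
    record whether the neighbor is farther (1) or not (0). The ordering is
    invariant to unknown scale/shift, which is why such a representation can
    bridge relative and absolute depth."""
    h, w = depth.shape
    pad = np.pad(depth, radius, mode="edge")
    offsets = [(dy, dx) for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1) if (dy, dx) != (0, 0)]
    maps = []
    for dy, dx in offsets:
        shifted = pad[radius + dy: radius + dy + h, radius + dx: radius + dx + w]
        maps.append((shifted > depth).astype(np.uint8))
    return np.stack(maps, axis=0)               # (num_offsets, H, W)

depth = np.random.rand(48, 64).astype(np.float32)
print(binary_local_ordering(depth).shape)       # (8, 48, 64)
```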
Poster
Dale Decatur · Thibault Groueix · Wang Yifan · Rana Hanocka · Vladimir Kim · Matheus Gadelha
[ Exhibit Hall I ]
Abstract
Text-to-image diffusion models enable high-quality image generation but are computationally expensive, especially when producing large image collections. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across multiple correlated prompts. Our key insight leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free method that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip’s text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation.
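A schematic of the compute-sharing idea, assuming a toy denoiser stub and a simple cosine-similarity clustering: prompts in the same cluster share the early denoising steps and only branch for the remaining steps. Everything here (the clustering rule, the stub denoiser, the step split) is illustrative rather than the paper's pipeline.

```python
import numpy as np

def cosine_cluster(embeddings: np.ndarray, thresh: float = 0.9):
    """Greedy clustering of prompt embeddings by cosine similarity
    (a stand-in for whatever clustering is actually used)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centers, assign = [], []
    for v in e:
        sims = [float(v @ c) for c in centers]
        if sims and max(sims) >= thresh:
            assign.append(int(np.argmax(sims)))
        else:
            centers.append(v)
            assign.append(len(centers) - 1)
    return np.array(assign)

def denoise_step(latent, prompt_emb, t):
    """Placeholder for one diffusion denoising step (hypothetical stub)."""
    return latent - 0.01 * (latent - prompt_emb.mean()) / (t + 1)

def generate_with_shared_prefix(prompt_embs, total_steps=50, shared_steps=20, latent_dim=16):
    """Run the first `shared_steps` once per cluster (using the cluster's first
    prompt as a proxy), then branch and finish the remaining steps per prompt."""
    assign = cosine_cluster(prompt_embs)
    rng = np.random.default_rng(0)
    shared = {}
    for c in np.unique(assign):
        latent = rng.standard_normal(latent_dim)
        proxy = prompt_embs[np.flatnonzero(assign == c)[0]]
        for t in range(shared_steps):
            latent = denoise_step(latent, proxy, t)
        shared[c] = latent
    outputs = []
    for i, emb in enumerate(prompt_embs):
        latent = shared[assign[i]].copy()
        for t in range(shared_steps, total_steps):
            latent = denoise_step(latent, emb, t)
        outputs.append(latent)
    return np.stack(outputs)

prompts = np.random.default_rng(1).standard_normal((6, 32))
print(generate_with_shared_prefix(prompts).shape)  # (6, 16)
```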
Poster
Tiange Xiang · Kai Li · Chengjiang Long · Christian Häne · Peihong Guo · Scott Delp · Ehsan Adeli · Li Fei-Fei
[ Exhibit Hall I ]
Abstract
Text-to-image diffusion models have seen significant development recently due to increasing availability of paired 2D data. Although a similar trend is emerging in 3D generation, the limited availability of high-quality 3D data has resulted in less competitive 3D diffusion models compared to their 2D counterparts. In this work, we show how 2D diffusion models, originally trained for text-to-image generation, can be repurposed for 3D object generation. We introduce Gaussian Atlas, a representation of 3D Gaussians with dense 2D grids, which enables the fine-tuning of 2D diffusion models for generating 3D Gaussians. Our approach shows a successful transfer learning from a pretrained 2D diffusion model to 2D manifold flattend from 3D structures. To facilitate model training, a large-scale dataset, Gaussian Atlas, is compiled to comprise 205K high-quality 3D Gaussian fittings of a diverse array of 3D objects. Our experiment results indicate that text-to-image diffusion models can also serve as 3D content generators.
Poster
Junyu Xie · Tengda Han · Max Bain · Arsha Nagrani · Eshika Khandelwal · Gül Varol · Weidi Xie · Andrew Zisserman
[ Exhibit Hall I ]
Abstract
Our objective is automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighboring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among all prior training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure -- an action score -- specifically targeted to assessing this important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.
Poster
Haowei Kuang · Wenhan Yang · Zongming Guo · Jiaying Liu
[ Exhibit Hall I ]
Abstract
Learned image compression aims to reduce redundancy by accurately modeling the complex signal distribution inherent in images with network parameters. However, the existing practice of training models offline on an entire dataset faces a limitation: the estimated distribution only approximates the general image signal distribution and fails to capture image-specific characteristics. To address this issue, we propose a cross-granularity online optimization strategy that mitigates information loss from two key aspects: statistical distribution gaps and local structural gaps. This strategy introduces an additional fitted bitstream to push the estimated signal distribution closer to the real one at both coarse-grained and fine-grained levels. For coarse-grained optimization, we relax the common bitrate constraints during gradient descent and reduce bitrate cost via adaptive QP (Quantization Parameter) selection, preventing information collapse and narrowing the statistical distribution gaps. For fine-grained optimization, a Mask-based Selective Compensation Module is designed to sparsely encode structural characteristics at low bitrates, enhancing local distribution alignment. By jointly optimizing global and local distributions, our method achieves closer alignment to real image statistics and significantly enhances performance. Extensive experiments validate the superiority of our method as well as the design of our module. Our project will be publicly available.
Poster
Nupur Kumari · Xi Yin · Jun-Yan Zhu · Ishan Misra · Samaneh Azadi
[ Exhibit Hall I ]
Abstract
Customization of text-to-image models enables users to insert custom concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that conditions on reference images via a shared attention mechanism to better incorporate fine-grained visual details from reference images. Finally, we propose a new inference technique that normalizes text and image guidance vectors to mitigate overexposure issues during inference. Through extensive experiments, we show that our encoder-based model, trained on the synthetic dataset with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
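The abstract does not spell out how the text and image guidance vectors are normalized; the sketch below shows one simple possibility in the spirit of guidance rescaling, where the guided prediction's per-sample statistics are pulled back toward the conditional branch to curb overexposure. The functional form and constants are assumptions.

```python
import torch

def rescaled_dual_guidance(eps_uncond, eps_text, eps_image,
                           w_text=7.5, w_image=3.0, mix=0.7):
    """Illustrative text+image guidance with a std-rescaling step to curb
    the overexposure that strong guidance can cause. Not the paper's exact
    normalization."""
    guided = (eps_uncond
              + w_text * (eps_text - eps_uncond)
              + w_image * (eps_image - eps_uncond))
    # Rescale the guided prediction's per-sample std toward the conditional one.
    dims = list(range(1, guided.dim()))
    std_cond = eps_text.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / (std_guided + 1e-8))
    return mix * rescaled + (1 - mix) * guided

shape = (2, 4, 64, 64)
out = rescaled_dual_guidance(torch.randn(shape), torch.randn(shape), torch.randn(shape))
print(out.shape)  # torch.Size([2, 4, 64, 64])
```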
Poster
Inkyu Shin · Chenglin Yang · Liang-Chieh (Jay) Chen
[ Exhibit Hall I ]
Abstract
Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final layer’s output under-utilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce **DeepFlow**, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges **8x faster** on ImageNet-256x256 with equivalent performance and further reduces FID by **2.6** while halving training time compared to previous flow-based models without a classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MS-COCO and zero-shot GenEval. The code will be made publicly available.
Poster
Rui Yang · Huining Li · Yiyi Long · Xiaojun Wu · Shengfeng He
[ Exhibit Hall I ]
Abstract
Generating sketches guided by reference styles requires precise transfer of stroke attributes, such as line thickness, deformation, and texture sparsity, while preserving semantic structure and content fidelity. To this end, we propose Stroke2Sketch, a novel training-free framework that introduces cross-image stroke attention, a mechanism embedded within self-attention layers to establish fine-grained semantic correspondences and enable accurate stroke attribute transfer. This allows our method to adaptively integrate reference stroke characteristics into content images while maintaining structural integrity. Additionally, we develop adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis. Stroke2Sketch effectively synthesizes stylistically faithful sketches that closely resemble handcrafted results, outperforming existing methods in expressive stroke control and semantic coherence.
Poster
Jinhong Ni · Chang-Bin Zhang · Qiang Zhang · Jing Zhang
[ Exhibit Hall I ]
Abstract
Recent prosperity of text-to-image diffusion models, e.g. Stable Diffusion, has stimulated research to adapt them to 360-degree panorama generation. Prior work has demonstrated the feasibility of using conventional low-rank adaptation techniques on pre-trained diffusion models to generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises questions about the underlying mechanisms enabling this empirical success. We hypothesize and examine that the trainable counterparts exhibit distinct behaviors when fine-tuned on panoramic data, and such an adaptation conceals some intrinsic mechanism to leverage the prior knowledge within the pre-trained diffusion models. Our analysis reveals the following: 1) the query and key matrices in the attention modules are responsible for common information that can be shared between the panoramic and perspective domains, thus are less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting pre-trained knowledge to the panoramic domain, playing a more critical role during fine-tuning for panorama generation. We empirically verify these insights by introducing a simple framework called UniPano, with the objective of establishing an elegant baseline for future research. UniPano not only outperforms existing methods but also significantly reduces memory usage and training time compared to prior dual-branch approaches, …
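The practical consequence of this analysis, restricting adaptation to the value and output projections, can be sketched in a few lines; the `to_v`/`to_out` parameter names assume diffusers-style naming and the toy module is only a stand-in.

```python
import torch.nn as nn

def freeze_all_but_value_and_output(model: nn.Module) -> int:
    """Mark only value and output projection weights as trainable, following
    the abstract's finding that these carry the domain-specific adaptation.
    Substring matching assumes diffusers-style names ('to_v', 'to_out')."""
    trainable = 0
    for name, param in model.named_parameters():
        keep = ("to_v" in name) or ("to_out" in name)
        param.requires_grad_(keep)
        trainable += param.numel() if keep else 0
    return trainable

class ToyAttention(nn.Module):
    """Toy stand-in for an attention block with diffusers-like naming."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

model = nn.Sequential(ToyAttention(), ToyAttention())
print("trainable params:", freeze_all_but_value_and_output(model))
```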
Poster
Wenchuan Wang · Mengqi Huang · Yijing Tu · Zhendong Mao
[ Exhibit Hall I ]
Abstract
Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention through its focus on identity and motion consistency. Existing works typically follow an isolated customization paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic $\textbf{mutual constraints and synergistic interdependencies}$ between identity and motion, resulting in identity-motion conflicts that systematically degrade quality throughout the generation process. To address this, we introduce $\textbf{DualReal}$, a novel framework that employs adaptive joint training to collaboratively construct interdependencies between the two dimensions. Specifically, DualReal is composed of two units: (1) $\textbf{Dual-aware Adaptation}$ dynamically selects a training phase ($\textit{i.e.}$, identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) $\textbf{StageBlender Controller}$ leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We construct a more comprehensive evaluation benchmark than existing methods. The experimental results show that DualReal improves the CLIP-I and DINO-I metrics by $\textbf{21.7}$% and $\textbf{31.8}$% on average, and achieves top performance on nearly all motion quality metrics.
Poster
Naresh Kumar Devulapally · Mingzhen Huang · Vishal Asnani · Shruti Agarwal · Siwei Lyu · Vishnu Lokhande
[ Exhibit Hall I ]
Abstract
Invisible watermarking of AI-generated images can help with copyright protection, enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermark images of text-to-image Latent Diffusion Models (LDMs). By only fine-tuning text token embeddings $\mathcal{W}_*$, we enable watermarking in selected objects or parts of the image, offering greater flexibility compared to traditional whole-image watermarking. This method also leverages the text encoder’s compatibility across various LDMs, allowing plug-and-play integration for different LDMs. Moreover, introducing the watermark early in the encoding stage improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves $99 \%$ bit accuracy ($48$ bits) with a $10^5 \times$ reduction in model parameters, enabling efficient watermarking.
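A minimal sketch of fine-tuning only the embedding of a dedicated watermark token $\mathcal{W}_*$ while the rest of the text encoder stays frozen; vocabulary size, dimensions, and the token index are illustrative, and the watermark decoder and training loss are omitted.

```python
import torch
import torch.nn as nn

# Sketch: optimize a single token-embedding row, keeping all other text-encoder
# parameters frozen. Sizes and the token id are illustrative assumptions.
vocab_size, dim = 49409, 768          # e.g. a CLIP-like vocabulary plus one added token
embedding = nn.Embedding(vocab_size, dim)
embedding.weight.requires_grad_(False)

wm_token_id = vocab_size - 1          # the added watermark token W_*
wm_embedding = nn.Parameter(embedding.weight[wm_token_id].clone())

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    """Look up embeddings, substituting the trainable row for W_*."""
    out = embedding(token_ids)
    return torch.where((token_ids == wm_token_id).unsqueeze(-1), wm_embedding, out)

optimizer = torch.optim.Adam([wm_embedding], lr=1e-4)   # only W_* receives gradients
ids = torch.tensor([[0, 5, wm_token_id, 2]])
print(embed(ids).shape)               # torch.Size([1, 4, 768])
```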
Poster
Yulin Pan · Xiangteng He · Chaojie Mao · Zhen Han · Zeyinzi Jiang · Jingfeng Zhang · Yu Liu
[ Exhibit Hall I ]
Abstract
Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized by the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images. We further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between …
Poster
Haiming Zhu · Yangyang Xu · Chenshu Xu · Tingrui Shen · Wenxi Liu · Yong Du · Jun Yu · Shengfeng He
[ Exhibit Hall I ]
Abstract
Text-guided image and 3D editing have advanced with diffusion-based models, yet methods like Delta Denoising Score often struggle with stability, spatial control, and editing strength. These limitations stem from reliance on complex auxiliary structures, which introduce conflicting optimization signals and restrict precise, localized edits. We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. Specifically, SSD utilizes the CFG equation to achieve cross-prompt alignment and introduces a constant-term null-text branch to stabilize the optimization process. This approach preserves the original content’s structure and ensures that editing trajectories are closely aligned with the source prompt, enabling smooth, prompt-specific modifications while maintaining coherence in surrounding regions. Additionally, SSD incorporates a prompt enhancement branch to boost editing strength, particularly for style transformations. Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity, providing a robust and efficient solution for text-guided editing.
Poster
Tianrui Zhu · Shiyi Zhang · Jiawei Shao · Yansong Tang
[ Exhibit Hall I ]
Abstract
Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses the KV cache in DiTs to maintain background consistency: background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, while new content is generated within user-provided regions to integrate seamlessly with the background. We further explore the memory consumption of the KV cache during editing and optimize the space complexity to $O(1)$ using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods.
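A minimal sketch of the KV-cache idea described above, not the authors' code: keys and values at background token positions are taken from a cache computed on the source image, so only the user-specified region is regenerated. The tensor shapes and the `background_mask` argument are assumptions for illustration.

```python
import torch

def kv_edit_attention(q, k_new, v_new, k_cached, v_cached, background_mask):
    """Attention step that reuses cached keys/values for background tokens.

    q, k_*, v_*: (batch, tokens, dim) tensors from a DiT self-attention layer.
    background_mask: (tokens,) boolean tensor, True where a token belongs to the background.
    """
    m = background_mask.view(1, -1, 1)
    k = torch.where(m, k_cached, k_new)   # background keys come from the source-image cache
    v = torch.where(m, v_cached, v_new)   # background values likewise stay fixed
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```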
Poster
Junchao Huang · Xinting Hu · Shaoshuai Shi · Zhuotao Tian · Li Jiang
[ Exhibit Hall I ]
Abstract
Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
Poster
Ju He · Qihang Yu · Qihao Liu · Liang-Chieh (Jay) Chen
[ Exhibit Hall I ]
Abstract
Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm—directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3$\times$ at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds—all while delivering performance comparable to state-of-the-art models. Code will be available.
Poster
Xin Wen · Bingchen Zhao · Ismail Elezi · Jiankang Deng · Xiaojuan Qi
[ Exhibit Hall I ]
Abstract
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space—a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference. Our code and models will be made publicly available.
Poster
Minghao Fu · Guo-Hua Wang · Xiaohao Chen · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang
[ Exhibit Hall I ]
Abstract
Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG's reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (**Te**xt **E**mbeddings **Fusion**), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model's complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher's output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher's performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6$\times$ faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher's complex sampling approach.
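As described, the guidance magnitude is folded into the text embeddings with linear operations so the student needs only one forward pass. A minimal sketch of that idea is shown below, assuming the standard CFG combination applied at the embedding level; the paper's exact fusion rule may differ.

```python
import torch

def fuse_text_embeddings(cond_emb: torch.Tensor,
                         uncond_emb: torch.Tensor,
                         guidance_scale: float) -> torch.Tensor:
    """Fold the CFG guidance weight into the text embeddings: the student is then
    conditioned on the fused embedding and run once, instead of two passes per step."""
    return uncond_emb + guidance_scale * (cond_emb - uncond_emb)

# Usage sketch: fused = fuse_text_embeddings(cond, uncond, guidance_scale=7.5)
```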
Poster
Christian Simon · Masato Ishii · Akio Hayakawa · Zhi Zhong · Shusuke Takahashi · Takashi Shibuya · Yuki Mitsufuji
[ Exhibit Hall I ]
Abstract
Recently developed conditional diffusion models still require heavy supervised fine-tuning for performing control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on the base model. However, existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit the applicability to controlling diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, termed TITAN-Guide, which overcomes memory space issues and provides more effective control in the guidance process compared to its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. Specifically, we study forward gradient descent for guided diffusion tasks with various options for directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, where previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks.
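The abstract mentions optimizing diffusion latents with forward gradients rather than backpropagation. A generic forward-gradient update is sketched below under stated assumptions (a scalar `guidance_loss_fn` of the latent and a random Gaussian probe direction); it illustrates the general technique, not the authors' specific directional directives.

```python
import torch
from torch.func import jvp

def forward_gradient_step(latent: torch.Tensor, guidance_loss_fn, lr: float = 0.05):
    """One forward-gradient descent step: a single Jacobian-vector product along a random
    tangent gives a directional derivative, avoiding storage of the backward graph."""
    tangent = torch.randn_like(latent)                       # random probe direction
    _, directional_deriv = jvp(guidance_loss_fn, (latent,), (tangent,))
    # (directional derivative) * (probe) is an unbiased estimate of the gradient for
    # standard-normal probes, so stepping against it reduces the guidance loss in expectation.
    return latent - lr * directional_deriv * tangent
```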
Poster
Minghan LI · Chenxi Xie · Yichen Wu · Lei Zhang · Mengyu Wang
[ Exhibit Hall I ]
Abstract
Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce **FiVE**, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models—Pyramid-Flow and Wan2.1—by introducing FlowEdit, resulting in training-free and inversion-free video editing models **Pyramid-Edit** and **Wan-Edit**. We compare six diffusion-based editing methods with our two RF-based editing methods on the proposed FiVE benchmark, evaluating them across 14 metrics. These metrics include background preservation, text-video similarity, temporal consistency, and generated video quality. To further enhance object-level evaluation, we introduce **FiVE-Acc**, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More …
Poster
Zixin Zhu · Kevin Duarte · Mamshad Nayeem Rizve · Chengyuan Xu · Ratheesh Kalarot · Junsong Yuan
[ Exhibit Hall I ]
Abstract
In text-to-image (T2I) generation, achieving fine-grained control over attributes - such as age or smile - remains challenging, even with detailed text prompts. Slider-based methods offer a solution for precise control of image attributes. Existing approaches typically train an individual adapter for each attribute separately, overlooking the entanglement among multiple attributes. As a result, interference occurs among different attributes, preventing precise control of multiple attributes together. To address this challenge, we aim to disentangle multiple attributes in slider-based generation to enable more reliable and independent attribute manipulation. Our approach, CompSlider, can generate a conditional prior for the T2I foundation model to control multiple attributes simultaneously. Furthermore, we introduce novel disentanglement and structure losses to compose multiple attribute changes while maintaining structural consistency within the image. Since CompSlider operates in the latent space of the conditional prior and does not require retraining the foundation model, it reduces the computational burden for both training and inference. We evaluate our approach on a variety of image attributes and highlight its generality by extending to video generation.
Poster
Yuhui WU · Liyi Chen · Ruibin Li · Shihao Wang · Chenxi Xie · Lei Zhang
[ Exhibit Hall I ]
Abstract
Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short-duration source videos in limited quantities, with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality \textbf{Ins}truction-based \textbf{Vi}deo \textbf{E}diting dataset with \textbf{1M} triplets, namely \textbf{InsViE-1M}. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M …
Poster
Zhaotong Yang · Yuhui Li · Shengfeng He · Xinzhe Li · Yangyang Xu · Junyu Dong · Yong Du
[ Exhibit Hall I ]
Abstract
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by a continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene.
Poster
Dewei Zhou · Mingwei Li · Zongxin Yang · Yi Yang
[ Exhibit Hall I ]
Abstract
Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7\% over FLUX and …
Poster
Yuzhuo Chen · Zehua Ma · Han Fang · Weiming Zhang · Nenghai Yu
[ Exhibit Hall I ]
Abstract
AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and high performance of generative image editing applications have elevated the risks of malicious tampering, creating new demands. 1) The tamper robustness of current lossless visual quality watermarks remains constrained by the modification-sensitive diffusion inversion process, necessitating enhanced robustness. 2) The improved tampering quality and rapid iteration cycles render passive tampering detection methods inadequate, making proactive tampering localization capability a desired feature for watermarks. To address these requirements, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises three key modules: a dual-mark joint sampling algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, a dense variation region detector leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and a tamper-aware message decoder guided by localization results. The experimental results indicate that TAG-WM achieves SOTA tampering robustness and tampering localization capability with distortions while …
Poster
jian ma · Qirong Peng · Xu Guo · Chen Chen · Haonan Lu · Zhenyu Yang
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, currently there is no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models to enable them to understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely a 100K English corpus and 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a decrease in performance degradation of less than 1\% while gaining various multimodal understanding abilities. Furthermore, it is applicable to LoRA training in the context of image-text-to-image generation, filling a void in the industry in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, …
Poster
Tianyi Wei · Yifan Zhou · Dongdong Chen · Xingang Pan
[ Exhibit Hall I ]
Abstract
The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information versus content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.
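A toy version of the probing idea follows, under the assumption of a standard interleaved RoPE with per-head query/key tensors of shape (tokens, dim) and `cos`/`sin` tables of shape (tokens, dim/2); the paper's automated strategy is more elaborate. Comparing the attention maps produced with normal RoPE, with RoPE removed, and with shuffled positions indicates how strongly a layer relies on positional information versus content similarity.

```python
import torch

def apply_rope(x, cos, sin):
    """Interleaved rotary position embedding: x is (tokens, dim), cos/sin are (tokens, dim/2)."""
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def probe_attention(q, k, cos, sin, use_rope=True, shuffle_positions=False):
    """Attention map under manipulated RoPE, used to separate positional vs. content reliance."""
    if shuffle_positions:                      # destroy positional structure, keep content
        perm = torch.randperm(cos.shape[0])
        cos, sin = cos[perm], sin[perm]
    if use_rope:
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
    return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
```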
Poster
Yukuan Min · Muli Yang · Jinhao Zhang · Yuxuan Wang · Aming WU · Cheng Deng
[ Exhibit Hall I ]
Abstract
To promote the deployment of scenario understanding in the real world, Open-Vocabulary Scene Graph Generation (OV-SGG) has attracted much attention recently, aiming to generalize beyond the limited number of relation categories labeled during training and detect those unseen relations during inference. Towards OV-SGG, one feasible solution is to leverage the large-scale pre-trained vision-language models (VLMs) containing plentiful category-level content to capture accurate correspondences between images and text. However, due to the lack of quadratic relation-aware knowledge in VLMs, directly using the category-level correspondence in the base dataset could not sufficiently represent generalized relations involved in open world. Therefore, designing an effective open-vocabulary relation mining framework is challenging and meaningful. To this end, we propose a novel Vision-Language Interactive Relation Mining model (VL-IRM) for OV-SGG, which explores learning generalized relation-aware knowledge through multi-modal interaction. Specifically, first, to enhance the generalization of the relation text to visual content, we present a generative relation model to make the text modality explore possible open-ended relations based on visual content. Then, we employ visual modality to guide the relation text for spatial and semantic extension. Extensive experiments demonstrate the superior OV-SGG performance of our method.
Poster
Guanning Zeng · Xiang Zhang · Zirui Wang · Haiyang Xu · Zeyuan Chen · Bingnan Li · Zhuowen Tu
[ Exhibit Hall I ]
Abstract
We propose YOLO-Count, a new differentiable open-vocabulary object counting model that addresses both general counting challenges and enables training-free quantity control for text-to-image (T2I) generation. A key contribution is the `cardinality' map, a novel regression target designed to account for object size and location variations. By employing representation alignment and a hybrid supervision scheme, YOLO-Count minimizes the discrepancy between open-vocabulary counting and T2I generation control. The model's differentiable architecture facilitates gradient-based optimization for accurate object counts, leading to enhanced controllability and transparency in T2I systems. Our empirical evaluation demonstrates state-of-the-art counting accuracy and effective quantity control for the T2I generation tasks.
Poster
Chang Liu · Viraj Shah · Aiyu Cui · Svetlana Lazebnik
[ Exhibit Hall I ]
Abstract
This paper introduces UnZipLoRA, a method for decomposing an image into its constituent subject and style, represented as two distinct LoRAs (Low-Rank Adaptations). Unlike existing personalization techniques that focus on either subject or style in isolation, or require separate training sets for each, UnZipLoRA disentangles these elements from a single image by training both the LoRAs simultaneously. UnZipLoRA ensures that the resulting LoRAs are compatible, i.e., they can be seamlessly combined using direct addition. UnZipLoRA enables independent manipulation and recontextualization of subject and style, including generating variations of each, applying the extracted style to new subjects, and recombining them to reconstruct the original image or create novel variations. To address the challenge of subject and style entanglement, UnZipLoRA employs a novel prompt separation technique, as well as column and block separation strategies to accurately preserve the characteristics of subject and style, and ensure compatibility between the learned LoRAs. Evaluation with human studies and quantitative metrics demonstrates UnZipLoRA's effectiveness compared to other state-of-the-art methods, including DreamBooth-LoRA, Inspiration Tree, and B-LoRA.
Poster
U-Chae Jun · Jaeeun Ko · Jiwoo Kang
[ Exhibit Hall I ]
Abstract
We introduce a novel generative framework that unifies adversarial and diffusion-based training to overcome the limitations of conventional models. Our approach, termed Generative Adversarial Diffusion (GAD), integrates an adversarial loss directly into each denoising step of a latent diffusion model. By employing a single U-Net as a unified generator and discriminator, our framework eliminates the need for a separate discriminator, thereby reducing memory overhead and mitigating common GAN issues such as mode collapse and training instability. This integrated adversarial regularizer promotes semantic information exchange across timesteps, enabling the model to better capture complex data distributions even when training data is scarce or biased. Extensive experiments on standard latent diffusion benchmarks demonstrate that GAD significantly enhances image quality and mode coverage in tasks including text-to-image and image-to-3D generation. Our results suggest that unifying adversarial and diffusion-based training in a single network offers a promising new direction for high-fidelity, stable image synthesis.
Poster
Priyank Pathak · Yogesh Rawat
[ Exhibit Hall I ]
Abstract
Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing changes. Existing approaches often rely on additional models or annotated attributes to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color—specifically foreground and background colors—as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce \textbf{S2A self-attention}, a novel mechanism designed to separate color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline Top-1 accuracy by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and by 1.0% on CCVID and 3.6% on MeVID for video-based ReID, without relying on additional supervision. Our results highlight the potential of color as …
Poster
Yiming Gong · Zhen Zhu · Minjia Zhang
[ Exhibit Hall I ]
Abstract
We propose a fast text-guided image editing method called InstantEdit based on the RectifiedFlow framework, which is structured as a few-step editing process that preserves critical content while closely following textual instructions. Our approach leverages the straight sampling trajectories of RectifiedFlow by introducing a specialized inversion strategy called PerRFI to enhance inversion accuracy. To seamlessly integrate PerRFI with our backbone RectifiedFlow model, we further propose a novel regeneration method, Inversion Latent Injection, which effectively reuses latent information obtained during inversion to facilitate more coherent and detailed regeneration. Additionally, we propose a Disentangled Prompt Guidance technique to balance editability with detail preservation, and integrate a Canny-conditioned ControlNet to incorporate structural cues and suppress artifacts. Evaluation on the PIE image editing dataset demonstrates that InstantEdit is not only fast but also achieves better qualitative and quantitative results compared to state-of-the-art few-step editing methods.
Poster
Yanzuo Lu · Yuxi Ren · Xin Xia · Shanchuan Lin · XING WANG · Xuefeng Xiao · Jinhua Ma · Xiaohua Xie · Jianhuang Lai
[ Exhibit Hall I ]
Abstract
Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose **Adversarial Distribution Matching (ADM)**, a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Different from the mean squared error used in DMD2 pre-training, our method incorporates the distributional loss on ODE pairs collected from the teacher model, thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed **DMDX**, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
Poster
Bowen Fu · Wei Wei · Jiaqi Tang · Jiangtao Nie · Yanyu Ye · Xiaogang Xu · Ying-Cong Chen · Lei Zhang
[ Exhibit Hall I ]
Abstract
Controllable diffusion models have been widely applied in image stylization. However, existing methods often treat the style in the reference image as a single, indivisible entity, which makes it difficult to transfer specific stylistic attributes. To address this issue, we propose a fine-grained controllable image stylization framework, Co-Painter, to decouple multiple attributes embedded in the reference image and adaptively inject them into the diffusion model. We first build a multi-condition image stylization framework based on the text-to-image generation model. Then, to drive it, we develop a fine-grained decoupling mechanism to implicitly separate the attributes from the image. Finally, we design a gated feature injection mechanism to adaptively regulate the importance of multiple attributes. To support the above procedure, we also build a dataset with fine-grained styles. It comprises nearly 48,000 image-text pairs. Extensive experiments demonstrate that the proposed model achieves an optimal balance between text alignment and style similarity to reference images, both in standard and fine-grained settings.
Poster
Haowen Li · Zhenfeng Fan · Zhang Wen · Zhengzhou Zhu · Yunjin Li
[ Exhibit Hall I ]
Abstract
Image composition has advanced significantly with large-scale pre-trained T2I diffusion models. Despite progress in same-domain composition, cross-domain composition remains under-explored. The main challenges are the stochastic nature of diffusion models and the style gap between input images, leading to failures and artifacts. Additionally, heavy reliance on text prompts limits practical applications. This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. Our method is efficient and robust, preserving the diffusion prior, as it involves minor steps after initial image blending without additional interference in the diffusion process. Our method uses a multilayer perceptron to integrate CLIP features from foreground and background images, manipulating diffusion steps with a cross-attention strategy. It effectively preserves foreground content while enabling stable stylization without a pre-stylization network. We also create a benchmark dataset with diverse contents and styles for fair evaluation, addressing the lack of testing datasets for cross-domain image composition. Our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations, reducing LPIPS scores by $30.5$\% and improving CSD metrics by $18.1$\%. We believe our method will advance future research and applications. The code and benchmark will be publicly available.
Poster
XINQI LYU · Yihao LIU · Yanjie Li · Bin Xiao
[ Exhibit Hall I ]
Abstract
Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.
Poster
Jingye Chen · Zhaowen Wang · Nanxuan Zhao · Li Zhang · Difan Liu · Jimei Yang · Qifeng Chen
[ Exhibit Hall I ]
Abstract
Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework making the first attempt to convert AI-generated designs into editable layered designs, while refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages: (1) reference creation, (2) design planning, and (3) layer generation. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such …
Poster
Jie Shao · Hanxiao Zhang · Hao Yu · Jianxin Wu
[ Exhibit Hall I ]
Abstract
The rapid progress in generative models has significantly enhanced the quality of image generation. However, as these models grow larger, deploying and fine-tuning them becomes increasingly challenging. While conventional quantization techniques help reduce model size, they struggle to achieve high compression rates without significant performance loss. As a result, memory footprint remains a critical challenge for generative models. In this work, we explore the extreme compression of generative models through codebook quantization, drastically reducing model size while maintaining performance. We extend product quantization for model compression, significantly increasing codebook capacity, which is crucial for preserving the generative quality of diffusion models. We also introduce a codebook compression method for memory efficiency. To further minimize performance degradation, we develop EM calibration with re-initialization that optimizes both assignments and centroids. By compressing the model to as low as 1 bit (achieving a 13$\times$ reduction in model size), we obtain a highly compact generative model with remarkable image quality. Extensive experiments on ImageNet demonstrate the superiority of our method over existing techniques. Furthermore, we validate its effectiveness across various generation, language and 3D tasks, highlighting its broad applicability and robust performance.
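The core of codebook-based compression via product quantization can be illustrated with a short k-means sketch; the paper's EM calibration with re-initialization and its codebook compression are more involved. The code assumes the weight's input dimension divides evenly into sub-vectors and that the weight has at least `codebook_size` rows.

```python
import torch

def product_quantize(weight: torch.Tensor, num_subvectors: int = 4,
                     codebook_size: int = 256, iters: int = 10):
    """Split each weight row into sub-vectors, learn one small codebook per sub-space with
    plain k-means, and keep only integer codes plus the codebooks (illustrative sketch)."""
    out_dim, in_dim = weight.shape
    sub_dim = in_dim // num_subvectors
    codes, codebooks = [], []
    for s in range(num_subvectors):
        block = weight[:, s * sub_dim:(s + 1) * sub_dim]              # (out_dim, sub_dim)
        centroids = block[torch.randperm(out_dim)[:codebook_size]].clone()
        for _ in range(iters):                                        # Lloyd's algorithm
            assign = torch.cdist(block, centroids).argmin(dim=1)
            for c in range(codebook_size):
                members = block[assign == c]
                if len(members) > 0:
                    centroids[c] = members.mean(dim=0)
        codes.append(assign)          # one small integer index per row and sub-space
        codebooks.append(centroids)
    return codes, codebooks
```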
Poster
Hengyu Meng · Duotun Wang · Zhijing Shao · Ligang Liu · Zeyu Wang
[ Exhibit Hall I ]
Abstract
Professional 3D asset creation often requires diverse sculpting brushes to add surface details and geometric structures. Despite recent progress in 3D generation, producing reusable sculpting brushes compatible with artists' workflows remains an open and challenging problem. These sculpting brushes are typically represented as vector displacement maps (VDMs), which existing models cannot easily generate compared to natural images. This paper presents Text2VDM, a novel framework for text-to-VDM brush generation through the deformation of a dense planar mesh guided by score distillation sampling (SDS). The original SDS loss is designed for generating full objects and struggles with generating desirable sub-object structures from scratch in brush generation. We refer to this issue as semantic coupling, which we address by introducing weighted blending of prompt tokens to SDS, resulting in a more accurate target distribution and semantic guidance. Experiments demonstrate that Text2VDM can generate diverse, high-quality VDM brushes for sculpting surface details and geometric structures. Our generated brushes can be seamlessly integrated into mainstream modeling software, enabling various applications such as mesh stylization and real-time interactive modeling.
Poster
Haonan Qiu · Shiwei Zhang · Yujie Wei · Ruihang Chu · Hangjie Yuan · Xiang Wang · Yingya Zhang · Ziwei Liu
[ Exhibit Hall I ]
Abstract
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from accumulated errors. To tackle this challenge, we propose $\textbf{FreeScale}$, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks $\textbf{8k}$-resolution text-to-image generation for the first time.
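The scale-fusion step described above combines information from different receptive scales by frequency content. A simplified frequency-domain blend is sketched below; the cutoff value, where it is applied in the denoiser, and the feature shapes are assumptions rather than FreeScale's exact recipe. Low frequencies come from the global-scale features and high frequencies from the local-scale features.

```python
import torch

def scale_fusion(global_feat: torch.Tensor, local_feat: torch.Tensor, cutoff: float = 0.25):
    """Blend two same-shaped feature maps (..., H, W) in the Fourier domain."""
    G = torch.fft.fftshift(torch.fft.fft2(global_feat), dim=(-2, -1))
    L = torch.fft.fftshift(torch.fft.fft2(local_feat), dim=(-2, -1))
    h, w = global_feat.shape[-2:]
    yy = torch.linspace(-1, 1, h).view(-1, 1).expand(h, w)
    xx = torch.linspace(-1, 1, w).view(1, -1).expand(h, w)
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(G.dtype)   # centered low-pass mask
    fused = low_pass * G + (1 - low_pass) * L
    return torch.fft.ifft2(torch.fft.ifftshift(fused, dim=(-2, -1))).real
```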
Poster
Yiren Song · Xiaokang Liu · Mike Zheng Shou
[ Exhibit Hall I ]
Abstract
Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.
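At its core, the measurement reduces to comparing denoising-network features of the two images at a matched layer and timestep. A minimal sketch of such a comparison follows; extracting features from the U-Net attention layers is assumed to have happened upstream, and this is not the full DiffSim metric.

```python
import torch
import torch.nn.functional as F

def attention_feature_similarity(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between spatially aligned token features of shape (tokens, dim),
    extracted from the same denoising step and attention layer for two images."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    return (a * b).sum(dim=-1).mean()
```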
Poster
Anlin Zheng · Haochen Wang · Yucheng Zhao · Weipeng DENG · Tiancai Wang · Xiangyu Zhang · Xiaojuan Qi
[ Exhibit Hall I ]
Abstract
The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce \textit{Hita}, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving \textbf{2.59 FID} and \textbf{281.9 IS} on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code will be publicly …
Poster
Azim Ospanov · Mohammad Jalali · Farzan Farnia
[ Exhibit Hall I ]
Abstract
The use of CLIP embeddings to assess the alignment of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the relevance of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, which is responsible for generating diverse images from similar text prompts. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. Using the Schur complement of the joint image-text kernel covariance matrix, we perform this decomposition and define the matrix-based entropy of the decomposed component as the *Schur Complement Entropy (SCE)* score, a measure of the intrinsic diversity of a text-to-image model based on data collected with varying text prompts. Additionally, we demonstrate the use of the Schur complement-based decomposition to nullify the influence of a given prompt in the CLIP embedding of an image, enabling focus or defocus of embeddings on specific objects or properties for downstream tasks. We …
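A compact sketch of the SCE computation as described above, where the kernel choice, normalization, and regularization are assumptions: form the joint image-text covariance from CLIP embeddings, take the Schur complement of the image block with respect to the text block, and measure its matrix-based entropy.

```python
import torch

def schur_complement_entropy(img_emb: torch.Tensor, txt_emb: torch.Tensor, eps: float = 1e-6):
    """img_emb, txt_emb: (n, d) CLIP embeddings of generated images and their prompts."""
    X = torch.nn.functional.normalize(img_emb, dim=-1)
    T = torch.nn.functional.normalize(txt_emb, dim=-1)
    n = X.shape[0]
    C_ii, C_tt, C_it = X.T @ X / n, T.T @ T / n, X.T @ T / n
    schur = C_ii - C_it @ torch.linalg.pinv(C_tt) @ C_it.T    # text-independent component
    evals = torch.linalg.eigvalsh(schur).clamp_min(eps)
    p = evals / evals.sum()
    return -(p * p.log()).sum()                               # matrix-based entropy
```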
Poster
Xuan Han · Yihao Zhao · Yanhao Ge · Mingyu You
[ Exhibit Hall I ]
Abstract
With its extensive applications, Foreground Conditioned Out-painting (FCO) has attracted considerable attention in the research field. With text-driven FCO, users can generate diverse backgrounds for a given foreground by adjusting the text prompt, which considerably enhances efficiency in fields like e-commerce. Since the foreground is fixed in FCO, a key concern is whether the generated background can match the foreground well to achieve a coherent composition. However, most existing methods are lacking in this regard. Artifacts and incorrect interactions are common defects in synthesized images. This issue is linked to the influence of the initial noise in the sampling process. As the initial noise is sampled independently, it is highly likely that the implied image composition will conflict with the given foreground. In this paper, a novel Initialization Policy Model (IPM) is proposed to address this problem. Its function is to replace the early denoising steps and directly predict an intermediate state that is conducive to a reasonable image composition. Since the IPM is designed to take only the foreground image and the text prompt as inputs, it isolates the impact of the initial noise. The subsequently proposed training paradigm that combines inversion-derived label supervision …
Poster
Pedro Vélez · Luisa Polania Cabrera · Yi Yang · Chuhan Zhang · Rishabh Kabra · Anurag Arnab · Mehdi S. M. Sajjadi
[ Exhibit Hall I ]
Abstract
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we analyze the performance of latent image and video diffusion representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. For the most informative comparison, we utilize the same model architecture, WALT, across image and video generation objectives. Our results show that video generation pre-training consistently outperforms its image counterpart, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
Poster
Zhe Ma · Qingming Li · Xuhong Zhang · Tianyu Du · Ruixiao Lin · Zonghui Wang · Shouling Ji · Wenzhi CHEN
[ Exhibit Hall I ]
Abstract
The past few years have witnessed substantial advances in image generation powered by diffusion models. However, it was shown that diffusion models are susceptible to training data memorization, raising significant concerns regarding copyright infringement and privacy invasion. This study delves into a rigorous analysis of memorization in diffusion models. We introduce InvMM, an inversion-based measure of memorization, which is based on inverting a sensitive latent noise distribution that accounts for the replication of an image. For accurate estimation of the measure, we propose an adaptive algorithm that balances the normality and sensitivity of the noise distribution. Comprehensive experiments across four datasets, conducted on both unconditional and text-guided diffusion models, demonstrate that InvMM provides a reliable and complete quantification of memorization. Notably, InvMM is commensurable between samples, reveals the true extent of memorization from an adversarial standpoint and implies how memorization differs from membership. In practice, it serves as an auditing tool for developers to reliably assess the risk of memorization, thereby contributing to the enhancement of trustworthiness and privacy-preserving capabilities of diffusion models.
Poster
Aysan Aghazadeh · Adriana Kovashka
[ Exhibit Hall I ]
Abstract
We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) methods and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. We show that current T2I models struggle with creativity, persuasiveness, and alignment when the input text is implicit messages. We further introduce a simple yet effective approach to enhance T2I models' capabilities in producing images that are better aligned, more creative, and more persuasive.
Poster
Zuhao Yang · Jiahui Zhang · Yingchen Yu · Shijian Lu · Song Bai
[ Exhibit Hall I ]
Abstract
Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high‑quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high‑fidelity, and semantic‑coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine‑tuning and representation alignment regularization to mitigate the limitations of pre‑trained image‑to‑video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.
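The interpolation-based initialization can be pictured as a linear blend of the two endpoint latents used to seed the intermediate frames before diffusion refinement. The sketch below is an assumed simplification with per-frame latents of shape (C, H, W); VTG's actual initialization may be richer.

```python
import torch

def interpolate_init(first_latent: torch.Tensor, last_latent: torch.Tensor, num_frames: int):
    """Linearly blend the first- and last-frame latents to initialize all transition frames."""
    weights = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1, 1)
    return (1.0 - weights) * first_latent + weights * last_latent    # (num_frames, C, H, W)
```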
Poster
JianHui Zhang · Shen Cheng · Qirui Sun · Jia Liu · Wang Luyang · chaoyu feng · Chen Fang · LEI LEI · Jue Wang · Shuaicheng Liu
[ Exhibit Hall I ]
Abstract
In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment—two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter: Learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency. (2) Reference Patch Adapter: Implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and photo-concept-bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence. The code will be open-sourced.
Poster
Shengbang Tong · David Fan · Jiachen Zhu · Yunyang Xiong · Xinlei Chen · Koustuv Sinha · Michael Rabbat · Yann LeCun · Saining Xie · Zhuang Liu
[ Exhibit Hall I ]
Abstract
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
Poster
Quang-Binh Nguyen · Minh Luu · Quang Nguyen · Anh Tran · Khoi Nguyen
[ Exhibit Hall I ]
Abstract
Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance on par with diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representation with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) Augmented Key-Value (K-V) memory enhancing content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
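One plausible reading of the SVD-based rectification is to project the learned style embedding off the dominant singular directions of the content representation; the sketch below illustrates that idea. The function `rectify_style`, the choice of `k`, and the use of content token embeddings as the basis are assumptions, not the CSD-VAR implementation.

```python
# Hedged sketch (not the paper's implementation): remove content leakage from a
# style embedding by projecting out the top-k singular directions spanned by
# content token embeddings.
import torch

def rectify_style(style_emb: torch.Tensor, content_tokens: torch.Tensor, k: int = 4):
    """
    style_emb:      (D,) learned style embedding
    content_tokens: (N, D) embeddings associated with the content
    """
    # Top-k right singular vectors span the dominant content directions.
    _, _, vh = torch.linalg.svd(content_tokens, full_matrices=False)
    basis = vh[:k]                          # (k, D)
    leak = basis.t() @ (basis @ style_emb)  # component of the style lying in that span
    return style_emb - leak
```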
Poster
Runtao Liu · I Chen · Jindong Gu · Jipeng Zhang · Renjie Pi · Qifeng Chen · Philip Torr · Ashkan Khakzar · Fabio Pizzati
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, which can remove only a few concepts from the model's generative capabilities. In this work, we introduce AlignGuard, a method for safety alignment of T2I models. We enable the application of Direct Preference Optimization (DPO) for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. AlignGuard consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. We will release code and models.
Poster
Zhuoling Li · Haoxuan Qu · Jason Kuen · Jiuxiang Gu · Qiuhong Ke · Jun Liu · Hossein Rahmani
[ Exhibit Hall I ]
Abstract
Intellectual property (IP) protection for diffusion models is a critical concern, given the significant resources and time required for their development. To effectively safeguard the IP of diffusion models, a key step is enabling the comparison of unique identifiers (fingerprints) between suspect and victim models. However, performing robust and effective fingerprint comparisons among diffusion models remains an under-explored challenge, particularly for diffusion models that have already been released. To address this, in this work, we propose \textbf{DiffIP}, a novel framework for robust and effective fingerprint comparison between suspect and victim diffusion models. Extensive experiments demonstrate the efficacy of our framework.
Poster
Yuxuan Wang · Tianwei Cao · Huayu Zhang · Zhongjiang He · Kongming Liang · Zhanyu Ma
[ Exhibit Hall I ]
Abstract
Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
Poster
Ryan Ramos · Vladan Stojnić · Giorgos Kordopatis-Zilos · Yuta Nakashima · Giorgos Tolias · Noa Garcia
[ Exhibit Hall I ]
Abstract
Prior work has analyzed the robustness of deep models to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positive or negative, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels.
Poster
Yilin Wang · Zunlei Feng · Jiachi Wang · Hengrui Lou · Binjia Zhou · Jie Lei · Mingli Song · Yijun Bei
[ Exhibit Hall I ]
Abstract
The rapid development of AIGC technology has enabled highly realistic forged images to deceive human perception, posing serious risks across many areas. Current deepfake image detection methods primarily identify forgeries by extracting handcrafted features, deep features, and frequency-domain features. While these features contain forgery traces, they also include a substantial amount of the image's semantic information, which interferes with the precision and generalization of forgery detection models. To tackle these challenges, this paper introduces a novel forgery image identification method based on the Spatial-Temporal Forgery Trace (STFT). Motivated by the fact that forgery images are more easily fitted to a specific distribution than real images, the STFT method approaches the issue from a forged image distribution modeling perspective, employing generative diffusion models to meticulously capture the temporal distribution of images. It further models the relationship between temporal feature variations and spatially corresponding temporal features, treating them as temporal and spatial forgery traces. Moreover, STFT incorporates frequency-domain features as weighting factors to accelerate the localization of spatio-temporal forgery traces. Experiments demonstrate that by integrating spatial, temporal, and frequency perspectives within the latent space, STFT effectively captures subtle spatio-temporal forgery traces, exhibiting strong robustness and generalizability. It outperforms state-of-the-art methods on major …
Poster
Siyoon Jin · Jisu Nam · Jiyoung Kim · Dahyun Chung · Yeong-Seok Kim · Joonhyung Park · HeonJeong Chu · Seungryong Kim
[ Exhibit Hall I ]
Abstract
Exemplar-based semantic image synthesis generates images aligned with semantic content while preserving the appearance of an exemplar. Conventional structure-guidance models, such as ControlNet, are limited as they rely solely on text prompts to control appearance and cannot utilize exemplar images as input. Recent tuning-free approaches address this by transferring local appearance via implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, prior works are often restricted to single-object cases or foreground object appearance transfer, struggling with complex scenes involving multiple objects. To overcome this, we propose AM-Adapter (Appearance Matching Adapter) to address exemplar-based semantic image synthesis in-the-wild, enabling multi-object appearance transfer from a single scene-level image. AM-Adapter automatically transfers local appearances from the scene-level input. It also provides controllability, allowing users to map user-defined object details to specific locations in the synthesized images. Our learnable framework enhances cross-image matching within augmented self-attention by integrating semantic information from segmentation maps. To disentangle generation and matching, we adopt stage-wise training: we first train the structure-guidance and generation networks, followed by training the matching adapter while keeping the others frozen. During inference, we introduce an automated exemplar retrieval method for selecting exemplar image-segmentation pairs efficiently. Despite utilizing minimal learnable parameters, AM-Adapter achieves …
Poster
Zhikai Chen · Fuchen Long · Zhaofan Qiu · Ting Yao · Wengang Zhou · Jiebo Luo · Tao Mei
[ Exhibit Hall I ]
Abstract
Recent advances in video generation have demonstrated the utility of powerful diffusion models. One important direction among them is to enhance the visual quality of AI-synthesized videos for artistic creation. Nevertheless, solely relying on the knowledge embedded in pre-trained video diffusion models might limit the generalization ability of local details (e.g., texture). In this paper, we address this issue by exploring the visual cues from a high-quality (HQ) image reference to facilitate visual detail generation in video enhancement. We present GenVE, a new recipe for generative video enhancement that pursues semantic and texture alignment between the HQ image reference and the denoised video in diffusion. Technically, GenVE first leverages an image diffusion model to magnify a key frame of the input video to attain a semantics-aligned HQ image reference. Then, a video controller is integrated into the 3D-UNet to capture patch-level texture of the image reference to enhance fine-grained detail generation in the corresponding region of the low-quality (LQ) video. Moreover, a series of conditioning augmentation strategies are implemented for effective model training and algorithm robustness. Extensive experiments conducted on the public YouHQ40 and VideoLQ, as well as the self-built AIGC-Vid dataset, quantitatively and qualitatively demonstrate the efficacy of our GenVE …
Poster
Yingsong Huang · Hui Guo · Jing Huang · Bing Bai · Qi Xiong
[ Exhibit Hall I ]
Abstract
The rapid progress of diffusion models highlights the growing need for detecting generated images. Previous research demonstrates that incorporating diffusion-based measurements, such as reconstruction error, can enhance the generalizability of detectors. However, ignoring the differing impacts of aleatoric and epistemic uncertainty on reconstruction error can undermine detection performance. Aleatoric uncertainty, arising from inherent data noise, creates ambiguity that impedes accurate detection of generated images. As it reflects random variations within the data (e.g., noise in natural textures), it does not help distinguish generated images. In contrast, epistemic uncertainty, which represents the model's lack of knowledge about unfamiliar patterns, supports detection. In this paper, we propose a novel framework, Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA), for detecting diffusion-generated images. We introduce Diffusion Epistemic Uncertainty (DEU) estimation via the Laplace approximation to assess the proximity of data to the manifold of diffusion-generated samples. Additionally, an asymmetric loss function is introduced to train a balanced classifier with larger margins, further enhancing generalizability. Extensive experiments on large-scale benchmarks validate the state-of-the-art performance of our method.
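The asymmetric loss can be pictured as a margin-based binary cross-entropy that demands a larger margin on the generated class; the sketch below is an assumed form, with `margin_real`, `margin_gen`, and the label convention chosen for illustration only, not the DEUA objective.

```python
# Minimal sketch (assumed form, not the authors' exact loss): an asymmetric
# margin loss that demands a larger margin for generated images than for real
# ones, so the classifier errs on the side of flagging unfamiliar patterns.
import torch
import torch.nn.functional as F

def asymmetric_margin_loss(logits, labels, margin_real=0.0, margin_gen=0.5):
    """logits: (B,) raw scores (>0 means 'generated'); labels: (B,) in {0, 1}."""
    margins = torch.where(labels == 1,
                          torch.full_like(logits, margin_gen),
                          torch.full_like(logits, margin_real))
    # Shift logits toward the decision boundary by the class-specific margin
    # before applying standard binary cross-entropy.
    signs = labels * 2.0 - 1.0                # +1 for generated, -1 for real
    shifted = logits - signs * margins
    return F.binary_cross_entropy_with_logits(shifted, labels.float())
```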
Poster
Lingxiao Li · Kaixuan Fan · Boqing Gong · Xiangyu Yue
[ Exhibit Hall I ]
Abstract
Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying semantic codes. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a superior balance between quality and diversity with limited data and offers a highly controllable and interpretable generation process.
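The radius-based control can be illustrated by rescaling the norm of a latent code inside the Poincaré ball; the sketch below is an assumption about the mechanism, and `set_hyperbolic_radius` and its scaling rule are not the authors' implementation.

```python
# Hedged sketch (an assumption about the mechanism, not the authors' code):
# rescaling the hyperbolic radius of a latent code in the Poincare ball. Codes
# closer to the origin are more generic, codes near the boundary more specific,
# so the radius acts as a diversity/specificity knob.
import torch

def set_hyperbolic_radius(z: torch.Tensor, target_radius: float, eps: float = 1e-6):
    """z: (B, D) points inside the unit Poincare ball; target_radius in [0, 1)."""
    norm = z.norm(dim=-1, keepdim=True).clamp(min=eps)
    direction = z / norm
    return direction * target_radius

# Example: pull codes toward the origin for more generic (diverse) samples.
# z_generic = set_hyperbolic_radius(z, target_radius=0.3)
```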
Poster
Rui Xie · Yinhong Liu · Penghao Zhou · Chen Zhao · Jun Zhou · Kai Zhang · Zhenyu Zhang · Jian Yang · Zhenheng Yang · Ying Tai
[ Exhibit Hall I ]
Abstract
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
Poster
Jingwei Liu · Ling Yang · Hao Luo · Fan Wang · Hongyan Li · Mengdi Wang
[ Exhibit Hall I ]
Abstract
The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited LLM context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models.
Poster
Zhuokun Chen · Jugang Fan · Zhuowei Yu · Bohan Zhuang · Mingkui Tan
[ Exhibit Hall I ]
Abstract
Visual autoregressive modeling, based on the next-scale prediction paradigm, exhibits notable advantages in image quality and model scalability over traditional autoregressive and diffusion models. It generates images by progressively refining resolution across multiple stages. However, the computational overhead in high-resolution stages remains a critical challenge due to the substantial number of tokens involved. In this paper, we introduce SparseVAR, a plug-and-play acceleration framework for next-scale prediction that dynamically excludes low-frequency tokens during inference without requiring additional training. Our approach is motivated by the observation that tokens in low-frequency regions have a negligible impact on image quality in high-resolution stages and exhibit strong similarity with neighboring tokens. Additionally, we observe that different blocks in the next-scale prediction model focus on distinct regions, with some concentrating on high-frequency areas. SparseVAR leverages these insights by employing lightweight MSE-based metrics to identify low-frequency tokens while preserving the fidelity of excluded regions through a small set of uniformly sampled anchor tokens. By significantly reducing the computational cost while maintaining high image generation quality, SparseVAR achieves notable acceleration in both HART and Infinity. Specifically, SparseVAR achieves up to a 2× speedup with minimal quality degradation in Infinity-2B.
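The token-exclusion idea can be sketched as follows: score each high-resolution token by its MSE against its local neighborhood, keep only the highest-scoring (high-frequency) tokens, and retain a uniform grid of anchor tokens for the excluded regions. The thresholding rule, `keep_ratio`, and `anchor_stride` are illustrative assumptions, not the SparseVAR implementation.

```python
# Illustrative sketch (assumed details): keep tokens whose local MSE to their
# neighbors is high (high-frequency regions), plus a sparse grid of anchor
# tokens that preserves a coarse estimate of the dropped regions.
import torch
import torch.nn.functional as F

def select_tokens(feat: torch.Tensor, keep_ratio=0.5, anchor_stride=4):
    """feat: (C, H, W) token features at a high-resolution stage.
    Returns a boolean mask of shape (H, W): True = token is kept."""
    c, h, w = feat.shape
    # MSE between each token and the mean of its 3x3 neighborhood.
    neighborhood = F.avg_pool2d(feat.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    score = ((feat - neighborhood) ** 2).mean(dim=0)          # (H, W)
    k = int(keep_ratio * h * w)
    thresh = score.flatten().kthvalue(h * w - k + 1).values   # keep the top-k scores
    mask = score >= thresh
    # Uniformly sampled anchor tokens cover the excluded low-frequency regions.
    mask[::anchor_stride, ::anchor_stride] = True
    return mask
```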
Poster
Xuran Ma · Yexin Liu · Yaofu LIU · Xianfeng Wu · Mingzhe Zheng · Zihao Wang · Ser-Nam Lim · Harry Yang
[ Exhibit Hall I ]
Abstract
Video generation using diffusion models has shown remarkable progress, yet it remains computationally expensive due to the repeated processing of redundant features across blocks and steps. To address this, we propose a novel adaptive feature reuse mechanism that dynamically identifies and caches the most informative features, focusing computation on the foreground while caching more aggressively in the background, which significantly reduces computational overhead with little sacrifice in video quality. By leveraging step- and block-level caching, our method achieves up to a 1.8× speedup on HunyuanVideo while maintaining competitive performance on VBench, PSNR, SSIM, FID, and LPIPS. Extensive experiments demonstrate that our approach not only improves efficiency but also enhances the quality of generated videos. The proposed method is generalizable and can be integrated into existing diffusion transformer frameworks.
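A minimal sketch of step- and block-level feature caching, assuming a reuse rule based on how much a block's input changed since the previous denoising step and a foreground-dependent threshold; the class name `BlockCache` and the thresholds `tau_fg`/`tau_bg` are hypothetical, not the paper's code.

```python
# Minimal sketch (assumed mechanism): reuse a transformer block's output from
# the previous denoising step when its input has barely changed; a foreground
# ratio makes the reuse threshold stricter where foreground tokens dominate.
import torch

class BlockCache:
    def __init__(self, tau_fg=0.02, tau_bg=0.10):
        self.prev_in, self.prev_out = {}, {}
        self.tau_fg, self.tau_bg = tau_fg, tau_bg

    def run(self, block_id, block_fn, x, fg_ratio):
        """fg_ratio: fraction of tokens judged to be foreground for this block."""
        tau = self.tau_fg * fg_ratio + self.tau_bg * (1.0 - fg_ratio)
        if block_id in self.prev_in:
            delta = (x - self.prev_in[block_id]).abs().mean() / (x.abs().mean() + 1e-8)
            if delta < tau:                       # input barely moved: reuse the cache
                return self.prev_out[block_id]
        out = block_fn(x)                         # otherwise recompute and refresh
        self.prev_in[block_id], self.prev_out[block_id] = x.detach(), out.detach()
        return out
```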
Poster
Tsu-Jui Fu · Yusu Qian · Chen Chen · Wenze Hu · Zhe Gan · Yinfei Yang
[ Exhibit Hall I ]
Abstract
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.
Poster
Hyungjin Kim · Seokho Ahn · Young-Duk Seo
[ Exhibit Hall I ]
Abstract
Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning.
Poster
Carlos Esteves · Mohammed Suhail · Ameesh Makadia
[ Exhibit Hall I ]
Abstract
Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster-scan order, which is not ideal for autoregressive modeling. In this paper, we propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT), such that the sequence of tokens represents the image in a coarse-to-fine fashion. Our tokenizer brings several advantages: 1) it leverages the fact that natural images are more compressible at high frequencies, 2) it can take and reconstruct images of different resolutions without retraining, 3) it improves the conditioning for next-token prediction -- instead of conditioning on a partial line-by-line reconstruction of the image, it takes a coarse reconstruction of the full image, 4) it enables partial decoding, where the first few generated tokens can reconstruct a coarse version of the image, 5) it enables autoregressive models to be used for image upsampling. We evaluate the tokenizer's reconstruction metrics as well as multiscale image generation, text-guided image upsampling, and editing.
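The coarse-to-fine ordering can be sketched with an off-the-shelf 2D DWT: the coarsest approximation subband comes first, followed by detail subbands from coarse to fine. Quantization into discrete tokens is omitted, and the helper `dwt_coarse_to_fine` is an illustrative assumption, not the paper's tokenizer.

```python
# Hedged sketch (assumptions, not the paper's tokenizer): order image content
# coarse-to-fine by flattening the subbands of a multi-level 2D DWT, so early
# positions in the sequence already describe a low-resolution version of the image.
import numpy as np
import pywt

def dwt_coarse_to_fine(image: np.ndarray, wavelet="haar", levels=3):
    """image: (H, W) grayscale array. Returns a 1D coefficient sequence."""
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=levels)
    sequence = [coeffs[0].ravel()]                 # coarsest approximation first
    for (ch, cv, cd) in coeffs[1:]:                # then detail bands, coarse -> fine
        sequence.extend([ch.ravel(), cv.ravel(), cd.ravel()])
    return np.concatenate(sequence)

# A vector quantizer applied per subband would turn this sequence into discrete
# tokens; decoding only the first few entries already yields a coarse reconstruction.
```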
Poster
Zeyinzi Jiang · Zhen Han · Chaojie Mao · Jingfeng Zhang · Yulin Pan · Yu Liu
[ Exhibit Hall I ]
Abstract
Diffusion Transformers have demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations.
Poster
yifei feng · Mx Yang · Shuhui Yang · Sheng Zhang · Jiaao Yu · Zibo Zhao · Lliu Yuhong · Jie Jiang · Chunchao Guo
[ Exhibit Hall I ]
Abstract
Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D space using image diffusion models, followed by a texture baking process to obtain the UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
Poster
Jiaqi Liao · Zhengyuan Yang · Linjie Li · Dianqi Li · Kevin Lin · Yu Cheng · Lijuan Wang
[ Exhibit Hall I ]
Abstract
In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. However, we observe that MLLMs often produce unstructured reasoning steps, resulting in suboptimal outcomes. To tackle this issue, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. However, due to the complexity of T2I-ICL tasks, there is still significant room for improvement. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain by varying the random seed. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80\% performance gain for SEED-X on T2I-ICL tasks.
Poster
Eric Slyman · Mehrab Tanjim · Kushal Kafle · Stefan Lee
[ Exhibit Hall I ]
Abstract
Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called **M**ultimodal **M**ixture-of-**B**ayesian Prompt Ensembles (MMB). Our approach uses a Bayesian prompt ensemble augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
Poster
Jisoo Kim · Wooseok Seo · Junwan Kim · Seungho Park · Sooyeon Park · Youngjae Yu
[ Exhibit Hall I ]
Abstract
Despite the remarkable success of text-to-video (T2V) generation, its large memory requirements limit deployment in resource-constrained environments, leading to extensive research on model pruning and knowledge distillation to enhance efficiency while preserving performance. However, existing distillation methods primarily rely on the supervised fine-tuning (SFT) loss, which, due to the reduced capacity of pruned models, struggles to capture fine-grained details. This leads to averaged predictions and ultimately degrades overall quality. To mitigate this challenge, we propose an effective distillation method, \loss, that combines DPO and SFT, leveraging DPO's ability to guide the student model in learning preferences for the properties that limit it while de-emphasizing less critical ones, complemented by SFT to enhance overall performance. Along with \loss, our framework, \ours, includes filtering and curation of high-quality datasets, as well as a step-by-step online approach for more effective learning. We implement our method on two baseline models, VideoCrafter2 and AnimateDiff, achieving parameter reductions of 36.2\% in VideoCrafter and 67.5\% in the AnimateDiff motion module, while maintaining or even surpassing the performance of the full models. Further experiments validate the effectiveness of our \loss loss and \ours framework, demonstrating their impact on efficient and high-quality video generation.
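The combined objective can be sketched as a weighted mix of an SFT term and a DPO preference term computed against a frozen reference model; the scalar score inputs, `beta`, and `alpha` below are assumptions standing in for the unnamed \loss formulation, not the authors' code.

```python
# Minimal sketch (assumed form, not the paper's exact objective): a distillation
# loss that mixes supervised fine-tuning (SFT) with a DPO-style preference term.
# score_* are per-sample log-likelihood proxies (e.g., negative denoising error)
# from the student and a frozen reference model on preferred/dispreferred videos.
import torch
import torch.nn.functional as F

def dpo_sft_loss(sft_loss, score_w, score_l, ref_score_w, ref_score_l,
                 beta=0.1, alpha=0.5):
    """score_w/score_l: student scores on preferred (w) and dispreferred (l) samples."""
    # Standard DPO: reward margin of the student relative to the reference model.
    margin = beta * ((score_w - ref_score_w) - (score_l - ref_score_l))
    dpo = -F.logsigmoid(margin).mean()
    return alpha * sft_loss + (1.0 - alpha) * dpo
```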
Poster
Renxi Cheng · Hongsong Wang · Yang Zhang · Chaolei Han · Jie Gui
[ Exhibit Hall I ]
Abstract
The rapid advancement of GAN and diffusion models makes it increasingly difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture the intrinsic noise features present in the raw images. To solve these problems, we refine error extraction using bit-plane-based image processing, as the lower bit planes predominantly capture noise patterns in images. We introduce an effective bit-plane-guided noisy image generation scheme and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute a noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: a noise-based classifier and a noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (11.9%$\uparrow$) and shows excellent cross-generator generalization capability. In particular, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion …
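A sketch of the bit-plane noise extraction and maximum-gradient patch selection under assumed details: the number of retained bit planes, the patch size, and the diagonal term in the gradient score are illustrative choices, not the paper's exact recipe.

```python
# Illustrative sketch (assumed details): build a noise image from the lowest bit
# planes, then select the patch whose multi-directional gradient score is largest.
import numpy as np

def bitplane_noise(gray: np.ndarray, n_planes: int = 2) -> np.ndarray:
    """gray: (H, W) uint8 image; keep only the n lowest bit planes, rescaled to [0, 1]."""
    mask = (1 << n_planes) - 1
    return (gray & mask).astype(np.float32) / mask

def max_gradient_patch(noise: np.ndarray, patch: int = 64) -> np.ndarray:
    gy, gx = np.gradient(noise)
    score_map = np.abs(gx) + np.abs(gy)
    # Add a diagonal difference as a third gradient direction.
    score_map[:-1, :-1] += np.abs(noise[1:, 1:] - noise[:-1, :-1])
    h, w = noise.shape
    best, best_score = None, -1.0
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            s = score_map[y:y + patch, x:x + patch].sum()
            if s > best_score:
                best, best_score = noise[y:y + patch, x:x + patch], s
    return best
```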
Poster
Lennart Bastian · Mohammad Rashed · Nassir Navab · Tolga Birdal
[ Exhibit Hall I ]
Abstract
Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided by $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings.
Poster
Zichen Liu · Yihao Meng · Hao Ouyang · Yue Yu · Bolin Zhao · Daniel Cohen-Or · Huamin Qu
[ Exhibit Hall I ]
Abstract
Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The animation is represented by a canonical field that aggregates the semantic content in a canonical shape and a deformation field that applies per-frame motion to deform the canonical shape. Two fields are jointly optimized by the priors from a large pretrained text-to-video diffusion model using score-distillation loss with designed regularization, encouraging the video coherence with the intended textual concept while maintaining legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our methodology over baselines. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability.
Poster
Sicheng Zhang · Binzhu Xie · Zhonghao Yan · Yuli Zhang · Donghao Zhou · Xiaofei Chen · Shi Qiu · Jiaqi Liu · Guoyang Xie · Zhichao Lu
[ Exhibit Hall I ]
Abstract
Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models' complex trade-offs among these dimensions have rarely been explored, due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To address this gap, we introduce **TRIG-Bench** (**Tr**ade-offs in **I**mage **G**eneration), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity, and Bias), contains over 40,200 samples, and covers 132 **Pairwise Dimensional Subsets**. Furthermore, we develop **TRIGScore**, a VLM-as-judge metric that automatically adapts to various dimensions. Based on this, we evaluate 14 cutting-edge models across T2I and I2I tasks. In addition, we propose the Relation Recognition System and generate the Dimension Trade-off Map (**DTM**), which visualizes model-specific capability trade-offs. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, after fine-tuning on DTM, the model's dimension-specific trade-offs are mitigated and overall performance is enhanced.
Poster
Zeyi Sun · Ziyang Chu · Pan Zhang · Tong Wu · Xiaoyi Dong · Yuhang Zang · Yuanjun Xiong · Dahua Lin · Jiaqi Wang
[ Exhibit Hall I ]
Abstract
Recent advances in large language models have enabled task prompting for open-ended text generation. In the vision domain, a longstanding goal is developing models capable of general visual learning, encompassing tasks such as image generation, editing, low-level processing, and dense perception. Although recent efforts have aimed at building vision foundation models that support prompting, significant challenges remain, particularly in accurately comprehending visual prompts and addressing the ambiguity inherent in textual prompts. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed for generalizable visual learning via in-context prompting. X-Prompt can process visual and textual prompts as context, enabling precise task interpretation and accurate execution. A novel prompt-token fusion mechanism effectively extracts relevant task information from complex prompts while significantly reducing the token length. Additionally, a unified training strategy for text and image prediction enhances task awareness, enabling seamless adaptation to open-ended prompts. Extensive experiments demonstrate that X-Prompt effectively interprets in-context prompts and exhibits generalization across both in-domain and out-of-domain visual tasks, paving the way for future advancements in general visual learning.
Poster
Yuwei Guo · Ceyuan Yang · Ziyan Yang · Zhibei Ma · Zhijie Lin · Zhenheng Yang · Dahua Lin · Lu Jiang
[ Exhibit Hall I ]
Abstract
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and semantic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that extends the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including composable generation and interactive shot extension, paving the way for more practical visual content creation.
Poster
Junjia Huang · Pengxiang Yan · Jiyang Liu · Jie Wu · Zhao Wang · Yitong Wang · Liang Lin · Guanbin Li
[ Exhibit Hall I ]
Abstract
Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
Poster
Ziye Li · Xincheng Shuai · Hao Luo · Henghui Ding
[ Exhibit Hall I ]
Abstract
Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves state-of-the-art performance, marking a significant advancement in spatial- and motion-controlled video generation.
Poster
Jiarui Wang · Huiyu Duan · Yu Zhao · Juntong Wang · Guangtao Zhai · Xiongkuo Min
[ Exhibit Hall I ]
Abstract
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation along multiple dimensions, including perceptual quality, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K and exhibits strong generalization ability on other AI-generated image evaluation benchmarks, demonstrating the generality of both the EvalMi-50K dataset and the LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released upon publication.
Poster
Pengzhen Chen · Yanwei Liu · Xiaoyan Gu · Enci Liu · Zhuoyi Shang · Xiangyang Ji · Wu Liu
[ Exhibit Hall I ]
Abstract
Diffusion models have significantly advanced the field of image synthesis, making the protection of their intellectual property (IP) a critical concern. Existing IP protection methods primarily focus on embedding watermarks into generated images by altering the structure of the diffusion process. However, these approaches inevitably compromise the quality of the generated images and are particularly vulnerable to fine-tuning attacks, especially for open-source models such as Stable Diffusion (SD). In this paper, we propose PlugMark, a novel plug-in zero-watermarking framework for diffusion models. The core idea of PlugMark is based on two observations: a classifier can be uniquely characterized by its decision boundaries, and a diffusion model can be uniquely represented by the knowledge acquired from its training data. Building on this foundation, we introduce a diffusion knowledge extractor that can be plugged into a diffusion model to extract its knowledge and output a classification result. PlugMark subsequently generates boundary representations based on this classification result, serving as a zero-distortion watermark that uniquely represents the decision boundaries and, by extension, the knowledge of the diffusion model. Since only the extractor requires training, the performance of the original diffusion model remains unaffected. Extensive experimental results demonstrate that PlugMark can robustly extract high-confidence zero-watermarks from both …
Poster
Yongsheng Yu · Ziyun Zeng · Haitian Zheng · Jiebo Luo
[ Exhibit Hall I ]
Abstract
Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing.
Poster
Yuxin Jiang · Liming Jiang · Shuai Yang · Jia-Wei Liu · Ivor Tsang · Mike Zheng Shou
[ Exhibit Hall I ]
Abstract
We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments. Code and models will be released.
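Progressive Spectrum Regularization can be approximated by a band-limited content loss in the Fourier domain whose constrained band follows an optimization schedule; the sketch below is one plausible instantiation, with the cutoff schedule and loss form chosen as assumptions rather than taken from the paper.

```python
# Hedged sketch (assumed schedule, not the paper's implementation): a frequency-
# domain content loss applied only inside a low-frequency disk whose radius
# follows the optimization progress, so coarse layout is constrained while finer
# frequency bands are handled progressively.
import torch

def progressive_spectrum_loss(x, content, progress: float):
    """x, content: (B, C, H, W); progress in [0, 1] (fraction of steps done)."""
    fx = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    fc = torch.fft.fftshift(torch.fft.fft2(content), dim=(-2, -1))
    b, c, h, w = x.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    radius = torch.sqrt(xx ** 2 + yy ** 2).to(x.device)
    # The constrained low-frequency band follows a simple linear schedule.
    cutoff = 0.1 + 0.9 * progress
    mask = (radius <= cutoff).float()
    return ((fx - fc).abs() ** 2 * mask).mean()
```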
Poster
Ge Gao · Siyue Teng · Tianhao Peng · Fan Zhang · David Bull
[ Exhibit Hall I ]
Abstract
While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts given the same coding configuration. In this context, we propose a **G**enerative **I**mplicit **Vi**deo **C**ompression framework, **GIViC**, aiming at advancing the performance limits of this type of coding methods. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting *long-term dependencies*. Through the newly designed *implicit diffusion* process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel **Hierarchical Gated Linear Attention-based transformer** (HGLA), is also integrated into the framework, which dual-factorizes global dependency modeling along scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the **first INR-based video codec that outperforms VTM based on the RA coding configuration**. The source code will be made available.
Poster
Peyman Gholami · Robert Xiao
[ Exhibit Hall I ]
Abstract
Denoising diffusion models have emerged as powerful tools for image manipulation, yet interactive, localized editing workflows remain underdeveloped. We introduce Layered Diffusion Brushes (LDB), a novel framework that facilitates real-time and iterative image editing with fine-grained, region-specific control. LDB leverages a unique approach that caches intermediate latent states within the diffusion process, enabling users to apply prompt-guided edits via masks in a non-destructive, layered manner. Key innovations include latent caching for significant speed enhancements (achieving edits in under 140ms on consumer GPUs) and redefining layering for diffusion models with an order-agnostic system that allows for independent manipulation and stacking of edits, even in overlapping regions. An editor implementing LDB, incorporating familiar layer concepts, was evaluated through user study and quantitative metrics. Results demonstrate LDB's superior speed alongside comparable or improved image quality, background preservation, and edit fidelity relative to existing state-of-the-art techniques across various sequential image manipulation tasks. The findings highlight LDB's potential to significantly enhance creative workflows by providing an intuitive and efficient approach to diffusion-based image editing and its potential for expansion into related subdomains, such as video editing.
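The latent-caching idea can be sketched as a store keyed by the stack of edit layers applied so far, so tweaking one layer resumes denoising from the cached intermediate state; the `LatentCache` class and its key format are assumptions, not the LDB implementation.

```python
# Minimal sketch (assumed structure): cache the intermediate latent at the step
# where an edit layer starts, so re-applying or tweaking that layer resumes
# denoising from the cached state instead of from pure noise.
import torch

class LatentCache:
    def __init__(self):
        self._store = {}

    def key(self, layer_stack):
        # A layer is identified by e.g. (prompt, mask_hash, strength); the cache
        # key is the tuple of all layers applied before the one being edited.
        return tuple(layer_stack)

    def get(self, layer_stack):
        return self._store.get(self.key(layer_stack))

    def put(self, layer_stack, step: int, latent: torch.Tensor):
        self._store[self.key(layer_stack)] = (step, latent.detach().clone())

# Usage: when the user tweaks layer k, fetch the latent cached after layers
# 0..k-1 and run only the remaining denoising steps with the new masked prompt.
```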
Poster
Tianyu Zhang · Xin Luo · Li Li · Dong Liu
[ Exhibit Hall I ]
Abstract
Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of a pair of auxiliary encoders and decoders, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak dataset demonstrate that StableCodec outperforms existing methods in terms of FID, KID and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable …
Poster
Peng Zheng · Junke Wang · Yi Chang · Yizhou Yu · Rui Ma · Zuxuan Wu
[ Exhibit Hall I ]
Abstract
Recent advances in large language models (LLMs) have spurred interest in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored autoregressively predicting continuous tokens. Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces $\textbf{DisCon}$ (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of $\textbf{1.38}$ on ImageNet 256$\times$256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
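The discrete-conditioned modeling can be sketched as a small head that maps the autoregressive hidden state plus the discrete token's embedding to the parameters of a Gaussian over the continuous token; the architecture, dimensions, and Gaussian likelihood below are assumptions rather than the DisCon design.

```python
# Illustrative sketch (assumptions, not the DisCon architecture): predict a
# continuous visual token conditioned on a discrete token by emitting the mean
# and log-variance of a Gaussian from the discrete token's embedding plus the
# autoregressive hidden state.
import torch
import torch.nn as nn

class DiscreteConditionedHead(nn.Module):
    def __init__(self, vocab_size=16384, d_model=1024, d_cont=32):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, d_model)
        self.net = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, 2 * d_cont))

    def forward(self, hidden, discrete_ids):
        """hidden: (B, T, d_model) AR states; discrete_ids: (B, T) token ids."""
        cond = torch.cat([hidden, self.codebook(discrete_ids)], dim=-1)
        mean, log_var = self.net(cond).chunk(2, dim=-1)
        return mean, log_var            # parameters of p(continuous | discrete, context)

def nll(mean, log_var, target):
    # Gaussian negative log-likelihood of the ground-truth continuous token.
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()
```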
Poster
ZiYi Dong · Chengxing Zhou · Weijian Deng · Pengxu Wei · Xiangyang Ji · Liang Lin
[ Exhibit Hall I ]
Abstract
Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose \(\Delta\)ConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (\(\Delta\)ConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, \(\Delta\)ConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency—all without compromising generative fidelity.
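A hedged sketch of what a pyramid convolution block could look like: several depthwise convolutions with growing dilation approximate localized attention patterns, followed by a pointwise projection and a residual connection. The module name, dilation set, and residual form are assumptions, not the paper's \(\Delta\)ConvBlock.

```python
# Hedged sketch (assumed design): a pyramid convolution block that mixes several
# receptive-field scales with depthwise convolutions, standing in for the
# localized attention patterns observed in pre-trained diffusion models.
import torch
import torch.nn as nn

class PyramidConvBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d,
                      groups=channels)            # depthwise, growing receptive field
            for d in dilations
        ])
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches) / len(self.branches)
        return x + self.proj(out)                 # residual, like an attention block

# Distillation would then match this block's output to the frozen attention
# module's output on the same inputs while the rest of the network stays fixed.
```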
Poster
Hongwei Yu · Xinlong Ding · Jiawei Li · Jinlong Wang · Yudong Zhang · Rongquan Wang · Huimin Ma · Jiansheng Chen
[ Exhibit Hall I ]
Abstract
While image conditional diffusion models demonstrate impressive generation capabilities, they exhibit high vulnerability when facing backdoor and adversarial attacks. In this paper, we define a scenario named diffusion anomaly where generated results of a reverse process under attack deviate significantly from the normal ones. By analyzing the underlying formation mechanism of the diffusion anomaly, we reveal how perturbations are amplified during the reverse process and accumulated in the results. Based on the analysis, we reveal the phenomena of divergence and homogeneity, which cause the diffusion process to deviate significantly from the normal process and to decline in diversity. Leveraging these two phenomena, we propose a method named Diffusion Anomaly Detection (DADet) to effectively detect both backdoor and adversarial attacks. Extensive experiments demonstrate that our proposal achieves excellent defense performance against backdoor and adversarial attacks. Specifically, for the backdoor attack detection, our method achieves an F1 score of 99\% on different datasets including MS COCO and CIFAR-10. For the detection of adversarial samples, the F1 score exceeds 84\% across three adversarial attacks and two different tasks, evaluated on the MS COCO and Places365 datasets respectively.
Poster
Junho Lee · Jeongwoo Shin · Hyungwook Choi · Joonseok Lee
[ Exhibit Hall I ]
Abstract
In spite of remarkable potential of the Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs).
Poster
Yukai Shi · Jiarong Ou · Rui Chen · Haotian Yang · Jiahao Wang · Xin Tao · Pengfei Wan · Di ZHANG · Kun Gai
[ Exhibit Hall I ]
Abstract
In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors for poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. On our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few lines of code.
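The abstract does not give the form of the IMBA loss. A plausible reading is a per-concept reweighting of the unreduced training loss so that rare concepts contribute as much as frequent ones; the sketch below implements that reading and should be taken as a hypothetical illustration, not the paper's formulation.

```python
import torch

def concept_equalized_loss(per_sample_loss, concept_ids, eps=1.0):
    """Hypothetical concept-wise equalization: weight each sample's loss by the
    inverse frequency of its concept in the current batch, then renormalize so
    the overall loss scale is preserved. An online scheme with no offline stats."""
    counts = torch.bincount(concept_ids)                   # per-concept batch counts
    weights = 1.0 / (counts[concept_ids].float() + eps)
    weights = weights * (weights.numel() / weights.sum())  # keep mean weight ~1
    return (weights * per_sample_loss).mean()

loss = concept_equalized_loss(torch.rand(8), torch.tensor([0, 0, 0, 1, 1, 2, 2, 3]))
print(loss)
```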
Poster
Junyuan Zhang · Qintong Zhang · Bin Wang · Linke Ouyang · Zichen Wen · Ying Li · Ka-Ho Chow · Conghui He · Wentao Zhang
[ Exhibit Hall I ]
Abstract
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 8,561 carefully selected unstructured document images from seven real-world RAG application domains, along with 8,498 Q\&A pairs derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise: Semantic Noise and Formatting Noise, and apply perturbations to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the trend relationship between the degree of OCR noise and RAG …
Poster
Yang Liu · Xudong Xie · Yuliang Liu · Xiang Bai
[ Exhibit Hall I ]
Abstract
Overlapping text poses significant challenges for text-related perception tasks, particularly in open scenes characterized by diverse fonts and visual effects. While existing research has primarily addressed the overlapping problem in documents, its applicability to other scenes remains limited. To bridge this gap, we propose a new task of multi-scenario overlapping text segmentation and introduce a corresponding real dataset in both English and Chinese, spanning various contexts such as printed text, bills, artistic designs, and house numbers. To further enhance the generalization of overlapping text segmentation models, we propose a hierarchical training data synthesis strategy that simulates diverse overlapping patterns across different scenarios. Furthermore, we found that depth maps can provide clear relative position relationships in three-dimensional space, assisting the model in capturing complex overlapping relationships between text instances. Building on this insight, we present a depth-guided decoder that seamlessly integrates image and depth features to capture overlapping interactions. Our proposed model achieves a 5.3% improvement in text mIoU and a 6.4% improvement in overall mIoU compared to existing SOTA methods on our benchmark and SignaTR6k datasets, respectively.
Poster
Xingsong Ye · Yongkun Du · Yunbo Tao · Zhineng Chen
[ Exhibit Hall I ]
Abstract
Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained models. Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key synthesizing characteristics: accuracy, realism, and scalability. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement, ensuring proper character placement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background. This character-aware diffusion architecture enjoys precise character-level control and semantic coherence preservation, without relying on natural language prompts. Therefore, TextSSR supports large-scale generation through combinatorial text permutations. Based on these, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and further improvements are observed when mixed with real-world training data. Code is available in Supplementary Materials.
Poster
Zexuan Yan · Yue Ma · Chang Zou · Wenteng Chen · Qifeng Chen · Linfeng Zhang
[ Exhibit Hall I ]
Abstract
Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we observe that redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion process. To tackle these challenges, we propose a practical framework, named \textbf{EEdit}, to achieve efficient image editing. Specifically, we introduce three techniques to address these sources of redundancy one by one. \textbf{For spatial redundancy}, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. \textbf{For temporal redundancy}, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average of \textbf{\textcolor{blue}{2.46}}$\times$ acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging, and image composition. Our codes are available in the supplementary material and will be released on Github.
Poster
Yuhan Li · Xianfeng Tan · Wenxiang Shang · Yubo Wu · Jian Wang · Xuanhong Chen · Yi Zhang · Zhu Hangcheng · Bingbing Ni
[ Exhibit Hall I ]
Abstract
Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to the highly standardized structure of the target sampling distribution and the absence of clothing semantics in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusion process. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structure- and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
Poster
Xi Yu · Xiang Gu · Zhihao Shi · Jian Sun
[ Exhibit Hall I ]
Abstract
Large-scale text-to-image diffusion models have achieved remarkable success in image generation, thereby driving the development of stylized image generation technologies. Recent studies introduce style information by empirically replacing specific features in attention blocks with style features. However, the relationship between features and style remains unclear. In this paper, we systematically analyze the relationship between features in attention blocks and style. By quantifying the distribution discrepancy induced by style variations using the Wasserstein distance, we find that features in self-attention blocks exhibit high sensitivity to style compared to features in cross-attention blocks. Our analysis provides valuable insights into the contribution of different features to style. Based on our findings, we propose a novel Wasserstein Style Distribution Transform (WSDT) method, which generates stylized images by transforming the distribution of style-sensitive features to align with that of style features. WSDT applies a channel-adaptive distribution transform to ensure that information not related to the style is not introduced. Our approach is simple yet efficient, optimization-free, and can be seamlessly integrated into attention-based text-to-image diffusion models. Extensive experiments demonstrate the effectiveness of our approach in stylized image generation tasks.
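The abstract identifies self-attention features as style-sensitive and describes a channel-adaptive distribution transform. Under a per-channel Gaussian approximation, the Wasserstein-2 optimal map between two 1-D distributions reduces to an affine mean/variance transform, which is what the sketch below applies; the paper's actual WSDT may estimate the transform differently, and the shapes and names are illustrative.

```python
import torch

def channel_distribution_transform(content_feat, style_feat, eps=1e-5):
    """Align per-channel statistics of style-sensitive (self-attention) features
    with those recorded from a style pass. For per-channel Gaussians this affine
    map is the Wasserstein-2 optimal transport map. Shapes: (B, N, C) tokens."""
    c_mu, c_std = content_feat.mean(1, keepdim=True), content_feat.std(1, keepdim=True)
    s_mu, s_std = style_feat.mean(1, keepdim=True), style_feat.std(1, keepdim=True)
    return (content_feat - c_mu) / (c_std + eps) * s_std + s_mu

content = torch.randn(1, 4096, 320)   # self-attention features from the content pass
style = torch.randn(1, 4096, 320)     # features recorded from a style-image pass
print(channel_distribution_transform(content, style).shape)
```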
Poster
Liya Ji · Chenyang Qi · Qifeng Chen
[ Exhibit Hall I ]
Abstract
Editing images via instruction provides a natural way to generate interactive content, but it remains challenging due to the higher requirements on scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only \textit{single}-modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new \textit{multi-modality} model that provides intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we separate the instruction editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, for edited image generation, a hint-guided instruction-based editing network is proposed, built on a large text-to-image diffusion model, to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images. Source code will be publicly available.
Poster
Lisa Dunlap · Trevor Darrell · Joseph Gonzalez · Fabian Caba Heilbron · Josef Sivic · Bryan Russell
[ Exhibit Hall I ]
Abstract
In this paper, we investigate when and how visual representations learned by two different generative models {\bf diverge} from each other. Specifically, given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model’s outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate our method's ability to find diverging representations, we create an automated data generation pipeline to produce ID$^2$, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we apply CompCon to compare two popular text-to-image models, PixArt and SD-Lightning. We find diverging representations, such as prompts mentioning loneliness resulting in depictions of "wet streets" in PixArt, as well as biases, such as PixArt generating older men for prompts mentioning traditional professions.
Poster
Aryan Yazdan Parast · Basim Azam · Naveed Akhtar
[ Exhibit Hall I ]
Abstract
Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model’s reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves …
Poster
Yuanshen Guan · Ruikang Xu · Yinuo Liao · Mingde Yao · Lizhi Wang · Zhiwei Xiong
[ Exhibit Hall I ]
Abstract
While diffusion models have demonstrated significant success in standard dynamic range (SDR) image synthesis, generating high dynamic range (HDR) images with higher luminance and broader color gamuts remains challenging. This arises primarily from two factors: (1) The incompatibility between pretrained SDR image auto-encoders and the high-bit-depth HDR images; (2) The lack of large-scale HDR image datasets for effective learning and supervision. In this paper, we propose a novel framework for HDR image generation with two key innovations: (1) Decomposed HDR Image Generation: We leverage a double-layer HDR image format to decompose the HDR image into two low-bit-depth components: an SDR image with a corresponding Gain Map (GM). This format is inherently compatible with pretrained SDR auto-encoders, motivating the decomposition of HDR image generation into SDR image and GM prediction. (2) Unsupervised Data Construction: We develop an automated pipeline to construct ``Text-SDR-GM'' triplets from large-scale text-image datasets by brightness-aware compression and gamut-constrained reduction, enabling unsupervised learning of GMs without ground-truth data. Building upon these innovations, we adapt the Stable Diffusion model to jointly predict GMs and SDR images, enabling high-quality decomposed HDR image generation. Experiments show that our framework excels in HDR image generation and SDR-to-HDRTV up-conversion, generalizing well across diverse scenes …
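For readers unfamiliar with the double-layer format: a gain map stores per-pixel log-luminance ratios between the HDR and SDR renditions, so the HDR image can be recovered from the two low-bit-depth components. The sketch below assumes the simplest reconstruction, HDR = SDR · 2^gain with a normalized gain map mapped to a stop range; the exact format and range parameters used in the paper may differ.

```python
import numpy as np

def reconstruct_hdr(sdr, gain_map, gm_min=0.0, gm_max=4.0):
    """Recombine a linear SDR image and a normalized gain map into linear HDR.
    Assumes HDR = SDR * 2**gain, with the gain map stored in [0, 1] and mapped
    to [gm_min, gm_max] stops; real gain-map formats add headroom metadata."""
    gain_stops = gm_min + gain_map * (gm_max - gm_min)
    return sdr.astype(np.float32) * (2.0 ** gain_stops)

sdr = np.random.rand(256, 256, 3).astype(np.float32)   # linearized SDR component
gm = np.random.rand(256, 256, 1).astype(np.float32)    # predicted gain map
print(reconstruct_hdr(sdr, gm).max())
```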
Poster
Jongseo Lee · Kyungho Bae · Kyle Min · Gyeong-Moon Park · Jinwoo Choi
[ Exhibit Hall I ]
Abstract
In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). We are inspired by the human memory system, which integrates episodic and semantic memory for accurate information retrieval. ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic and semantic memory through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
Poster
Yufan Liu · Wanqian Zhang · Huashan Chen · Lin Wang · Xiaojun Jia · Zheng Lin · Weiping Wang
[ Exhibit Hall I ]
Abstract
Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrate a dual-evasion strategy in the optimization phase, enabling the bypass of both perplexity-based filters and blacklist word filters: (1) we constrain the LLM to generate human-readable prompts through auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we introduce banned-token penalties to suppress the explicit generation of banned tokens in the blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability, which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).
Poster
Jeonghoon Park · Juyoung Lee · Chaeyeon Chung · Jaeseong Lee · Jaegul Choo · Jindong Gu
[ Exhibit Hall I ]
Abstract
Recent advancements in diffusion-based text-to-image (T2I) models have enabled the generation of high-quality and photorealistic images from text descriptions. However, they often exhibit societal biases related to gender, race, and socioeconomic status, thereby reinforcing harmful stereotypes and shaping public perception in unintended ways. While existing bias mitigation methods demonstrate effectiveness, they often encounter attribute entanglement, where adjustments to attributes relevant to the bias (i.e., target attributes) unintentionally alter attributes unassociated with the bias (i.e., non-target attributes), causing undesirable distribution shifts. To address this challenge, we introduce Entanglement-Free Attention (EFA), a method that accurately incorporates target attributes (e.g., African, Asian, and Indian) while preserving non-target attributes (e.g., background details) during bias mitigation. At inference time, EFA randomly samples a target attribute with equal probability and adjusts the cross-attention in selected layers to incorporate the sampled attribute, achieving a fair distribution of target attributes. Extensive experiments demonstrate that EFA outperforms existing methods in mitigating bias while preserving non-target attributes, thereby maintaining the output distribution and generation capability of the original model.
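The mechanism the abstract states explicitly is uniform sampling of one target attribute at inference plus an adjustment of cross-attention in selected layers only. The sketch below mirrors that control flow; the attribute list comes from the abstract, while the layer set, blending rule, and parameter names are assumptions.

```python
import random

TARGET_ATTRIBUTES = ["African", "Asian", "Indian"]  # example attributes from the abstract
SELECTED_LAYERS = {4, 5, 6}                          # hypothetical layers to adjust

def sample_target_attribute():
    """Uniform sampling (probability 1/K each) to enforce a fair distribution
    of target attributes across generations."""
    return random.choice(TARGET_ATTRIBUTES)

def adjust_cross_attention(layer_idx, attn_probs, attr_attn_probs, alpha=0.5):
    """Hypothetical per-layer adjustment: blend attention toward the sampled
    attribute's tokens only in selected layers, leaving other layers (and hence
    non-target attributes such as the background) untouched."""
    if layer_idx not in SELECTED_LAYERS:
        return attn_probs
    return (1 - alpha) * attn_probs + alpha * attr_attn_probs

print(sample_target_attribute())
```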
Poster
Yufei Wang · Lanqing Guo · Zhihao Li · Jiaxing Huang · Pichao WANG · Bihan Wen · Jian Wang
[ Exhibit Hall I ]
Abstract
Text-guided image editing is an essential task, enabling users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token re-assembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as fast as 1.2 seconds. Extensive …
Poster
Jiahui Yang · Yongjia Ma · Donglin Di · Hao Li · Chen Wei · Xie Yan · Jianxun Cui · Xun Yang · Wangmeng Zuo
[ Exhibit Hall I ]
Abstract
Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes, but suffer from cross-attribute interference when combining multiple LoRA models. This interference stems from unstructured modifications of weight matrices, particularly evident in content-style fusion tasks where merging adaptations leads to undesired feature entanglement. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Extensive experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
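The parameterization itself is stated in the abstract: decompose a pretrained weight as W = QR, freeze Q and R, and train only a task-specific $\Delta R$, so that merging adaptations amounts to summing their $\Delta R$ terms. The sketch below is a minimal single-layer illustration of that idea; rank constraints on $\Delta R$ and other training details are omitted.

```python
import torch
import torch.nn as nn

class QRLoRALinear(nn.Module):
    """Minimal QR-LoRA-style linear layer: W = Q R is frozen, only Delta R is
    trained, and the effective weight is Q (R + Delta R)."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        Q, R = torch.linalg.qr(weight)                      # weight: (out, in)
        self.register_buffer("Q", Q)                        # frozen orthogonal factor
        self.register_buffer("R", R)                        # frozen upper-triangular factor
        self.delta_R = nn.Parameter(torch.zeros_like(R))    # trainable update

    def forward(self, x):
        W = self.Q @ (self.R + self.delta_R)
        return x @ W.T

    def merged_weight(self, other_deltas=()):
        """Merging several adaptations reduces to summing their Delta R terms."""
        return self.Q @ (self.R + self.delta_R + sum(other_deltas))

layer = QRLoRALinear(torch.randn(64, 64))
print(layer(torch.randn(2, 64)).shape)        # torch.Size([2, 64])
```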
Poster
jing Yang · Qunliang Xing · Mai Xu · Minglang Qiao
[ Exhibit Hall I ]
Abstract
Joint Photographic Experts Group (JPEG) achieves data compression by quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably introduces compression artifacts. Most existing JPEG quality enhancement methods operate in the pixel domain, suffering from the high computational costs of decoding. Consequently, direct enhancement of JPEG images in the DCT domain has gained increasing attention. However, current DCT-domain methods often exhibit limited performance. To address this challenge, we identify two critical types of correlations within the DCT coefficients of JPEG images. Building on this insight, we propose an Advanced DCT-domain JPEG Quality Enhancement (AJQE) method that fully exploits these correlations. The AJQE method enables the adaptation of numerous well-established pixel-domain models to the DCT domain, achieving superior performance with reduced computational complexity. Compared to the pixel-domain counterparts, the DCT-domain models derived by our method demonstrate a 0.35 dB improvement in PSNR and a 60.5% increase in enhancement throughput on average. The code will be made publicly available.
Poster
Junxiang Qiu · Lin Liu · Shuo Wang · Jinda Lu · Kezhou Chen · Yanbin Hao
[ Exhibit Hall I ]
Abstract
Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50\% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC's superior trade-off between efficiency and quality. With 50\% cached blocks, GOC achieves IS 216.28 (26.3\%↑) and FID 3.907 (43\%↓) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements. The code is available at Supplementary …
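A minimal sketch of the cached-gradient idea described above: keep a queue of differences between recomputed and cached features, and apply a weighted correction when a cached feature is reused. The weighting scheme and the inflection-point logic are simplified away here, so this illustrates the bookkeeping rather than the paper's full method.

```python
from collections import deque

class GradientOptimizedCacheSketch:
    """Toy cached-gradient compensation. Works with any feature type that
    supports arithmetic (floats, NumPy arrays, torch tensors)."""
    def __init__(self, weight=0.5, maxlen=4):
        self.cache = {}          # block_id -> last computed feature
        self.grad_queue = {}     # block_id -> recent feature deltas
        self.weight = weight
        self.maxlen = maxlen

    def update(self, block_id, new_feat):
        """Call when a block is actually recomputed."""
        if block_id in self.cache:
            q = self.grad_queue.setdefault(block_id, deque(maxlen=self.maxlen))
            q.append(new_feat - self.cache[block_id])
        self.cache[block_id] = new_feat

    def reuse(self, block_id):
        """Call when the block is skipped: cached feature plus a weighted delta."""
        feat = self.cache[block_id]
        q = self.grad_queue.get(block_id)
        return feat + self.weight * q[-1] if q else feat

goc = GradientOptimizedCacheSketch()
goc.update("block3", 1.0)
goc.update("block3", 1.2)
print(goc.reuse("block3"))   # 1.2 + 0.5 * (1.2 - 1.0) = 1.3
```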
Poster
Ruoyu Wang · Huayang Huang · Ye Zhu · Olga Russakovsky · Yu Wu
[ Exhibit Hall I ]
Abstract
In this work, we introduce **NoiseQuery** as a novel method for enhanced noise initialization in versatile goal-driven text-to-image (T2I) generation. Specifically, we propose to leverage an aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts, for better generation quality and controllability. Unlike existing noise optimization methods designed for specific models, our approach is grounded in a fundamental examination of the generic finite-step noise scheduler design in diffusion formulation, allowing better generalization across different diffusion-based architectures in a **tuning-free manner**. This model-agnostic nature allows us to construct a reusable noise library compatible with multiple T2I models and enhancement techniques, serving as a foundational layer for more effective generation. Extensive experiments demonstrate that **NoiseQuery** enables fine-grained control and yields significant performance boosts not only over high-level semantics but also over **low-level visual attributes**, which are typically difficult to specify through text alone, with seamless integration into current workflows with minimal computational overhead.
Poster
Aniruddha Mahapatra · Long Mai · David Bourgin · Yitian Zhang · Feng Liu
[ Exhibit Hall I ]
Abstract
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4× without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.
Poster
Shengqi Liu · Yuhao Cheng · Zhuo Chen · Xingyu Ren · Wenhan Zhu · Lincheng Li · Mengxiao Bi · Xiaokang Yang · Yichao Yan
[ Exhibit Hall I ]
Abstract
Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose **SewingLDM**, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and various body adaptability.
Poster
Shijie Huang · Yiren Song · Yuxuan Zhang · Hailong Guo · Xueyin Wang · Jiaming Liu
[ Exhibit Hall I ]
Abstract
We introduce ArtEditor, a novel framework for instruction-based image editing that learns unique editing styles from few-shot examples. While image editing has seen significant advancements, customized instructional editing remains underexplored. Existing methods often rely on complex, multi-stage pipelines that are difficult to adapt to specific styles. Additionally, this domain lacks a standardized benchmark, making it challenging to evaluate progress. To address these issues, we propose ArtEditor, a two-stage training framework. In the first stage, we train ArtEditor-Base, a general-purpose image editing model, on large-scale datasets to build a strong foundational capability. In the second stage, we fine-tune this model using ArtEditor-LoRA, a lightweight adaptation module, on a small dataset of before-and-after image pairs. This approach enables the model to efficiently learn distinct editing styles and techniques with minimal data. To enhance the performance of a pre-trained Diffusion Transformer (DiT) model, we introduce two key innovations: position encoding cloning and a noise-free conditioning paradigm. These techniques ensure stable and coherent edits, even when adapting to new styles. To support research in this area, we contribute the DoodleArt dataset, the first benchmark specifically designed for customized image editing. DoodleArt features six high-quality artistic styles created by professional artists and designers, providing a …
Poster
Yikang Zhou · Tao Zhang · Shilin Xu · Shihao Chen · Qianyu Zhou · Yunhai Tong · Shunping Ji · Jiangning Zhang · Lu Qi · Xiangtai Li
[ Exhibit Hall I ]
Abstract
Recent advancements in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, even though finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching samples with reasoning annotations. To our knowledge, this is the first dataset and benchmark of its kind for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning and an instruction augmentation strategy. The former learns instance-discriminative tokens, while the latter further improves instruction-following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15% and 11.72% OA, respectively. These results …
Poster
Yunqiu Xu · Linchao Zhu · Yi Yang
[ Exhibit Hall I ]
Abstract
While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. In order to facilitate this research, we construct a new dataset MC-Bench that features 2K high-quality and manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended and follow three distinct styles, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, along with our developed simple yet effective agentic baseline and a finetuned baseline by multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with some insightful observations that suggest potential future directions. We hope that MC-Bench and our empirical findings encourage the research community to further explore and enhance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts. …
Poster
zikai zhou · Shitong Shao · Lichen Bai · Shufei Zhang · zhiqiang xu · Bo Han · Zeke Xie
[ Exhibit Hall I ]
Abstract
The text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are ``golden noises'' that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following the concept, we formulate the noise prompt learning framework that systematically learns ``prompted'' golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small noise prompt network (NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of …
Poster
JUNHAO WEI · YU ZHE · Jun Sakuma
[ Exhibit Hall I ]
Abstract
Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose the first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if it is merged with any other model, while its functionality is kept unchanged if it is not merged. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the protected model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning method to improve our proposal's robustness. Our code is available in the appendix.
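The abstract names two modules, rearranging MLP parameters and scaling attention heads, without giving formulas. A classic function-preserving rearrangement is to permute an MLP's hidden units (rows of the first layer, matching columns of the second): the model's own outputs are unchanged, but its weights move away from the basin shared with other fine-tuned models, so naive weight averaging degrades. The sketch below demonstrates only that permutation idea and is not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def permute_mlp_hidden(fc1: nn.Linear, fc2: nn.Linear, seed=0):
    """Function-preserving permutation of hidden units: y = fc2(act(fc1(x)))
    is unchanged for elementwise activations, but the raw parameters no longer
    align with an unpermuted model, which breaks naive weight averaging."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(fc1.out_features, generator=g)
    with torch.no_grad():
        fc1.weight.copy_(fc1.weight[perm])
        fc1.bias.copy_(fc1.bias[perm])
        fc2.weight.copy_(fc2.weight[:, perm])

fc1, fc2 = nn.Linear(16, 32), nn.Linear(32, 8)
x = torch.randn(4, 16)
before = fc2(torch.relu(fc1(x)))
permute_mlp_hidden(fc1, fc2)
after = fc2(torch.relu(fc1(x)))
print(torch.allclose(before, after, atol=1e-5))   # True: functionality preserved
```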
Poster
Guanjie Chen · Xinyu Zhao · Yucheng Zhou · Xiaoye Qu · Tianlong Chen · Yu Cheng
[ Exhibit Hall I ]
Abstract
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, their practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference. Through systematic analysis, we identify the absence of long-range feature preservation mechanisms as the root cause of unstable feature propagation and perturbation sensitivity. To this end, we propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) - the key efficiency component in U-Nets. Theoretical spectral norm and visualization analysis demonstrate how LSCs stabilize feature dynamics. The Skip-DiT architecture and its stabilized dynamic features enable an efficient static caching mechanism that reuses deep features across timesteps while updating shallow components. Extensive experiments across image and video generation tasks demonstrate that Skip-DiT achieves: (1) **4.4$\times$** training acceleration and faster convergence, (2) **1.5-2$\times$** inference acceleration without quality loss and with high fidelity to the original output, outperforming existing DiT caching methods across various quantitative metrics. Our findings establish long-skip connections as critical architectural components for training stable and efficient diffusion transformers. Codes are provided in the anonymous URL https://anonymous.4open.science/r/Skip-DiT-72B7/.
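A toy illustration of the Long-Skip-Connection wiring: outputs of the shallow half of the block stack are cached and fused, U-Net style, into the inputs of the mirrored deep blocks. Real DiT blocks are conditioned on timestep and class/text embeddings, which are omitted here; the fusion rule and depths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LongSkipToyDiT(nn.Module):
    """Toy transformer stack with U-Net-style long skip connections:
    block i's output is fused into block (depth-1-i)'s input."""
    def __init__(self, dim=256, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth))
        self.fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(depth // 2))

    def forward(self, x):                      # x: (B, N, dim) token sequence
        skips, half = [], len(self.blocks) // 2
        for i, blk in enumerate(self.blocks):
            if i >= half:                      # deep half: fuse the mirrored shallow feature
                skip = skips[len(self.blocks) - 1 - i]
                x = self.fuse[i - half](torch.cat([x, skip], dim=-1))
            x = blk(x)
            if i < half:
                skips.append(x)
        return x

print(LongSkipToyDiT()(torch.randn(2, 16, 256)).shape)   # torch.Size([2, 16, 256])
```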
Poster
Yihong Luo · Tianyang Hu · Jiacheng Sun · Yujun Cai · Jing Tang
[ Exhibit Hall I ]
Abstract
Accelerating diffusion model sampling is crucial for efficient AIGC deployment. While diffusion distillation methods -- based on distribution matching and trajectory matching -- reduce sampling to as few as one step, they fall short on complex tasks like text-to-image generation. Few-step generation offers a better balance between speed and quality, but existing approaches face a persistent trade-off: distribution matching lacks flexibility for multi-step sampling, while trajectory matching often yields suboptimal image quality. To bridge this gap, we propose learning few-step diffusion models by Trajectory Distribution Matching (TDM), a unified distillation paradigm that combines the strengths of distribution and trajectory matching. Our method introduces a data-free score distillation objective, aligning the student’s trajectory with the teacher’s at the distribution level. Further, we develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling. This approach supports both deterministic sampling for superior image quality and flexible multi-step adaptation, achieving state-of-the-art performance with remarkable efficiency. Our model, TDM, outperforms existing methods on various backbones, such as SDXL and PixArt-$\alpha$, delivering superior quality and significantly reduced training costs. In particular, our method distills PixArt-$\alpha$ into a 4-step generator that outperforms its teacher on real user preference at 1024 resolution. This is accomplished with …
Poster
LiWei Wang · YanDuo Zhang · Tao Lu · Fang Liu · Huiqin Zhang · Jiayi Ma · Huabing Zhou
[ Exhibit Hall I ]
Abstract
Dynamic Scene Graph Generation (DSGG) aims to comprehensively understand videos by abstracting them into visual triplets $<$\textit{subject}, \textit{predicate}, \textit{object}$>$. Most existing methods focus on capturing temporal dependencies, but overlook crucial visual relationship dependencies between entities and predicates, as well as among predicate subclasses. These dependencies are essential for a deeper contextual understanding of scenarios. Additionally, current approaches do not support end-to-end training and instead rely on a two-stage pipeline, which incurs higher computational costs. To address these issues, we propose an end-to-end \textbf{A}ssociation \textbf{R}easoning \textbf{N}etwork (ARN) for DSGG. ARN leverages CLIP’s semantic priors to model fine-grained triplet cues to generate scene graphs. In addition, we design a Predicate Association Parsing (PAP) module that employs a conditional weight mapping mechanism to structure entity and predicate representations. We further introduce a Hierarchical Attention (HA) mechanism to integrate spatio-temporal context with entity and predicate representations, enabling effective associative reasoning. Extensive experiments on the Action Genome dataset demonstrate significant performance improvements over existing methods.
Poster
Size Wu · Wenwei Zhang · Lumin Xu · Sheng Jin · Zhonghua Wu · Qingyi Tao · Wentao Liu · Wei Li · Chen Change Loy
[ Exhibit Hall I ]
Abstract
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval (instruction alignment) and MJHQ30K (visual quality) benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be released.
Poster
Zhiyuan Fang · Rengan Xie · Xuancheng Jin · Qi Ye · Wei Chen · Wenting Zheng · Rui Wang · Yuchi Huo
[ Exhibit Hall I ]
Abstract
Recently, the field of 3D scene stylization has attracted considerable attention, particularly for applications in the metaverse. A key challenge is rapidly transferring the style of an arbitrary reference image to a 3D scene while faithfully preserving its content structure and spatial layout. Works leveraging implicit representations with gradient-based optimization achieve impressive style transfer results, yet the lengthy processing time per individual style makes rapid switching impractical. In this paper, we propose A$^3$GS, a novel feed-forward neural network for zero-shot 3DGS stylization that enables transferring any image style to arbitrary 3D scenes in just 10 seconds without the need for per-style optimization. Our work introduces a Graph Convolutional Network (GCN)-based autoencoder aimed at efficient feature aggregation and decoding of spatially structured 3D Gaussian scenes. The encoder converts 3DGS scenes into a latent space. Furthermore, for the latent space, we utilize Adaptive Instance Normalization (AdaIN) to inject features from the target style image into the 3D Gaussian scene. Finally, we constructed a 3DGS dataset using a generative model and proposed a two-stage training strategy for A$^3$GS. Owing to the feed-forward design, our framework can perform fast style transfer on large-scale 3DGS scenes, which poses a severe challenge to the memory consumption …
Poster
Ata Çelen · Iro Armeni · Daniel Barath · Marc Pollefeys
[ Exhibit Hall I ]
Abstract
We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
Poster
Mingqi Fang · Ziguang Li · Lingyun Yu · Quanwei Yang · Hongtao Xie · Yongdong Zhang
[ Exhibit Hall I ]
Abstract
Recently, synthetic images have become incredibly realistic with the development of generative techniques. To avoid the spread of misinformation and identify synthetic content, research on synthetic image detection becomes urgent. Unfortunately, limited to a singular forensic perspective, existing methods struggle to explore sufficient traces when confronted with diverse synthetic techniques. In response to this, we argue that different synthetic images encompass a variety of forensic traces, and utilizing multiple experts to explore traces from diverse perspectives will be beneficial. Accordingly, a novel detector with the **M**ixture **o**f multiple forensic **E**xperts is proposed, named **Forensic-MoE**. To integrate multiple experts and enhance the knowledge interaction, Forensic-MoE follows an adapter-backbone architecture. Specifically, multiple adapters trained on different synthetic images serve as the trace exploration experts, and they are uniformly integrated into a pretrained backbone model to learn the detection prior and encourage expert interaction. By guiding multiple experts to align with each other and collaborate together, Forensic-MoE can integrate comprehensive and discriminative detection traces from multiple perspectives. Moreover, to improve the discrimination of each expert, a multi-stage structure is proposed for efficient trace perception, and a patch decentralization strategy is applied to encourage the model's attention on every local region. Extensive experiments demonstrate the …
Poster
Tomoyuki Suzuki · Kang-Jun Liu · Naoto Inoue · Kota Yamaguchi
[ Exhibit Hall I ]
Abstract
Designers craft and edit graphic designs in a layer representation, but layer-based editing becomes impossible once composited into a raster image. In this work, we propose LayerD, a method to decompose raster graphic designs into layers for a re-editable creative workflow. LayerD addresses the decomposition task by iteratively extracting unoccluded foreground layers and completing the background. We propose a simple yet effective refinement approach taking advantage of the assumption that layers often exhibit uniform appearance in graphic designs. As decomposition is ill-posed and ground-truth layer structure may not be reliable, we develop a metric that measures the quality of the decomposition. In experiments, we show that LayerD successfully achieves high-quality decomposition and outperforms baselines. We also demonstrate the use of LayerD with state-of-the-art image generators and layer-based editing.
Poster
Jonas Belouadi · Eddy Ilg · Margret Keuper · Hideki Tanaka · Masao Utiyama · Raj Dabre · Steffen Eger · Simone Paolo Ponzetto
[ Exhibit Hall I ]
Abstract
With the rise of generative AI, synthesizing figures from text captions becomes a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like Ti*k*Z, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting Ti*k*Zero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, Ti*k*Zero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models will be made publicly available.
Poster
Marc Lafon · Yannis Karmim · Julio Silva-Rodríguez · Paul Couairon · Clément Rambour · Raphael Fournier-Sniehotta · Ismail Ayed · Jose Dolz · Nicolas THOME
[ Exhibit Hall I ]
Abstract
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification.
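A simplified sketch of the uncertainty predictor described above: the visual embedding, the predicted class text embedding, and an image-conditioned textual representation (here, cross-attention from the image over all class text embeddings) are concatenated and fed to a binary correct-vs-incorrect classifier trained with a weighted BCE loss. Dimensions, the attention configuration, and the weighting value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyPredictor(nn.Module):
    """ViLU-style multimodal uncertainty head (simplified)."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, img_emb, pred_txt_emb, all_txt_emb):
        # img_emb: (B, D), pred_txt_emb: (B, D), all_txt_emb: (B, K, D)
        ctx, _ = self.attn(img_emb.unsqueeze(1), all_txt_emb, all_txt_emb)
        rep = torch.cat([img_emb, pred_txt_emb, ctx.squeeze(1)], dim=-1)
        return self.head(rep).squeeze(-1)          # logit: higher = likely correct

model = UncertaintyPredictor()
logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 100, 512))
labels = torch.randint(0, 2, (4,)).float()         # 1 = the VLM's prediction was correct
loss = F.binary_cross_entropy_with_logits(logits, labels, pos_weight=torch.tensor(2.0))
print(loss.item())
```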
Poster
Grzegorz Gruszczynski · Jakub Meixner · Michał Włodarczyk · Przemyslaw Musialski
[ Exhibit Hall I ]
Abstract
We propose a novel PDE-driven corruption process for generative image synthesis based on advection-diffusion processes which generalizes existing PDE-based approaches. Our forward pass formulates image corruption via a physically motivated PDE that couples directional advection with isotropic diffusion and Gaussian noise, controlled by dimensionless numbers (Péclet, Fourier). We implement this PDE numerically through a GPU-accelerated custom Lattice Boltzmann solver for fast evaluation. To induce realistic ``turbulence,'' we generate stochastic velocity fields that introduce coherent motion and multi-scale mixing. A diffusion model then learns to invert the advection-diffusion operator, reconstructing fine details from coarsely transported images and thus constituting a novel generative diffusion model. We discuss how previous methods emerge as specific cases (zero velocity or zero blur) of our operator, demonstrating that our advection-diffusion framework generalizes prior PDE-based diffusion techniques. This work bridges fluid dynamics, dimensionless PDE theory, and deep generative modeling, offering a fresh perspective on physically informed image corruption processes for diffusion-based synthesis.
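The abstract does not reproduce the PDE itself. A generic advection-diffusion corruption of an image intensity field $u(x, t)$ with a stochastic velocity field $\mathbf{v}$, diffusivity $D$, and Gaussian forcing would take the form below, together with the usual definitions of the Péclet and Fourier numbers; the paper's exact operator and nondimensionalization may differ.

```latex
% Generic advection–diffusion corruption of an image field u(x, t); illustrative form.
\frac{\partial u}{\partial t}
  + \underbrace{\mathbf{v}\cdot\nabla u}_{\text{advection}}
  = \underbrace{D\,\nabla^{2} u}_{\text{diffusion}}
  + \sigma\,\xi(x, t), \qquad \xi \sim \mathcal{N}(0, 1),
\qquad
\mathrm{Pe} = \frac{\lVert\mathbf{v}\rVert\, L}{D}, \qquad
\mathrm{Fo} = \frac{D\, t}{L^{2}}.
```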
Poster
Ziyi Liu · Zhe Xu · Jiabo MA · Wenqiang Li · Ruixuan Wang · Bo Du · Hao Chen
[ Exhibit Hall I ]
Abstract
Pathological images have been recognized as the gold standard for cancer diagnosis for more than a century. However, some internal regions of pathological images may inevitably exhibit various degradation issues, including low resolution, image blurring, and image noise, which will affect disease diagnosis, staging, and risk stratification. Existing pathological image restoration methods were mainly based on generative adversarial networks (GANs) to improve image quality, which are limited by inherent instability and loss of structural details, often resulting in artifacts in the restored images. The large scale of whole slide images (WSIs) also makes efficient processing and restoration difficult. To address these limitations, we propose a conditional visual autoregressive model (CVARPath) for next-scale token prediction, guided by the degraded tokens from the current scale. We introduce a novel framework that employs quantified encoders specifically designed for pathological image generation, which learns consistent sparse vocabulary tokens through self-supervised contrastive learning. Furthermore, our method efficiently compresses image patches into compact degraded sparse tokens at smaller scales and reconstructs high-quality large-scale whole slide images (WSIs). This is achieved using only an 8×8 vocabulary index for 256×256 images while maintaining minimal reconstruction loss. Experimental results demonstrate that our approach significantly enhances image quality, achieving an …
Poster
Haoyang Chen · Dongfang Sun · Caoyuan Ma · Shiqin Wang · Kewei Zhang · Zheng Wang · Zhixiang Wang
[ Exhibit Hall I ]
Abstract
We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes the dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) the huge modality gap between planar sketches and 3D priors in diffusion, and (3) sketch-quality-sensitive performance degradation. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectations in a training-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.
Poster
Wenxue Li · Tian Ye · Xinyu Xiong · Jinbin Bai · feilong tang · Wenxuan Song · Zhaohu Xing · Lie Ju · Guanbin Li · Lei Zhu
[ Exhibit Hall I ]
Abstract
Glass Surface Detection (GSD) is a critical task in computer vision, enabling precise interactions with transparent surfaces and enhancing both safety and object recognition accuracy. However, current research still faces challenges in both recognition performance and generalization capability. Thanks to recent advances in diffusion-based generative models, the GSD task can benefit from the rich prior knowledge encapsulated in the pre-trained Stable Diffusion (SD) model. Thus, in this paper, we present GlassWizard, which aims to harvest priors in diffusion-based models to achieve accurate and generalized GSD. Firstly, we delve into the text embedding space in SD to build a text-based context prior, thereby enhancing the understanding of the implicit attributes of glass and achieving fine-grained predictions. Secondly, we train an end-to-end diffusion model with a one-step formulation pipeline, yielding effective optimization and fast inference. In addition, to make our framework scalable to other multi-modal GSD tasks (such as RGB-D/RGB-T GSD), we present a modality-customized adaptation that enables rapid adaptation to multi-modal GSD tasks. Our experimental results demonstrate that our proposed framework achieves cutting-edge performance across diverse datasets, and it also shows strong generalization ability. Additionally, it excels in multi-modal GSD tasks, confirming its scalability across different modalities. The code will be publicly released.
Poster
Chen Yi Lu · Mehrab Tanjim · Ishita Dasgupta · Somdeb Sarkhel · Gang Wu · Saayan Mitra · Somali Chaterji
[ Exhibit Hall I ]
Abstract
We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots with an efficient beam-search algorithm guided by the LCA score. To train our model effectively with limited human annotations, we propose two tasks for the LCA encoder: Shot Coherence Learning, which uses contrastive learning to distinguish coherent and incoherent sequences, and Feature Regression, which converts these learned representations into a real-valued coherence score. We develop two variants: a base SKALD model that relies solely on visual coherence and SKALD-text, which integrates auxiliary text information when available. Experiments on the VSPD and our curated MSV3C datasets show that SKALD achieves an improvement of up to 48.6% in IoU and a 43% speedup over the state-of-the-art methods. A user study further validates our approach, with 45% of participants favoring SKALD-assembled videos, compared to 22% preferring text-based assembly methods.
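The assembly search described above can be sketched as a beam search over shot orderings scored by a coherence function; the scorer below is a dummy stand-in for the learned LCA score, and the function names are illustrative rather than the paper's implementation.

```python
def beam_search_assembly(shots, score_fn, target_len, beam_width=4):
    """Beam search over shot orderings under a coherence score.

    `score_fn(sequence) -> float` stands in for the learned LCA score.
    """
    beams = [((), 0.0)]
    for _ in range(target_len):
        candidates = []
        for seq, _ in beams:
            for shot in shots:
                if shot in seq:
                    continue
                new_seq = seq + (shot,)
                candidates.append((new_seq, score_fn(new_seq)))
        # keep only the top-k partial assemblies
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[1])

# toy usage: a dummy scorer that prefers consecutive shot indices
shots = list(range(6))
best_seq, best_score = beam_search_assembly(
    shots,
    score_fn=lambda seq: -sum(abs(b - a - 1) for a, b in zip(seq, seq[1:])),
    target_len=4,
)
```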
Poster
Mainak Singha · Subhankar Roy · Sarthak Mehrotra · Ankit Jha · Moloud Abdar · Biplab Banerjee · Elisa Ricci
[ Exhibit Hall I ]
Abstract
Textual prompt tuning adapts Vision-Language Models (e.g., CLIP) in federated learning by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. Post training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning often struggles with overfitting to known concepts and may be overly reliant on memorized text features, limiting its adaptability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP), which conditions the prompts on comprehensive contextual information -- image-conditioned features and textual attribute features of a class -- that is multimodal in nature. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through cross-attention, enabling richer contextual integration. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets spanning three generalization settings demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains when compared to state-of-the-art methods.
Poster
Mert Sonmezer · Matthew Zheng · Pinar Yanardag
[ Exhibit Hall I ]
Abstract
Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
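The selection problem described above can be illustrated with a greedy maximizer of a relevance-plus-coverage submodular objective over adapter embeddings; this is a generic facility-location-style sketch under assumed features, not the paper's exact objective.

```python
import numpy as np

def greedy_submodular_selection(embeddings, query, k=5, alpha=0.7):
    """Greedily pick k adapters that are relevant to the query yet diverse.

    embeddings: (n, d) unit-norm adapter descriptors; query: (d,) unit-norm.
    The objective (relevance plus facility-location coverage) is an assumption.
    """
    relevance = embeddings @ query
    pairwise = embeddings @ embeddings.T
    selected, covered = [], np.zeros(len(embeddings))
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i in range(len(embeddings)):
            if i in selected:
                continue
            new_covered = np.maximum(covered, pairwise[i])
            gain = alpha * relevance[i] + (1 - alpha) * (new_covered.sum() - covered.sum())
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        covered = np.maximum(covered, pairwise[best_i])
    return selected

# toy usage with random adapter/query embeddings
rng = np.random.default_rng(0)
adapters = rng.standard_normal((100, 32))
adapters /= np.linalg.norm(adapters, axis=1, keepdims=True)
query = rng.standard_normal(32)
query /= np.linalg.norm(query)
print(greedy_submodular_selection(adapters, query, k=5))
```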
Poster
Yi-Hsin Chen · Yi-Chen Yao · Kuan-Wei Ho · Chun-Hung Wu · Huu-Tai Phung · Martin Benjak · Jörn Ostermann · Wen-Hsiao Peng
[ Exhibit Hall I ]
Abstract
Most frame-based learned video codecs can be interpreted as recurrent neural networks (RNNs) propagating reference information along the temporal dimension. This work revisits the limitations of the current approaches from an RNN perspective. The output-recurrence methods, which propagate decoded frames, are intuitive but impose dual constraints on the output decoded frames, leading to suboptimal rate-distortion performance. In contrast, the hidden-to-hidden connection approaches, which propagate latent features within the RNN, offer greater flexibility but require large buffer sizes. To address these issues, we propose HyTIP, a learned video coding framework that combines both mechanisms. Our hybrid buffering strategy uses explicit decoded frames and a small number of implicit latent features to achieve competitive coding performance. Experimental results show that our HyTIP outperforms the sole use of either output-recurrence or hidden-to-hidden approaches. Furthermore, it achieves comparable performance to state-of-the-art methods but with a much smaller buffer size, and outperforms VTM 17.0 (Low-delay B) in terms of PSNR-RGB and MS-SSIM-RGB.
Poster
Kazuma Nagata · Naoshi Kaneko
[ Exhibit Hall I ]
Abstract
Automatic colorization methods for line drawings have been widely studied to reduce the labor cost of hand-drawn anime production. Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. Our method fuses low-resolution semantic features from foundation models with high-resolution spatial features from CNNs for fine-grained yet robust feature extraction. In contrast to previous methods that rely on Multiplex Transformer and support only one or two reference images, DACoN removes this constraint, allowing any number of references. Quantitative evaluations demonstrate the advantages of using multiple reference images, achieving superior colorization performance. Our code and model will be released upon acceptance.
Poster
Divyansh Srivastava · Xiang Zhang · He Wen · Chenru Wen · Zhuowen Tu
[ Exhibit Hall I ]
Abstract
We present Lay-Your-Scene (LayouSyn for short), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware Diffusion-Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn: First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
Poster
Jaemin Kim · Bryan Sangwoo Kim · Jong Ye
[ Exhibit Hall I ]
Abstract
Diffusion models have achieved impressive results in generative tasks for text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions trained for video, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess text-to-video alignment, we leverage stitching between video frames and use system prompts to capture sequential attributions. Our framework supports the flexible ensembling of multiple reward models to synergistically enhance alignment without significant computational overhead. Experimental results confirm that Free$^2$Guide, using image-trained LVLMs, significantly improves text-to-video alignment, thereby enhancing the overall video quality. Our results and code are available at https://free2guide.github.io/
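One way to picture guidance with a non-differentiable reward, in the spirit of the path-integral view above, is to score several candidate denoising continuations with the black-box reward and resample them with exponential weights; this toy sketch is only an interpretation, and the candidate construction and LVLM prompting used by Free$^2$Guide are not shown.

```python
import numpy as np

def reward_weighted_choice(candidates, reward_fn, temperature=0.1):
    """Resample candidate continuations according to exp(reward / temperature).

    `reward_fn` is a black-box scorer (e.g. a vision-language judge); no
    gradients are required.
    """
    rewards = np.array([reward_fn(c) for c in candidates])
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()
    idx = np.random.choice(len(candidates), p=weights)
    return candidates[idx], weights

# toy usage: candidates are vectors, reward prefers small norm
cands = [np.random.randn(8) for _ in range(4)]
chosen, w = reward_weighted_choice(cands, reward_fn=lambda x: -np.linalg.norm(x))
```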
Poster
Keming Wu · Junwen Chen · Zhanhao Liang · Yinuo Wang · Ji Li · Chao Zhang · Bin Wang · Yuhui Yuan
[ Exhibit Hall I ]
Abstract
Text-to-image generation models often struggle to interpret spatially aware text prompts effectively. To overcome this, existing approaches typically require millions of high-quality semantic layout annotations consisting of bounding boxes and regional prompts. This paper shows that large amounts of regional prompts are not necessary for the latest diffusion transformers such as SD3 or FLUX. We propose an efficient hybrid layout framework for diffusion transformers. Our approach drastically reduces the need for extensive layout annotations and minimizes reliance on regional prompt annotations—incurring only minimal additional computational cost during inference—while maintaining high-quality layout adherence. Our key insight is to break the layout-control task into two sequential stages: first, generating the target objects within the designated regions specified by an anonymous layout, and second, refining these outputs to ensure they strictly adhere to the regional prompts in the semantic layout. Building on this insight, we propose a hybrid layout control scheme that first fine-tunes the DiTs (e.g., SD3) to follow an anonymous layout, then continues fine-tuning the DiTs to follow the semantic layout, and finally includes a quality-tuning stage to enhance visual aesthetics. We show that this hybrid design is highly data-efficient, as we find only using a small amount of semantic layout …
Poster
Tao Han · Wanghan Xu · Junchao Gong · Xiaoyu Yue · Song Guo · Luping Zhou · LEI BAI
[ Exhibit Hall I ]
Abstract
Arbitrary resolution image generation provides a consistent visual experience across devices and has extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays of over 100 seconds. To solve this, we explore a second-stage generation built upon latent diffusion models: the fixed-size latent generated by the diffusion model is regarded as the content representation, and we propose to decode arbitrary resolution images from this compact latent using a one-step generator. Thus, we present \textbf{InfGen}, which replaces the VAE decoder with the new generator to produce images at any resolution from a fixed-size latent without retraining the diffusion model; this simplifies the process, reduces computational complexity, and can be applied to any model that uses the same latent space. Experiments show that InfGen is capable of bringing many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds. The demo page, code, and pre-trained models are available at: \url{https://anonymous.4open.science/r/InfGen-7257}.
Poster
Yazhou Xing · Yang Fei · Yingqing He · Jingye Chen · Pengjun Fang · Xiaowei Chi · Qifeng Chen
[ Exhibit Hall I ]
Abstract
Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation results in temporal inconsistencies and fails to compress temporal redundancy effectively. Existing works on video VAEs compress temporal redundancy but struggle to handle videos with large motion effectively, suffering from issues such as severe image blur and loss of detail in scenarios with large motion. In this paper, we present a powerful video VAE named VideoVAE+ that effectively reconstructs videos with large motion. First, we investigate two architecture choices and propose our simple yet effective architecture with better spatiotemporal joint modeling performance. Second, we propose to leverage the textual information in existing text-to-video datasets and incorporate text guidance during training. The textual guidance is optional during inference. We find that this design enhances the reconstruction quality and preservation of detail. Finally, our model achieves strong performance compared with various baseline approaches on both general videos and large-motion videos, demonstrating its effectiveness in challenging large-motion scenarios.
Poster
Zihan Ding · Chi Jin · Difan Liu · Haitian Zheng · Krishna Kumar Singh · Qiang Zhang · Yan Kang · Zhe Lin · Yuchen Liu
[ Exhibit Hall I ]
Abstract
Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model’s diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to the teacher model using 50-step DDIM sampling.
Poster
Grace Luo · Jonathan Granskog · Aleksander Holynski · Trevor Darrell
[ Exhibit Hall I ]
Abstract
Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes.
Poster
Inzamamul Alam · Md Islam · Simon Woo · Khan Muhammad
[ Exhibit Hall I ]
Abstract
Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations, primarily including distortions, image regeneration, and adversarial perturbations, creating real-world challenges. In this work, we introduce SpecGuard, a novel approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting spatial features to the frequency domain, using a spectral projection of a higher-frequency band obtained through wavelet decomposition. The spectral projection employs a Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval’s theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard in terms of the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate that SpecGuard outperforms state-of-the-art models.
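A toy example of frequency-domain embedding with a strength factor is shown below; it uses a plain FFT on a single image and a non-blind extractor, and does not reproduce SpecGuard's wavelet decomposition, hidden convolution layers, or learned Parseval-based decoder.

```python
import numpy as np

def embed_spectral_watermark(img, bits, strength=8.0, seed=0):
    """Add a strength-scaled bit pattern to high-frequency FFT coefficients."""
    F = np.fft.fft2(img)
    h, w = img.shape
    rng = np.random.default_rng(seed)
    flat = rng.choice((h // 2) * (w // 2), size=len(bits), replace=False)
    rows = h // 2 + flat // (w // 2)   # fixed pseudo-random high-frequency slots
    cols = w // 2 + flat % (w // 2)
    F[rows, cols] += strength * (2 * np.asarray(bits) - 1)
    return np.real(np.fft.ifft2(F))

def extract_spectral_watermark(img_wm, img_ref, n_bits, seed=0):
    """Non-blind extraction by comparing against the reference spectrum."""
    diff = np.fft.fft2(img_wm) - np.fft.fft2(img_ref)
    h, w = img_wm.shape
    rng = np.random.default_rng(seed)
    flat = rng.choice((h // 2) * (w // 2), size=n_bits, replace=False)
    rows, cols = h // 2 + flat // (w // 2), w // 2 + flat % (w // 2)
    return (np.real(diff[rows, cols]) > 0).astype(int)

img = np.random.rand(128, 128)
bits = np.random.randint(0, 2, 32)
watermarked = embed_spectral_watermark(img, bits)
assert (extract_spectral_watermark(watermarked, img, 32) == bits).all()
```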
Poster
SeungHoo Hong · GeonHo Son · Juhun Lee · Simon Woo
[ Exhibit Hall I ]
Abstract
Diffusion models have been shown to be strong representation learners, showcasing state-of-the-art performance across multiple domains. Aside from accelerated sampling, DDIM also enables the inversion of real images back to their latent codes. A directly inherited application of this inversion operation is real image editing, where the inversion yields latent trajectories to be utilized during the synthesis of the edited image. Unfortunately, this practical tool has enabled malicious users to freely synthesize misinformative or deepfake content with greater ease, which promotes the spread of unethical, abusive, privacy-infringing, and copyright-infringing content. While defensive algorithms such as AdvDM and Photoguard have been shown to disrupt the diffusion process on these images, the misalignment between their objectives and the iterative denoising trajectory at test time results in weak disruptive performance. In this work, we present the \textbf{D}DIM \textbf{I}nversion \textbf{A}ttack (DIA) that attacks the integrated DDIM trajectory path. Our results support the effective disruption, surpassing previous defensive methods across various editing methods. We believe that our frameworks and results can provide practical defense methods against the malicious use of AI for both the industry and the research community. Our code is available here: \url{https://anonymous.4open.science/r/DIA-13419/}.
Poster
Viet Nguyen · Anh Nguyen · Trung Dao · Khoi Nguyen · Cuong Pham · Toan Tran · Anh Tran
[ Exhibit Hall I ]
Abstract
The escalating demand for real-time image synthesis has driven significant advancements in one-step diffusion models, which inherently offer expedited generation speeds compared to traditional multi-step methods. However, this enhanced efficiency is frequently accompanied by a compromise in the controllability of image attributes. While negative prompting, typically implemented via classifier-free guidance (CFG), has proven effective for fine-grained control in multi-step models, its application to one-step generators remains largely unaddressed. Due to the lack of iterative refinement, as in multi-step diffusion, directly applying CFG to one-step generation leads to blending artifacts and diminished output quality. To fill this gap, we introduce Negative-Away Steer Attention (NASA), a training-free method that integrates negative prompts into one-step diffusion models. NASA operates within the intermediate representation space by leveraging cross-attention mechanisms to suppress undesired visual attributes. This strategy avoids the blending artifacts inherent in output-space guidance and achieves high efficiency, incurring only a minimal 1.89\% increase in FLOPs compared to the computational doubling of CFG. Furthermore, NASA can be seamlessly integrated into existing timestep distillation frameworks, enhancing the student's output quality. Experimental results demonstrate that NASA substantially improves controllability and output quality, achieving an HPSv2 score of 31.21, setting a new state-of-the-art benchmark for one-step diffusion …
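The core idea of steering an intermediate representation away from a negative prompt can be sketched as below; the exact place and form of NASA's update inside the one-step generator is not specified here, and the names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def negative_away_cross_attention(q, k_pos, v_pos, k_neg, v_neg, alpha=0.5):
    """Cross-attend to the positive prompt, then subtract an alpha-scaled
    component attended from the negative prompt (a hedged sketch of the idea)."""
    d = q.shape[-1]
    out_pos = softmax(q @ k_pos.T / np.sqrt(d)) @ v_pos
    out_neg = softmax(q @ k_neg.T / np.sqrt(d)) @ v_neg
    return out_pos - alpha * out_neg

# toy usage with random queries and prompt embeddings
q = np.random.randn(16, 64)
k_pos, v_pos = np.random.randn(77, 64), np.random.randn(77, 64)
k_neg, v_neg = np.random.randn(77, 64), np.random.randn(77, 64)
out = negative_away_cross_attention(q, k_pos, v_pos, k_neg, v_neg)
```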
Poster
Tianyang Xue · Lin Lu · Yang Liu · Mingdong Wu · Hao Dong · Yanbin Zhang · Renmin Han · Baoquan Chen
[ Exhibit Hall I ]
Abstract
2D irregular packing is a classic combinatorial optimization problem with various applications, such as material utilization and texture atlas generation. Due to its NP-hard nature, conventional numerical approaches typically encounter slow convergence and high computational costs. Previous research (GFPack) introduced a generative method for gradient-based packing, providing early evidence of its feasibility but faced limitations such as insufficient rotation support, poor boundary adaptability, and high overlap ratios. In this paper, we propose GFPack++, a deeply investigated framework that adopts attention-based geometry and relation encoding, enabling more comprehensive modeling of complex packing relationships. We further design a constrained gradient and a weighting function to enhance both the feasibility of the produced solutions and the learning effectiveness. Experimental results on multiple datasets demonstrate that GFPack++ achieves higher space utilization, supports continuous rotation, generalizes well to arbitrary boundaries, and infers orders of magnitude faster than previous approaches. We plan to release our code and datasets to advance further research in 2D irregular packing.
Poster
Ting Yao · Yehao Li · Yingwei Pan · Zhaofan Qiu · Tao Mei
[ Exhibit Hall I ]
Abstract
Autoregressive models are just at a tipping point where they could really take off for visual generation. In this paper, we propose to model token prediction using a diffusion procedure, particularly in masked autoregressive models for image generation. We look into the problem from two critical perspectives: progressively refining the prediction of unmasked tokens via a denoising head with the autoregressive model, and representing the probability distribution of masked tokens by capitalizing on the interdependency across masked and unmasked tokens through a diffusion head. Our proposal retains the speed advantage of sequence prediction while leveraging the principles of the denoising diffusion process to generate high-quality samples. Extensive experiments on both class-conditional and text-to-image tasks demonstrate its superiority, achieving state-of-the-art FID scores of 1.47 and 5.27 on the ImageNet and MSCOCO datasets, respectively. More remarkably, our approach leads to a 45\% speedup in the inference time of image generation against diffusion models such as DiT-XL/2.
Poster
Yecheng Wu · Han Cai · Junyu Chen · Zhuoyang Zhang · Enze Xie · Jincheng YU · Junsong Chen · Jinyi Hu · Yao Lu · Song Han
[ Exhibit Hall I ]
Abstract
We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT, a deep compression hybrid tokenizer for AR models that achieves a 32$\times$ spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of $\textbf{5.49}$ on MJHQ-30K and an overall score of $\textbf{0.69}$ on GenEval, while offering $\textbf{1.5-7.9}\times$ higher throughput and $\textbf{2.0-3.5}\times$ lower latency compared to prior leading diffusion and masked autoregressive models. We will release the code and pre-trained models upon publication.
Poster
Léopold Maillard · Tom Durand · Adrien RAMANANA RAHARY · Maks Ovsjanikov
[ Exhibit Hall I ]
Abstract
Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on- and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.
Poster
Prasen Kumar Sharma · Neeraj Matiyali · Siddharth Srivastava · Gaurav Sharma
[ Exhibit Hall I ]
Abstract
We introduce Preserve Anything, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail (i) to preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions. Key components of our framework include object preservation and background guidance modules, enforcing lighting consistency and a high-frequency overlay module to retain fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and allows a complete evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. We also conducted a user study to demonstrate the efficacy of the proposed work …
Poster
XUN WU · Shaohan Huang · Lingjie Jiang · Furu Wei
[ Exhibit Hall I ]
Abstract
Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. However, we identify two potential risks for existing DPO algorithms: First, current DPO methods for estimating the rewards of step-wise intermediate samples are biased, leading to inaccurate preference ordering for step-wise optimization. Second, existing DPO methods may inadvertently increase the sampling probabilities of dispreferred samples, potentially introducing application risks. To address these issues, we propose Revised Direct Preference Optimization (RDPO), a simple but effective step-wise DPO-based text-to-image diffusion model alignment method. By designing a more theoretically grounded and efficient intermediate-step reward estimation and introducing an additional regularization term to constrain the sampling probability of dispreferred samples, RDPO can achieve more effective and stable text-to-image alignment performance. Our experiments on two datasets, with base models including Stable Diffusion v1.5 and SDXL, demonstrate that RDPO can effectively learn and construct reward signals for each step of the model, improving alignment performance while ensuring better generalization.
Poster
Carl Olsson · Yaroslava Lochman · Johan Malmport · Christopher Zach
[ Exhibit Hall I ]
Abstract
Rotation averaging is a key subproblem in applications of computer vision and robotics. Many methods for solving this problem exist, and there are also several theoretical results analyzing difficulty and optimality. However, one aspect that most of these have in common is a focus on the isotropic setting, where the intrinsic uncertainties in the measurements are not fully incorporated into the resulting optimization task. Recent empirical results suggest that moving to an anisotropic framework, where these uncertainties are explicitly included, can result in an improvement of solution quality. However, global optimization for rotation averaging has remained a challenge in this scenario. In this paper we show how anisotropic costs can be incorporated in certifiably optimal rotation averaging. We also demonstrate how existing solvers, designed for isotropic situations, fail in the anisotropic setting. Finally, we propose a stronger relaxation and show empirically that it is able to recover global optima in all tested datasets and leads to a more accurate reconstruction in all but one of the scenes.
Poster
Ava Pun · Kangle Deng · Ruixuan Liu · Deva Ramanan · Changliu Liu · Jun-Yan Zhu
[ Exhibit Hall I ]
Abstract
We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during auto-regressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method, enabling us to generate colored and textured designs. We show that our designs can be assembled by humans manually as well as by robotic arms automatically. Upon publication, we will release our new dataset, StableText2Lego, which contains over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models.
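The validity check and rollback during autoregressive inference can be pictured with the toy loop below; `propose_fn` and `is_valid_fn` are placeholders for the language model's next-brick proposal and the physics/assembly check, not the paper's actual components.

```python
import random

def generate_with_rollback(propose_fn, is_valid_fn, max_bricks=20,
                           max_retries=8, max_steps=500):
    """Autoregressive generation that rejects infeasible proposals and rolls back."""
    design, steps = [], 0
    while len(design) < max_bricks and steps < max_steps:
        steps += 1
        for _ in range(max_retries):
            brick = propose_fn(design)
            if is_valid_fn(design + [brick]):
                design.append(brick)
                break
        else:
            if not design:
                break
            design.pop()  # roll back the previous brick and try again
    return design

# toy usage: "bricks" are (x, y, z) cells; valid = no duplicates and supported
def propose(design):
    return (random.randint(0, 3), random.randint(0, 3), len(design) % 4)

def valid(design):
    if len(set(design)) != len(design):
        return False
    x, y, z = design[-1]
    return z == 0 or any(b[2] == z - 1 for b in design[:-1])

print(generate_with_rollback(propose, valid))
```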
Poster
Imad Eddine MAROUF · Enzo Tartaglione · Stéphane Lathuilière · Joost van de Weijer
[ Exhibit Hall I ]
Abstract
Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across both visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the current task’s answer space, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA. The source code, provided in the supplementary material, will be publicly released upon acceptance.
Poster
Tuna Meral · Enis Simsar · Federico Tombari · Pinar Yanardag
[ Exhibit Hall I ]
Abstract
Low-Rank Adaptation (LoRA) has emerged as a powerful and popular technique for personalization, enabling efficient adaptation of pre-trained image generation models for specific tasks without comprehensive retraining. While employing individual pre-trained LoRA models excels at representing single concepts, such as those representing a specific dog or a cat, utilizing multiple LoRA models to capture a variety of concepts in a single image still poses a significant challenge. Existing methods often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to scenarios where one concept may be completely ignored (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). We introduce CLoRA, a training-free approach that addresses these limitations by updating the attention maps of multiple LoRA models at test-time, and leveraging the attention maps to create semantic masks for fusing latent representations. This enables the generation of composite images that accurately reflect the characteristics of each LoRA. Our comprehensive qualitative and quantitative evaluations demonstrate that CLoRA significantly outperforms existing methods in multi-concept image generation using LoRAs.
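The fusion step can be illustrated as assigning each spatial location to the concept whose attention is strongest and copying that LoRA's latent there; this sketch omits CLoRA's test-time attention-map updates and uses illustrative shapes.

```python
import numpy as np

def fuse_latents_with_attention_masks(latents, attention_maps):
    """Fuse per-LoRA latents using masks derived from their attention maps.

    latents: (n_concepts, C, H, W); attention_maps: (n_concepts, H, W).
    """
    assignment = attention_maps.argmax(axis=0)                     # (H, W)
    masks = np.stack([assignment == i for i in range(len(latents))])
    return (latents * masks[:, None, :, :]).sum(axis=0)

latents = np.random.randn(2, 4, 64, 64)   # e.g. one latent per LoRA concept
attn = np.random.rand(2, 64, 64)          # per-concept attention maps
fused = fuse_latents_with_attention_masks(latents, attn)
```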
Poster
Hui Li
[ Exhibit Hall I ]
Abstract
Generative AI (GenAI), which revolutionized both computer vision and natural language processing, has drawn continuous attention recently. Benefiting from GenAI and the evolution of large language models (LLMs), the image generation task has evolved from prompt-based to dialogue-based, taking in real-world human intent expressed through conversations. When breaking this task into multiple steps, the best pathway for analyzing the dialogues is not determined, such as whether the first step of dialogue analysis should focus on the objects or on the prompt template. Thus, multi-chain reasoning is required to decompose this application beyond a pure chain-of-thought structure. After the divergent process, the question becomes how to converge the thinking chains toward the best-matched image, which requires a new evaluation method to lead the thinking process. To address these challenges, we propose the LLM Thought Divergence and Convergence (LTDC) framework, which simulates human cognitive processes through three phases: (1) The Step-by-Step Thought process decomposes dialogue-based image generation tasks into sequential thinking chains using LLMs; (2) The Image Generation process creates image prompts following these thought instructions and produces corresponding images; (3) The Evaluation process aligns the coherence between generated images and dialogues through a multi-modal LLM, guiding the …
Poster
Yukang Cao · Chenyang Si · Jinghao Wang · Ziwei Liu
[ Exhibit Hall I ]
Abstract
We present **FreeMorph**, the first tuning-free method for image morphing that accommodates inputs with varying semantics or layouts. Unlike existing methods, which rely on fine-tuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without extensive training. Despite its efficiency and potential, tuning-free methods still face challenges in maintaining high-quality image morphing due to the non-linear nature of the multi-step denoising process and bias inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address this challenge by integrating two key innovations.**1)** We first propose a **guidance-aware spherical interpolation** design that incorporates the explicit guidance from the input images by modifying the self-attention modules, addressing identity loss, and ensuring directional transitions throughout the generated sequences. **2)** We further introduce a **step-oriented variation trend** that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both input images. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods with training that is 10X ~ 50X faster, establishing a new state-of-the-art for image morphing. The code will be released.
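For reference, the spherical interpolation that the guidance-aware design builds on is the standard slerp between two latent or noise tensors; FreeMorph's self-attention modifications are not reproduced here.

```python
import numpy as np

def slerp(z0, z1, t, eps=1e-8):
    """Spherical linear interpolation between two latent/noise tensors."""
    z0_n = z0 / (np.linalg.norm(z0) + eps)
    z1_n = z1 / (np.linalg.norm(z1) + eps)
    omega = np.arccos(np.clip(np.dot(z0_n.ravel(), z1_n.ravel()), -1.0, 1.0))
    if omega < eps:                      # nearly parallel: fall back to lerp
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# interpolate a short morphing trajectory between two Gaussian latents
z_a, z_b = np.random.randn(4, 64, 64), np.random.randn(4, 64, 64)
trajectory = [slerp(z_a, z_b, t) for t in np.linspace(0, 1, 8)]
```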
Poster
Yabo Zhang · xinpeng zhou · Yihan Zeng · Hang Xu · Hui Li · Wangmeng Zuo
[ Exhibit Hall I ]
Abstract
Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so they necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, thereby inheriting powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across a variety of editing signals: it dominantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, e.g., automatically adjusting the reflection of a cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, …
Poster
Zhongdao Wang · Guodongfang Zhao · Jingjing Ren · bailan feng · Shifeng Zhang · Wenbo Li
[ Exhibit Hall I ]
Abstract
Diffusion-based generative models have demonstrated exceptional promise in super-resolution (SR) tasks, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. When the input is video, the problem becomes even more pronounced. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: **(1)** We employ an autoencoder with a high compression ratio of 32$\times$32$\times$8 to reduce the number of tokens. **(2)** Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. **(3)** We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods, while being 100+ times faster, taking only 7 seconds to process a 2-second long 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes …
Poster
Runze He · bo cheng · Yuhang Ma · QingxiangJia QingxiangJia · Shanyuan Liu · Ao Ma · Xiaoyu Wu · Liebucha Wu · Dawei Leng · Yuhui Yin
[ Exhibit Hall I ]
Abstract
In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential.
Poster
Jingjing Ren · Wenbo Li · Zhongdao Wang · Haoze Sun · Bangzhen Liu · Haoyu Chen · Jiaqi Xu · Aoxue Li · Shifeng Zhang · Bin Shao · Yong Guo · Lei Zhu
[ Exhibit Hall I ]
Abstract
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced …
Poster
Taihang Hu · Linxuan Li · Kai Wang · Yaxing Wang · jian Yang · Ming-Ming Cheng
[ Exhibit Hall I ]
Abstract
Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (*ISLock*), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, *ISLock* preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (*ATM*) protocol. By implicitly enforcing structural consistency in latent space, our method *ISLock* enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that *ISLock* achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models.
Poster
Xixi Hu · Runlong Liao · Bo Liu · Keyang Xu · Yeqing Li · Eugene Ie · Hongliang Fei · qiang liu
[ Exhibit Hall I ]
Abstract
Rectified Flow offers a simple and effective approach to high-quality generative modeling by learning a velocity field. However, we identify a limitation in directly modeling the velocity with an unconstrained neural network: the learned velocity often fails to satisfy certain boundary conditions, leading to inaccurate velocity field estimations that deviate from the desired ODE. This issue is particularly critical during stochastic sampling at inference, as the score function's errors are amplified near the boundary. To mitigate this, we propose a Boundary-enforced Rectified Flow Model (Boundary RF Model), in which we enforce boundary conditions with a minimal code modification. The Boundary RF Model improves performance over the vanilla RF model, demonstrating an 8.01% improvement in FID score on ImageNet using ODE sampling and an 8.98% improvement using SDE sampling.
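For context, the vanilla rectified flow objective referred to above trains a velocity field on linear interpolants between a noise sample $x_0$ and a data sample $x_1$ (a standard formulation, not the paper's boundary-enforced parameterization): $$x_t = (1-t)\,x_0 + t\,x_1, \qquad \min_{\theta}\; \mathbb{E}_{x_0, x_1, t}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2 .$$ An unconstrained $v_\theta$ need not satisfy the conditions the true velocity obeys at the endpoints $t \to 0$ and $t \to 1$; how those boundary conditions are enforced with a minimal code modification is the paper's contribution and is not reproduced here.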
Poster
Yunze Tong · Fengda Zhang · Didi Zhu · Jun Xiao · Kun Kuang
[ Exhibit Hall I ]
Abstract
The fundamental requirement for text-to-image generation is aligning the generated images with the provided text. With large-scale data, pre-trained Stable Diffusion (SD) models have achieved remarkable performance in this task. These models process an input prompt as text control, guiding a vision model to perform denoising operations that recover a clean image from pure noise. However, we observe that when there is correlation among text tokens, SD’s generated images fail to accurately represent the semantics of the input prompt: simple yet crucial objects may be omitted, thereby disrupting text-image alignment. We refer to this problem as *"object omission"*. Without additional external knowledge, previous methods have been ineffective at addressing this issue. To investigate this problem, we analyze the attention maps in SD and find that biased text representations mislead the visual denoising process when handling correlated tokens, impeding object generation. Moreover, we observe that even when two prompts share the same semantics, slight variations in token sequence significantly alter attention scores, consequently affecting the final generated images. Based on these findings, we propose a simple yet effective fine-tuning method that applies decorrelation to the self-attention maps in the text module, thus reducing dependencies between tokens. Our approach requires no external …
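One plausible form of such a decorrelation penalty, included only as an assumption since the abstract does not specify the objective, is to penalize off-diagonal correlations between the per-token attention maps:

```python
import torch

def attention_decorrelation_loss(attn_maps):
    """Penalize correlation between the attention maps of different tokens.

    attn_maps: (n_tokens, H, W). Off-diagonal Gram-matrix energy is one
    plausible decorrelation penalty; the paper's exact loss may differ.
    """
    flat = torch.nn.functional.normalize(attn_maps.flatten(1), dim=1)
    gram = flat @ flat.t()                                  # token-token correlation
    off_diag = gram - torch.diag(torch.diagonal(gram))
    return (off_diag ** 2).mean()

loss = attention_decorrelation_loss(torch.rand(8, 16, 16))
```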
Poster
Wenqi Ouyang · Zeqi Xiao · Danni Yang · Yifan Zhou · Shuai Yang · Lei Yang · Jianlou Si · Xingang Pan
[ Exhibit Hall I ]
Abstract
Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations.
Poster
Haoming Cai · Tsung-Wei Huang · Shiv Gehlot · Brandon Feng · Sachin Shah · Guan-Ming Su · Christopher Metzler
[ Exhibit Hall I ]
Abstract
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training—no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
Poster
Yadong Qu · Shancheng Fang · Yuxin Wang · Xiaorui Wang · Zhineng Chen · Hongtao Xie · Yongdong Zhang
[ Exhibit Hall I ]
Abstract
Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at the image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate editable multimodal layers from only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of an MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. Notably, IGD is the first method to combine creativity with the ability to generate editable multimodal layers. The superior experimental results demonstrate that IGD offers a new solution for graphic design.
Poster
Mengchen Zhang · Tong Wu · Jing Tan · Ziwei Liu · Gordon Wetzstein · Dahua Lin
[ Exhibit Hall I ]
Abstract
Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements for camera control and filmmaking. Our code and data …
Poster
Tianrun Xu · Guanyu Chen · Ye Li · Xi Yuxin · Zeyu Mu · Ruichen Wang · Tianren Zhang · Haichuan Gao · Feng Chen
[ Exhibit Hall I ]
Abstract
Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotations or closed-source models with optimal performance, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model’s own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a 1.4M-image dataset. We further utilize this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform various vision-related tasks. Our OURO model, fine-tuned on Qwen2-VL-7B-Instruct using LoRA, achieves substantial improvements over both the base model and similarly-sized counterparts across multiple multimodal benchmarks. The results demonstrate the effectiveness of our method in advancing scene understanding and multimodal reasoning. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets will be released upon acceptance.
Poster
Yu-Ju Tsai · Brian Price · Qing Liu · Luis Figueroa · Daniil Pakhomov · Zhihong Ding · Scott Cohen · Ming-Hsuan Yang
[ Exhibit Hall I ]
Abstract
Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques.
Poster
Xingjian Leng · Jaskirat Singh · Yunzhong Hou · Zhenchang Xing · Saining Xie · Liang Zheng
[ Exhibit Hall I ]
Abstract
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective, causing the VAE to converge to trivial solutions and degrading final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the encoder and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over $17\times$ and $45\times$ compared to the REPA and vanilla training recipes, respectively. Interestingly, we observe that once tuned through end-to-end training, the VAE can be reused for downstream generation tasks, exhibiting significantly accelerated generation performance across diverse diffusion architectures and training settings.
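A minimal sketch of how a REPA-style alignment term can be added to the standard diffusion loss so that gradients also reach the VAE encoder. The module interfaces (`vae`, `dit` returning a prediction plus an intermediate feature, `vision_encoder`, `proj`) and the simplified noising step are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def repa_e_step(x, t, vae, dit, proj, vision_encoder, lam=0.5):
    """One end-to-end step: standard diffusion loss plus a REPA-style
    alignment term, with gradients reaching both the VAE encoder and the DiT."""
    z = vae.encode(x)                            # latent from the trainable VAE encoder
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise                # simplified forward (noising) process

    noise_pred, feat = dit(z_t, t)               # assumed: DiT returns prediction + mid features
    loss_diff = F.mse_loss(noise_pred, noise)    # standard denoising objective

    with torch.no_grad():
        target = vision_encoder(x)               # frozen self-supervised features of the clean image
    aligned = proj(feat)                         # small projector onto the target feature space
    loss_repa = 1 - F.cosine_similarity(
        aligned.flatten(1), target.flatten(1), dim=-1).mean()

    return loss_diff + lam * loss_repa           # backprop tunes VAE, DiT, and projector jointly
```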
Poster
Xuan-Hao Liu · Bao-liang Lu · Wei-Long Zheng
[ Exhibit Hall I ]
Abstract
Generating high fidelity video from brain activity is an important milestone in brain decoding research. Previous works were mostly based on functional Magnetic Resonance Imaging (fMRI), whose low temporal resolution limits its ability to faithfully reflect rapid brain activity, motivating us to turn to high temporal resolution brain signals like electroencephalography (EEG). However, EEG-to-video is challenging due to the complexity and nonstationarity of EEG signals and the scarcity of data annotations. Addressing these issues, we present **EEGMirror**. Firstly, we adopt neural quantization for converting nonstationary raw EEG signals into robust discrete representations. Afterwards, a masked self-supervision method with montage-agnostic position embedding (MAPE) is introduced. With MAPE, EEGMirror can process EEG data with various montages (number and position of channels) and thus can flexibly leverage different EEG datasets to acquire an effective EEG encoder, mitigating the lack of well-annotated EEG data. Next, multimodal contrastive learning is applied to align the brain modality with dynamic changes and semantic information. Lastly, a fine-tuned inflated Stable Diffusion model is adopted to reconstruct video stimuli guided by visual and semantic information decoded from EEG signals. We show that EEGMirror outperforms the state of the art at both the semantic (82.1\% vs 79.8\%) and pixel (0.261 vs 0.256) levels. An …
Poster
shangwen zhu · Han Zhang · Zhantao Yang · Qianyu Peng · Zhao Pu · Huangji Wang · Fan Cheng
[ Exhibit Hall I ]
Abstract
Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship between the outputs of the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel $\textbf{training-free}$ acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Because it makes no specific assumptions about the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of $\mathbf{1.67\times}$ in Stable Diffusion v2 and a speedup of $\mathbf{1.55\times}$ in video generation models. When combined with distillation …
Poster
Zerui Gong · Zhonghua Wu · Qingyi Tao · Qinyue Li · Chen Change Loy
[ Exhibit Hall I ]
Abstract
Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. SA-LUT features: (1) a Style-guided 4D LUT Generator that extracts multi-scale features from the style image to predict a 4D LUT, and (2) a Context Generator using content-style cross-attention to produce a context map. This context map enables spatially-adaptive adjustments, allowing our 4D LUT to apply precise color transformations while preserving structural integrity. To establish a rigorous evaluation framework for photorealistic style transfer, we introduce PST50, the first benchmark specifically designed for PST assessment. Experiments demonstrate that SA-LUT substantially outperforms state-of-the-art methods, achieving a 66.7% reduction in LPIPS score compared to 3D LUT approaches, while maintaining real-time performance at 16 FPS for video stylization. Code and benchmark will be released.
Poster
Jingyi Lu · Kai Han
[ Exhibit Hall I ]
Abstract
Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512×512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance.
Poster
Songhua Liu · Ruonan Yu · Xinchao Wang
[ Exhibit Hall I ]
Abstract
Given a source image, personalized text-to-image generation produces images preserving the identity and appearance while following the text prompts. Existing methods heavily rely on test-time optimization to achieve this customization. Although some recent works are dedicated to zero-shot personalization, they still require re-training when applied to different text-to-image diffusion models. In this paper, we instead propose a model-agnostic personalized method termed UniversalBooth. At the heart of our approach lies a novel cross-attention mechanism, where different blocks in the same diffusion scale share common square mappings for key and value, which decouples the image feature encoder from the diffusion architecture while maintaining its effectiveness. Moreover, the cross-attention performs hierarchically: the holistic attention first captures the global semantics of user inputs for textual combination with editing prompts, and the fine-grained attention divides the holistic attention scores for various local patches to enhance appearance consistency. To improve the performance when deployed on unseen diffusion models, we further devise an optimal transport prior for the model and encourage the attention scores allocated by cross-attention to fulfill the optimal transport constraint. Experiments demonstrate that our personalized generation model can be generalized to unseen text-to-image diffusion models with a wide spectrum of architectures and functionalities without …
Poster
Haoxuan Wang · Jinlong Peng · Qingdong He · Hao Yang · Ying Jin · Jiafu Wu · Xiaobin Hu · Yanjie Pan · Zhenye Gan · Mingmin Chi · Bo Peng · Yabiao Wang
[ Exhibit Hall I ]
Abstract
With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance. Our code and dataset will be released soon.
Poster
Yuanrui Wang · Cong Han · Yafei Li · Zhipeng Jin · Xiawei Li · Sinan Du · Wen Tao · Yi Yang · shuanglong li · Chun Yuan · LIU LIN
[ Exhibit Hall I ]
Abstract
Text-to-image generation has transformed content creation, yet precise visual text rendering remains challenging for generative models due to blurred glyphs, semantic inconsistencies, and limited style controllability. Current methods typically employ pre-rendered glyph images as conditional inputs, but their inability to preserve original font styles and color information forces reliance on multi-branch architectures to compensate for missing details. This leads to increased model complexity, higher computational costs, and reduced reusability. To address these limitations, we propose a segmentation-guided framework that leverages pixel-level visual text segmentation masks—complete representations preserving glyph shapes, colors, and spatial details—as unified conditional inputs. Our approach integrates two key innovations: (1) a fine-tuned bilingual segmentation model for extracting precise text masks from source images, and (2) a streamlined diffusion model enhanced with adaptive glyph condition and glyph region loss to ensure semantic and stylistic fidelity. On the AnyText-benchmark, our method achieves a sentence accuracy (Sen.Acc) of 0.8267 and a Normalized Edit Distance (NED) of 0.8976 for Chinese text generation, while the English test set delivers even stronger performance with 0.9018 Sen.Acc and 0.9582 NED, surpassing prior methods by substantial margins. To address broader evaluation needs, we introduce two novel benchmarks: GlyphMM-benchmark (for holistic glyph consistency assessment) and MiniText-benchmark (targeting …
Poster
Sherry Chen · Yi Wei · Luowei Zhou · Suren Kumar
[ Exhibit Hall I ]
Abstract
Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach and scorer for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0706 (+17.48%) gain in score correlation with human ratings on AURORA-Bench and improving pair-wise comparison accuracy by 3.48% (+6.22%) on GenAI-Bench and 1.57% (+3.09%) on AURORA-Bench compared to the state-of-the-art. It can also enhance image editing models as a reward model, boosting the average evaluation score of edit outputs with respect to ImagenHub from 6.15 to 6.67 (+8.46%). Our code and dataset will be released upon acceptance.
Poster
Xiang Lv · Mingwen Shao · Lingzhuang Meng · Chang Liu · Yecong Wan · Xinyuan Chen
[ Exhibit Hall I ]
Abstract
Recently, text-driven diffusion models have significantly promoted the development of video editing. However, there still remain two practical challenges: (1) existing text-to-video editing methods struggle to understand negative text prompts, resulting in ineffective suppression of undesirable content in the edited video; (2) these methods struggle to maintain the temporal consistency of the edited video, leading to inter-frame flickering. To address the above challenges, we propose SUV, a novel semantic modulation method based on text embeddings to suppress undesired content in the edited video. Specifically, on the one hand, we discover that the end embeddings (EE) contain substantially coupled positive and negative embeddings, which is the primary reason for the appearance of undesirable content in the edited video. Based on this discovery, we advocate decoupling the negative embeddings from the EE by employing singular value decomposition and propose an exponential suppression operator to decrease the singular values of negative embeddings, thereby restraining the effect of negative embeddings on the edited video content. Subsequently, two constraints are designed to further suppress negative content while keeping positive content unchanged, by pushing negative embeddings apart and pulling positive embeddings closer. On the other hand, to boost the temporal consistency of the edited video, we devise …
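A minimal sketch of the core decoupling-and-suppression idea: decompose the embeddings attributed to undesired content with SVD and shrink their singular values exponentially. The input `neg_emb` and the particular suppression schedule are assumptions for illustration, not the paper's exact operator.

```python
import torch

def suppress_negative(neg_emb: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """neg_emb: (num_tokens, dim) embeddings associated with undesired content.
    Returns a version whose dominant modes are exponentially suppressed."""
    U, S, Vh = torch.linalg.svd(neg_emb, full_matrices=False)
    S_suppressed = S * torch.exp(-alpha * S / S.max())   # shrink large singular values most
    return U @ torch.diag(S_suppressed) @ Vh

# Usage idea: replace the negative part of the end embeddings with the
# suppressed version before denoising, keeping the positive embeddings untouched.
```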
Poster
Haonan Wang · Qixiang ZHANG · Lehan Wang · Xuanqi Huang · Xiaomeng Li
[ Exhibit Hall I ]
Abstract
Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI-to-video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre-trained text-to-video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state-of-the-art baselines, achieving solid improvements in video consistency (26.6%) and semantic-level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications. The code will be released upon acceptance.
Poster
Teng-Fang Hsiao · Bo-Kai Ruan · Yi-Lun Wu · Tzu-Ling Lin · Hong-Han Shuai
[ Exhibit Hall I ]
Abstract
Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking—this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
Poster
Wang Ziye · Minghang Yu · Chunyan Xu · Zhen Cui
[ Exhibit Hall I ]
Abstract
With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned concepts of pre-trained models are critical for identifying forged images. However, misalignment between the forgery and concept spaces hinders the model's forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction techniques to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision-language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery and concepts. A concept-level forgery discrepancy learning module, based on reconstruction, enhances the interaction between concepts and forgeries, effectively capturing discrepancies under the concepts' guidance. Finally, the low-level forged feature enhancement integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods.
Poster
Shyamgopal Karthik · Huseyin Coskun · Zeynep Akata · Sergey Tulyakov · Jian Ren · Anil Kag
[ Exhibit Hall I ]
Abstract
Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset "Syn-Pic" improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.
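As a rough illustration of ranking feedback, the sketch below sums a DPO-style logistic loss over all ordered pairs implied by a reward-model ranking. The pairwise-sum form is an assumption rather than the paper's exact objective, and for diffusion models the per-image log-likelihoods would in practice be replaced by the usual denoising-loss surrogates.

```python
import torch
import torch.nn.functional as F

def rank_dpo_loss(logp_policy: torch.Tensor, logp_ref: torch.Tensor, beta: float = 0.5):
    """
    logp_policy, logp_ref: (K,) log-likelihood proxies of K images for one prompt,
    already sorted from most to least preferred by the reward model.
    Sums the DPO logistic loss over all ordered (better, worse) pairs.
    """
    loss, pairs = 0.0, 0
    K = logp_policy.shape[0]
    for i in range(K):
        for j in range(i + 1, K):
            # implicit reward margin between the better (i) and worse (j) sample
            margin = beta * ((logp_policy[i] - logp_ref[i]) -
                             (logp_policy[j] - logp_ref[j]))
            loss = loss - F.logsigmoid(margin)
            pairs += 1
    return loss / max(pairs, 1)
```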
Poster
Yilei Jiang · Wei-Hong Li · Yiyuan Zhang · Minghong Cai · Xiangyu Yue
[ Exhibit Hall I ]
Abstract
While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias present in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods for debiasing DMs usually require model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets incurs expensive annotation costs; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference datasets. Specifically, FairGen consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for re-training. …
Poster
Habin Lim · Youngseob Won · Juwon Seo · Gyeong-Moon Park
[ Exhibit Hall I ]
Abstract
In recent years, multi-concept personalization for text-to-image (T2I) diffusion models to represent several subjects in an image has gained increasing attention. The main challenge of this task is “concept mixing”, where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, in this paper, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Based on our empirical analysis, we found that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference.
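A minimal sketch of value-only adaptation in the spirit of ToVA: freeze the denoiser and train only the value projections of its cross-attention layers. The module naming (`attn2`, `to_v`) follows the common diffusers convention and is an assumption about the underlying implementation.

```python
def enable_value_only_training(unet):
    """Freeze all UNet weights, then re-enable gradients only on the
    cross-attention value projections; returns the trainable parameters."""
    for p in unet.parameters():
        p.requires_grad_(False)
    trainable = []
    for name, module in unet.named_modules():
        if name.endswith("attn2.to_v"):          # cross-attention value projection
            for p in module.parameters():
                p.requires_grad_(True)
                trainable.append(p)
    return trainable  # pass these to the optimizer
```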
Poster
Qihang Yu · Ju He · Xueqing Deng · Xiaohui Shen · Liang-Chieh (Jay) Chen
[ Exhibit Hall I ]
Abstract
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence, typically ordered in raster form, is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made publicly available.
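The annealed permutation schedule is simple enough to sketch directly. The helper below chooses a factorization order for one training example, assuming the token sequence and its positional information are permuted accordingly before next-token prediction; the linear decay follows the description above.

```python
import torch

def rar_training_order(seq_len: int, step: int, total_steps: int) -> torch.Tensor:
    """
    Returns the factorization order for one training example.
    With probability r, annealed linearly from 1 to 0 over training,
    the raster order is replaced by a random permutation.
    """
    r = max(0.0, 1.0 - step / total_steps)   # linear decay 1 -> 0
    if torch.rand(()) < r:
        return torch.randperm(seq_len)       # random factorization order
    return torch.arange(seq_len)             # standard raster order
```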
Poster
Dongwon Kim · Ju He · Qihang Yu · Chenglin Yang · Xiaohui Shen · Suha Kwak · Liang-Chieh (Jay) Chen
[ Exhibit Hall I ]
Abstract
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce **T**ext-**A**ware **T**ransformer-based 1-D**i**mensional **Tok**enizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image **Mask**ed **Gen**erative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
Poster
Yitian Zhang · Long Mai · Aniruddha Mahapatra · David Bourgin · Yicong Hong · Jonah Casebeer · Feng Liu · Yun Fu
[ Exhibit Hall I ]
Abstract
We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32× (8× higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
Poster
Wenda SHI · Yiren Song · Dengming Zhang · Jiaming Liu · XINGXING ZOU
[ Exhibit Hall I ]
Abstract
Visual text rendering is widespread in various real-world applications, requiring careful font selection and typographic choices. Recent progress in diffusion transformer (DiT)-based text-to-image (T2I) models shows promise in automating these processes. However, these methods still encounter challenges like inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline to address these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), a parameter-efficient fine-tuning method (on 5% of key parameters) with enclosing typography control tokens (ETC-tokens), which enables precise word-level application of typographic features. To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency. To implement TC-FT and SCA effectively, we incorporate HTML rendering into the data synthesis pipeline and propose the first word-level controllable dataset. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in text rendering tasks. The datasets and models will be available for academic use.
Poster
Xiangxiang Chu · Renda Li · Yong Wang
[ Exhibit Hall I ]
Abstract
Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available.
Poster
Hui Zhang · Dexiang Hong · Yitong Wang · Jie Shao · Xinglong Wu · Zuxuan Wu · Yu-Gang Jiang
[ Exhibit Hall I ]
Abstract
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has been devoted to Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box …
Poster
Yachun Mi · Yu Li · Weicheng Meng · Chaofeng Chen · Chen Hui · Shaohui Liu
[ Exhibit Hall I ]
Abstract
The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, lightweight Convolutional Neural Networks (CNNs) and Transformers often struggle to balance efficiency with high performance due to the requirement of long-range modeling capabilities. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak in preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To prevent computation increase from dual inputs, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieves comparable performance to state-of-the-art methods …
Poster
Dongyeun Lee · jiwan hur · Hyounguk Shon · Jae Young Lee · Junmo Kim
[ Exhibit Hall I ]
Abstract
Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing PTQ techniques, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model …
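A minimal sketch of channel-wise power-of-two scaling for activations; choosing the scale from the per-channel absolute maximum of a calibration batch is a simple stand-in for the paper's voting procedure, and the quantizer is a plain symmetric uniform one.

```python
import torch

def power_of_two_scales(activations: torch.Tensor) -> torch.Tensor:
    """activations: (N, C, ...) calibration activations.
    Returns one power-of-two scale per channel."""
    dims = tuple(d for d in range(activations.dim()) if d != 1)
    amax = activations.abs().amax(dim=dims)                       # per-channel max magnitude
    return torch.exp2(torch.round(torch.log2(amax.clamp(min=1e-8))))

def quantize_activation(x: torch.Tensor, scales: torch.Tensor, n_bits: int = 6):
    """Symmetric uniform quantization with per-channel power-of-two scales."""
    qmax = 2 ** (n_bits - 1) - 1
    shape = [1, -1] + [1] * (x.dim() - 2)                         # broadcast per-channel scale
    s = scales.view(shape)
    return torch.clamp(torch.round(x / s * qmax), -qmax - 1, qmax) * s / qmax
```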
Poster
Zizhuo Li · Yifan Lu · Linfeng Tang · Shihua Zhang · Jiayi Ma
[ Exhibit Hall I ]
Abstract
This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. Firstly, observing that modeling context interaction over the entire coarse feature map elicits highly redundant computation due to the neighboring representation similarity of tokens, a covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores that are dynamically estimated, thereby ensuring computational efficiency while simultaneously improving the representational capacity of aggregated tokens. Secondly, considering that feature interaction with massive non-covisible areas is distracting, which may degrade feature distinctiveness, a covisibility-assisted attention mechanism is deployed to selectively suppress irrelevant message broadcast from non-covisible reduced tokens, resulting in robust and compact attention to relevant tokens rather than all of them. Thirdly, we find that at the fine-level stage, current methods adjust only the target view's keypoints to the subpixel level, while those in the source view remain restricted to the coarse level and are thus not informative enough, which is detrimental to keypoint location-sensitive usage. A simple yet potent fine correlation module is developed to refine the matching candidates in both source and target views to the subpixel level, attaining attractive performance improvements. Thorough experimentation across an array of public benchmarks affirms CoMatch’s promising accuracy, efficiency, …
Poster
Dat Cong · Hieu Tran · Hoang Thanh-Tung
[ Exhibit Hall I ]
Abstract
Diffusion models have gained prominence as state-of-the-art techniques for synthesizing images and videos, particularly due to their ability to scale effectively with large datasets. Recent studies have uncovered that these extensive datasets often contain mistakes from manual labeling processes. However, the extent to which such errors compromise the generative capabilities and controllability of diffusion models is not well studied. This paper introduces Score-based Discriminator Correction (SBDC), a guidance technique for aligning noisy pre-trained conditional diffusion models. The guidance is built on discriminator training using adversarial loss, drawing on prior noise detection techniques to assess the authenticity of each sample. We further show that limiting the usage of our guidance to the early phase of the generation process leads to better performance. Our method is computationally efficient, only marginally increases inference time, and does not require retraining diffusion models. Experiments on different noise settings demonstrate the superiority of our method over previous state-of-the-art methods.
Poster
Junyi Guo · Jingxuan Zhang · Fangyu Wu · Huanda Lu · Qiufeng Wang · Wenmian Yang · ENG Gee LIM · Dongming Lu
[ Exhibit Hall I ]
Abstract
Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
Poster
Xiaomeng Fu · Jia Li
[ Exhibit Hall I ]
Abstract
Diffusion models have achieved remarkable success in image and video generation due to their powerful generative capabilities. However, they suffer from slow inference speed and high computational costs. Existing acceleration methods for diffusion models may compromise model performance and struggle to generalize across diverse diffusion model architectures and downstream tasks. To address these issues, we propose a model-agnostic and highly scalable acceleration strategy for text-controlled image generation. Specifically, we dynamically modulate the text guidance coefficient and truncate redundant text-related computations during the denoising process. Experimental results demonstrate that our approach achieves significant model acceleration while preserving precise text-image alignment, showcasing the potential for a wide range of diffusion models and downstream applications.
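One plausible reading of this strategy is classifier-free guidance with a per-step coefficient, where steps whose coefficient makes the extra unconditional pass nearly irrelevant simply skip it. The sketch below illustrates that reading under stated assumptions; it is not the paper's exact schedule.

```python
def guided_denoise_step(unet, x_t, t, text_emb, null_emb, w_t, tol=0.05):
    """
    Classifier-free guidance with a per-step coefficient w_t.
    When w_t is close to 1, eps_uncond + w_t * (eps_cond - eps_uncond) ~= eps_cond,
    so the unconditional forward pass is skipped to save computation.
    """
    eps_cond = unet(x_t, t, text_emb)
    if abs(w_t - 1.0) < tol:              # guidance contributes little: truncate
        return eps_cond
    eps_uncond = unet(x_t, t, null_emb)
    return eps_uncond + w_t * (eps_cond - eps_uncond)
```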
Poster
Yujie Zhang · Bingyang Cui · Qi Yang · Zhu Li · Yiling Xu
[ Exhibit Hall I ]
Abstract
Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging for two reasons: i) existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions; ii) previous evaluation metrics only focus on a single aspect (e.g., text-3D alignment) and fail to perform multi-dimensional quality assessment. To address these problems, we first propose a comprehensive benchmark named MATE-3D. The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes. We have conducted a large-scale subjective experiment from four different evaluation dimensions and collected 107,520 annotations, followed by detailed analyses of the results. Based on MATE-3D, we propose a novel quality evaluator named HyperScore. Utilizing a hypernetwork to generate specified mapping functions for each evaluation dimension, our metric can effectively perform multi-dimensional quality assessment. HyperScore presents superior performance over existing metrics on MATE-3D, making it a promising metric for assessing and improving text-to-3D generation.
Poster
Wei Xu · Kangjie Chen · Jiawei Qiu · Yuyang zhang · Run Wang · Jin Mao · Tianwei Zhang · Lina Wang
[ Exhibit Hall I ]
Abstract
Text-to-image models have achieved remarkable progress in generating high-quality images from textual prompts, yet their potential for misuse, such as generating unsafe content, remains a critical concern. Existing safety mechanisms, such as filtering and fine-tuning, remain insufficient in preventing vulnerabilities exposed by adversarial prompts. To systematically evaluate these weaknesses, we propose an automated red-teaming framework, Feedback-Guided Prompt Iteration (FGPI), which utilizes a Vision-Language Model (VLM) as the red-teaming agent following a feedback-guide-rewrite paradigm for iterative prompt optimization. The red-teaming VLM analyzes prompt-image pairs based on evaluation results, provides feedback and modification strategies to enhance adversarial effectiveness while preserving safety constraints, and iteratively improves prompts. To enable this functionality, we construct a multi-turn conversational VQA dataset with over 6,000 instances, covering seven attack types and facilitating the fine-tuning of the red-teaming VLM. Extensive experiments demonstrate the effectiveness of our approach, achieving over 90\% attack success rate within five iterations while maintaining prompt stealthiness and safety. The experiments also validate the adaptability, diversity, transferability, and explainability of FGPI. The source code and dataset are available at (*URL omitted for double-blind reviewing; code available in supplementary materials*).
Poster
Seunggwan Lee · Hwanhee Jung · ByoungSoo Koh · Qixing Huang · Sang Yoon · Sangpil Kim
[ Exhibit Hall I ]
Abstract
A fundamental challenge in conditional 3D shape generation is to minimize information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, a flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net, which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.
Poster
Yuqing Wang · Zhijie Lin · Yao Teng · Yuanzhi Zhu · Shuhuai Ren · Jiashi Feng · Xihui Liu
[ Exhibit Hall I ]
Abstract
Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling.
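A minimal sketch of dimension-wise post-training quantization: each feature dimension of a continuous token is discretized independently into a small number of uniform bins, so the autoregressive head only ever predicts per-dimension categorical indices. The bin count and clamping range are illustrative choices, not the paper's settings.

```python
import torch

def dimensionwise_quantize(z: torch.Tensor, n_bins: int = 16, zmax: float = 3.0):
    """z: (..., D) continuous latent tokens, assumed roughly bounded.
    Each dimension is discretized independently into n_bins uniform bins."""
    z = z.clamp(-zmax, zmax)
    idx = torch.round((z + zmax) / (2 * zmax) * (n_bins - 1)).long()
    return idx                              # (..., D) small categorical indices

def dimensionwise_dequantize(idx: torch.Tensor, n_bins: int = 16, zmax: float = 3.0):
    """Map indices back to bin centers for decoding."""
    return idx.float() / (n_bins - 1) * (2 * zmax) - zmax
```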
Poster
Yuanzhi Zhu · Xi WANG · Stéphane Lathuilière · Vicky Kalogeiton
[ Exhibit Hall I ]
Abstract
Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator.Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an `on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
Poster
Trevor Canham · SaiKiran Tedla · Michael Murdoch · Michael Brown
[ Exhibit Hall I ]
Abstract
While most images shared on the web and social media platforms are encoded in standard dynamic range (SDR), many displays can now accommodate high dynamic range (HDR) content. Additionally, modern cameras can capture images in an HDR format but convert them to SDR to ensure maximum compatibility with existing workflows and legacy displays. To support both SDR and HDR, new encoding formats are emerging that store additional metadata in SDR images in the form of a gain map. When applied to the SDR image, the gain map recovers the HDR version of the image as needed. These gain maps, however, are typically down-sampled and encoded using standard image compression, such as JPEG and HEIC, which can result in unwanted artifacts. In this paper, we propose to use a lightweight multi-layer perceptron (MLP) network to encode the gain map. The MLP is optimized using the SDR image information as input and provides superior performance in terms of HDR reconstruction. Moreover, the MLP-based approach uses a fixed memory footprint (10 KB) and requires no additional adjustments to accommodate different image sizes or encoding parameters. We conduct extensive experiments on various MLP-based HDR embedding strategies and demonstrate that our approach outperforms the …
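A minimal sketch of the idea of encoding a gain map with a tiny MLP conditioned on SDR information. The input features (RGB plus normalized coordinates), the layer sizes, and the log2-gain convention are assumptions chosen for illustration and to keep the parameter count in the same small-footprint regime as the reported 10 KB budget.

```python
import torch
import torch.nn as nn

class GainMapMLP(nn.Module):
    """Tiny per-pixel MLP: SDR RGB (+ normalized x, y) -> log2 gain."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sdr_rgb: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
        # sdr_rgb: (N, 3) in [0, 1]; xy: (N, 2) normalized pixel coordinates
        return self.net(torch.cat([sdr_rgb, xy], dim=-1))

# One common gain-map convention for reconstruction: hdr = sdr * 2 ** log2_gain
```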
Poster
Tu Bui · Shruti Agarwal · John Collomosse
[ Exhibit Hall I ]
Abstract
Imperceptible digital watermarking is important in copyright protection, misinformation prevention, and responsible generative AI. We propose TrustMark, a watermarking method that leverages a spatio-spectral loss function and a 1x1 convolution layer to enhance encoding quality. TrustMark is robust against both in-place and out-of-place perturbations while maintaining image quality above 43 dB. Additionally, we propose TrustMark-RM, a watermark removal method designed for re-watermarking, along with a simple yet effective algorithm that enables both TrustMark and TrustMark-RM to operate seamlessly across arbitrary resolutions. Our methods achieve state-of-the-art performance on three benchmarks. Models and code are released under the MIT license, and an anonymized version is included for review.
Poster
Hongyang Wei · Shuaizheng Liu · Chun Yuan · Lei Zhang
[ Exhibit Hall I ]
Abstract
By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model, such as Lumina-mGPT, into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present an entropy-based Top-$k$ sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for …
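A minimal sketch of entropy-adaptive Top-$k$ sampling over image-token logits: flatter (higher-entropy) distributions get a larger k, peaked ones a smaller k. The linear mapping from normalized entropy to k is an illustrative choice, not the paper's exact rule.

```python
import torch

def entropy_topk_sample(logits: torch.Tensor, k_low: int = 2, k_high: int = 200):
    """logits: (V,) next image-token logits; returns one sampled token index."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    ratio = (entropy / max_entropy).clamp(0, 1)          # 0 = peaked, 1 = uniform
    k = max(1, min(int(k_low + ratio * (k_high - k_low)), logits.numel()))
    topk = torch.topk(probs, k)
    choice = torch.multinomial(topk.values / topk.values.sum(), 1)
    return topk.indices[choice]
```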
Poster
Ling Lo · Kelvin Chan · Wen-Huang Cheng · Ming-Hsuan Yang
[ Exhibit Hall I ]
Abstract
Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In contrast, we extend the model to generate smooth and consistent attribute transitions by introducing frame-wise guidance for the video latent during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions.
Poster
Youwei Zheng · Yuxi Ren · Xin Xia · Xuefeng Xiao · Xiaohua Xie
[ Exhibit Hall I ]
Abstract
Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5\%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and a group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60\% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
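A minimal sketch of swapping a dense FFN for a sparsely activated mixture of experts. Splitting the hidden width across 8 experts and routing each token to 3 of them activates 37.5% of the FFN parameters per token, one configuration consistent with the reported 62.5% reduction; the exact split is an assumption, and the per-expert loop below computes densely for clarity rather than efficiency.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Sparse drop-in replacement for a dense FFN: route each token to top_k experts."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 8, top_k: int = 3):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden // num_experts), nn.GELU(),
                          nn.Linear(hidden // num_experts, dim))
            for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, tokens, dim)
        gates = torch.softmax(self.router(x), dim=-1)      # routing probabilities
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                              # tokens routed to expert e
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)
                out = out + w * expert(x)                  # weight is zero where not routed
        return out
```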
Poster
Fangfu Liu · Hanyang Wang · Yimo Cai · Kaiyan Zhang · Xiaohang Zhan · Yueqi Duan
[ Exhibit Hall I ]
Abstract
With the scaling capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from the Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the searching process. Given a text prompt, we first explore an intuitive linear search strategy by increasing the number of noise candidates at inference time. As full-step denoising of all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively …
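The linear search strategy reduces to a best-of-N loop over initial noises, sketched below. `generate` and `verifier` stand in for a full video diffusion sampler and a test-time feedback model; both names and the latent shape are assumptions of this illustration.

```python
import torch

def linear_noise_search(generate, verifier, prompt, num_candidates=8, shape=(16, 64, 64)):
    """Test-time scaling by linear search: draw several Gaussian seeds,
    fully denoise each, and keep the video the verifier scores highest."""
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn(shape)
        video = generate(prompt, noise)      # full denoising trajectory
        score = verifier(prompt, video)      # scalar feedback from the verifier
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```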
Poster
shaojin wu · Mengqi Huang · wenxu wu · Yufeng Cheng · Fei Ding · Qian HE
[ Exhibit Hall I ]
Abstract
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to address these challenges. It leverages the intrinsic in-context generation capabilities of diffusion transformers. Additionally, we introduce $UNO$, which consists of progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
Poster
Shengrong Yuan · Runmin Wang · Ke Hao · Xu-Qi Ma · Changxin Gao · Li Liu · Nong Sang
[ Exhibit Hall I ]
Abstract
Scene text image super-resolution (STISR) focuses on enhancing the clarity and readability of low-resolution text images. Existing methods often rely on text probability distribution priors derived from text recognizers to guide the super-resolution process. While effective in capturing general structural information of text, these priors lack the ability to preserve specific text style details, such as font, stereoscopic effect and spatial transformation, leading to a loss of visual quality and stylistic consistency in the super-resolved images. To address these limitations, we propose a Style embedding-based scene text image Super-Resolution Network (StyleSRN), which introduces a text style embedding mechanism to preserve and enhance text style features during the super-resolution process. The proposed architecture includes a Style Enhancement Block for capturing multi-scale cross-channel dependencies, and a Style Content Fusion Block that effectively integrates text content with style information, ensuring that the structure and style of the restored text are not distorted. Furthermore, we introduce a Text Style Loss based on the Gram matrix to supervise the reconstruction process at the style level, thereby maintaining the stylistic consistency of the restored text images. Extensive experiments on the TextZoom dataset and five scene text recognition benchmarks demonstrate the superiority of our method. The code …
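The Gram-matrix style supervision mentioned above can be illustrated with a short, generic sketch; the feature extractor, the choice of layers, and the loss weighting used by StyleSRN are not specified here and are left as assumptions.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) feature map -> (B, C, C) channel-correlation Gram matrix."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def gram_style_loss(sr_feats, hr_feats) -> torch.Tensor:
    """Mean squared difference of Gram matrices over a list of feature maps from
    the super-resolved and ground-truth text images (generic sketch)."""
    losses = [F.mse_loss(gram_matrix(a), gram_matrix(b)) for a, b in zip(sr_feats, hr_feats)]
    return torch.stack(losses).mean()

# Example with random stand-in features from two layers of any convolutional extractor.
sr = [torch.randn(1, 64, 16, 64), torch.randn(1, 128, 8, 32)]
hr = [torch.randn(1, 64, 16, 64), torch.randn(1, 128, 8, 32)]
print(gram_style_loss(sr, hr))
```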
Poster
Wei Chen · Jingxi Yu · Zichen Miao · Qiang Qiu
[ Exhibit Hall I ]
Abstract
Large pre-trained transformers have revolutionized artificial intelligence across various domains, and fine-tuning remains the dominant approach for adapting these models to downstream tasks due to the cost of training from scratch. However, in existing fine-tuning methods, the updated representations are formed as a dense combination of modified parameters, making it challenging to interpret their contributions and understand how the model adapts to new tasks. In this work, we introduce a fine-tuning framework inspired by sparse coding, where fine-tuned features are represented as a sparse combination of basic elements, i.e., feature dictionary atoms. Sparse coefficients then serve as indicators of atom importance, identifying the contribution of each atom to the updated representation. The feature dictionary atoms function as fundamental building blocks of the representation, and tuning atoms allows for seamless adaptation to downstream tasks. Leveraging the atom selection capability of sparse coefficients, we first demonstrate that our method enhances image editing performance by improving text alignment through the removal of unimportant feature dictionary atoms. Additionally, we validate the effectiveness of our approach in the text-to-image concept customization task, where our method efficiently constructs the target concept using a sparse combination of feature dictionary atoms, outperforming various baseline fine-tuning methods.
Poster
Xinli Xu · Wenhang Ge · Jiantao Lin · Jiawei Feng · Lie XU · hanfeng Zhao · Shunsi Zhang · Ying-Cong Chen
[ Exhibit Hall I ]
Abstract
In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, a text prompt, or both. FlexGen tackles the challenges of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We utilize the strong reasoning capabilities of GPT-4V to generate 3D-aware text annotations. By analyzing four orthogonal views of an object arranged as tiled multi-view images, GPT-4V can produce text annotations that include 3D-aware information with spatial relationships. By integrating the control signal with the proposed adaptive dual-control module, our model can generate multi-view images that correspond to the specified text. FlexGen supports multiple controllable capabilities, allowing users to modify text prompts to generate reasonable and corresponding unseen parts. Additionally, users can influence attributes such as appearance and material properties, including metallic and roughness. Extensive experiments demonstrate that our approach offers enhanced multiple controllability, marking a significant advancement over existing multi-view diffusion models. This work has substantial implications for fields requiring rapid and flexible 3D content creation, including game development, animation, and virtual reality.
Poster
Yatai Ji · Jiacheng Zhang · Jie Wu · Shilong Zhang · Shoufa Chen · Chongjian GE · Peize Sun · Weifeng Chen · Wenqi Shao · Xuefeng Xiao · Weilin Huang · Ping Luo
[ Exhibit Hall I ]
Abstract
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
Poster
Delong Zhang · Qiwei Huang · Yang Sun · Yuanliu Liu · Wei-Shi Zheng · Pengfei Xiong · Wei Zhang
[ Exhibit Hall I ]
Abstract
Diffusion-based virtual try-on aims to synthesize a realistic image that seamlessly integrates the specific garment into a target model. The primary challenge lies in effectively guiding the warping process of the diffusion model. However, previous methods either lack direct guidance or explicitly warp the garment image, which highly depends on the performance of the warping module. In this paper, we propose FIA-VTON, which leverages the \textbf{implicit} flow feature as guidance by adopting a Flow Infused Attention module for virtual try-on. The dense warp flow map is projected as indirect guidance attention to enhance the feature map warping in the generation process implicitly, which is less sensitive to the warping estimation accuracy than an explicit warp of the garment image. To further enhance implicit warp guidance, we incorporate high-level spatial attention to complement the dense warp. Experimental results on the VTON-HD and DressCode datasets show that our method significantly outperforms state-of-the-art methods, demonstrating that FIA-VTON is effective and robust for virtual try-on.
Poster
Ziyin Zhou · Yunpeng Luo · Yuanchen Wu · Ke Sun · Jiayi Ji · Ke Yan · Shouhong Ding · Xiaoshuai Sun · Yunsheng Wu · Rongrong Ji
[ Exhibit Hall I ]
Abstract
The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization to the latest generation technologies. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on …
Poster
Sung Ju Lee · Nam Ik Cho
[ Exhibit Hall I ]
Abstract
Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Our code will be publicly available soon.
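The Hermitian-symmetry constraint itself can be shown in a few lines of NumPy: projecting a 2D spectrum onto the Hermitian-symmetric subspace guarantees a purely real inverse FFT. This is only the symmetry operation; SFW's watermark pattern, embedding location, and center-aware strategy are not reproduced here.

```python
import numpy as np

def hermitian_symmetrize(spectrum: np.ndarray) -> np.ndarray:
    """Project a 2D complex spectrum onto the Hermitian-symmetric subspace,
    i.e. F[k] = conj(F[-k]), so that its inverse FFT is purely real."""
    flipped = np.roll(np.flip(spectrum), shift=(1, 1), axis=(0, 1))  # F[-k], DC bin kept at (0, 0)
    return 0.5 * (spectrum + np.conj(flipped))

rng = np.random.default_rng(0)
spec = np.fft.fft2(rng.standard_normal((64, 64))) + 1j * rng.standard_normal((64, 64))
sym = hermitian_symmetrize(spec)                 # symmetric version of the perturbed spectrum
print(np.abs(np.fft.ifft2(sym).imag).max())      # ~1e-15: the spatial signal is real again
```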
Poster
Tianwei Xiong · Jun Hao Liew · Zilong Huang · Jiashi Feng · Xihui Liu
[ Exhibit Hall I ]
Abstract
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality—a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
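The semantic regularization described above can be sketched as a cosine-alignment loss between tokenizer features and frozen features from a pre-trained encoder. The projection layer, feature layout, and dimensions below are assumptions, not GigaTok's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_regularization(tok_feats: torch.Tensor,
                            teacher_feats: torch.Tensor,
                            proj: nn.Module) -> torch.Tensor:
    """Cosine-alignment loss between (projected) tokenizer features and frozen
    features from a pre-trained semantic encoder (generic sketch)."""
    a = F.normalize(proj(tok_feats), dim=-1)          # (B, N, D_teacher)
    b = F.normalize(teacher_feats.detach(), dim=-1)   # teacher provides targets only
    return (1.0 - (a * b).sum(dim=-1)).mean()

# Example: 196 tokens, tokenizer width 256 projected to a 768-dim teacher space;
# the weighted loss would be added to the tokenizer's reconstruction objective.
proj = nn.Linear(256, 768)
loss = semantic_regularization(torch.randn(2, 196, 256), torch.randn(2, 196, 768), proj)
```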
Poster
Francesco Taioli · Edoardo Zorzi · Gianni Franchi · Alberto Castellini · Alessandro Farinelli · Marco Cristani · Yiming Wang
[ Exhibit Hall I ]
Abstract
Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free, open-ended dialogues with the human. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy, and focuses on the human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA …
Poster
Duong T. Tran · Trung-Kien Tran · Manfred Hauswirth · Danh Le-Phuoc
[ Exhibit Hall I ]
Abstract
In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
Poster
Achint Soni · Meet Soni · Sirisha Rambhatla
[ Exhibit Hall I ]
Abstract
Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain the image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce \textbf{LOCATEdit}, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks.
Poster
Mang Cao · Sanping Zhou · Yizhe Li · Ye Deng · Wenli Huang · Le Wang
[ Exhibit Hall I ]
Abstract
Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scan mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scan modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan~(MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.
Poster
Trong Bang Nguyen · Phi Le Nguyen · Simon Lucey · Minh Hoai
[ Exhibit Hall I ]
Abstract
Data attribution in text-to-image generative models is a crucial yet underexplored problem, particularly at the regional level, where identifying the most influential training regions for generated content can enhance transparency, copyright protection, and error diagnosis. Existing data attribution methods either operate at the whole-image level or lack scalability for large-scale generative models. In this work, we propose a novel framework for region-level data attribution. At its core is the Attribution Region (AR) detector, which localizes influential regions in training images used by the text-to-image generative model. To support this research, we construct a large-scale synthetic dataset with ground-truth region-level attributions, enabling both training and evaluation of our method. Empirical results show that our method outperforms existing attribution techniques in accurately tracing generated content back to training data. Additionally, we demonstrate practical applications, including identifying artifacts in generated images and suggesting improved replacements for generated content. Our dataset and framework will be released to advance further research in region-level data attribution for generative models.
Poster
Yuqi Li · Haotian Zhang · Li Li · Dong Liu
[ Exhibit Hall I ]
Abstract
Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step to effectively exploit diverse contextual information. Experimental results demonstrate our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity.
Poster
Joowon Kim · Ziseok Lee · Donghyeon Cho · Sanghyun Jo · Yeonsung Jung · Kyungsu Kim · Eunho Yang
[ Exhibit Hall I ]
Abstract
Despite recent advances in diffusion models, achieving reliable image generation and editing results remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. In particular, instruction-guided image editing with diffusion models offers user-friendly editing capabilities, yet editing failures, such as background distortion, frequently occur across different attempts. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting their applicability, and evaluating multiple seeds increases computational complexity, reducing practicality. To address this, we first establish a new multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while fully preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed + prompt selection, further improving results when …
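The early-timestep seed selection can be pictured as ranking candidate seeds by a background mismatch score, as in the sketch below. The score shown is a simplified stand-in for the paper's criterion, and `early_predict` is an assumed callable that runs only a few denoising steps per seed and returns a predicted clean image.

```python
import torch

def background_inconsistency(pred_x0: torch.Tensor,
                             source_img: torch.Tensor,
                             background_mask: torch.Tensor) -> torch.Tensor:
    """Mean absolute background difference between an early-timestep prediction
    of the edited image and the source image (lower is better; simplified)."""
    diff = (pred_x0 - source_img).abs() * background_mask
    return diff.sum() / background_mask.sum().clamp(min=1)

def select_seed(seed_candidates, early_predict, source_img, background_mask):
    """Rank candidate seeds by background inconsistency and return the best one."""
    scores = {s: background_inconsistency(early_predict(s), source_img, background_mask).item()
              for s in seed_candidates}
    best = min(scores, key=scores.get)
    return best, scores
```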
Poster
Tong Wei · Yijun Yang · Junliang Xing · Yuanchun Shi · Zongqing Lu · Deheng Ye
[ Exhibit Hall I ]
Abstract
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates compared to SoTA models with notably smaller model sizes.
Poster
Jialu Gao · Joseph K J · Fernando De la Torre
[ Exhibit Hall I ]
Abstract
The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and poses of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat them as separate problems, overlooking their interconnections, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject’s identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertions into scenes in a training-free manner and achieve state-of-the-art results in diverse composite scene images with excellent identity preservation in backgrounds and subjects.
Poster
Aniruddha Bala · Rohit Chowdhury · Rohan Jaiswal · Siddharth Roheda
[ Exhibit Hall I ]
Abstract
Advancements in diffusion models have enabled effortless image editing via text prompts, raising concerns about image security. Attackers with access to user images can exploit these tools for malicious edits. Recent defenses attempt to protect images by adding limited noise in the pixel space to disrupt the functioning of diffusion-based editing models. However, the adversarial noise added by previous methods is easily noticeable to the human eye. Moreover, most of these methods are not robust to purification techniques like JPEG compression under a feasible pixel budget. We propose a novel optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying the Discrete Cosine Transform (DCT) coefficients of the input image. By leveraging the JPEG pipeline, our method generates adversarial images that effectively prevent malicious image editing. Extensive experiments across a variety of tasks and datasets demonstrate that our approach introduces fewer visual artifacts while maintaining similar levels of edit protection and robustness to noise purification techniques.
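The frequency-domain perturbation idea can be illustrated with a minimal sketch that adds a clipped perturbation to an image's DCT coefficients. It omits the attack objective, the 8x8 JPEG block structure, and the optimization loop; the budget value and function names are assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def perturb_dct(image: np.ndarray, delta: np.ndarray, budget: float = 2.0) -> np.ndarray:
    """Add a clipped perturbation to the DCT coefficients of a grayscale image
    and return the pixel-space result (illustrative sketch)."""
    coeffs = dctn(image.astype(np.float64), norm="ortho")
    adv = idctn(coeffs + np.clip(delta, -budget, budget), norm="ortho")
    return np.clip(adv, 0.0, 255.0)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
adv = perturb_dct(img, rng.standard_normal((64, 64)))   # `delta` would come from an optimizer
```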
Poster
Junlong Tong · Wei Zhang · Yaohui Jin · Xiaoyu Shen
[ Exhibit Hall I ]
Abstract
Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context for entropy models often relies on intricate model designs, increasing complexity and computational costs. Furthermore, entropy models employing autoregressive or checkerboard strategies fail to model the significance of spatial context order, potentially limiting the availability of relevant contextual information during decoding. To address these issues, we propose the context guided transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal and importance-weighted spatial contexts. The temporal context resampler learns predefined latent queries and utilizes transformer encoders to fuse the resampled critical information while reducing subsequent computational overhead. Subsequently, we design a teacher-student network to explicitly model the importance of spatial context order. During training, the teacher network generates an attention map (i.e., importance scores) and an entropy map (i.e., confidence scores) from randomly masked inputs, guiding the student network to select top-k weighted decoding tokens as subsequent contextual information. During inference, only the student network is employed, utilizing high-importance and high-confidence tokens to guide the prediction of the remaining undecoded tokens. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65\% and lowers the BD …
Poster
Jinghao Wang · Zhang Li · Zi Wang · Banglei Guan · Yang Shang · Qifeng Yu
[ Exhibit Hall I ]
Abstract
Recently, 6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases. 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling, providing compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
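The inductive conformal step can be sketched as calibrating a quantile of scale-normalized keypoint errors on held-out data, under the usual exchangeability assumption. This is a generic split-conformal recipe, not the paper's full construction, which additionally propagates the 2D regions into 6D pose regions via the implicit function theorem.

```python
import numpy as np

def conformal_keypoint_radius(pred_kpts: np.ndarray, pred_sigmas: np.ndarray,
                              true_kpts: np.ndarray, alpha: float = 0.1) -> float:
    """Calibrated multiplier q such that discs of radius q * sigma around the
    predicted keypoints reach ~(1 - alpha) coverage on exchangeable data."""
    errors = np.linalg.norm(pred_kpts - true_kpts, axis=-1)   # (N, K) pixel errors
    scores = (errors / pred_sigmas).ravel()                   # scale-normalized nonconformity
    n = scores.size
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)    # finite-sample correction
    return float(np.quantile(scores, q_level))

# Stand-in calibration data: 200 images, 8 keypoints each, regressed scales in [1, 3].
rng = np.random.default_rng(0)
q = conformal_keypoint_radius(rng.normal(0, 1, (200, 8, 2)),
                              rng.uniform(1.0, 3.0, (200, 8)),
                              rng.normal(0, 1, (200, 8, 2)))
```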
Poster
Richard Liu · Daniel Fu · Noah Tan · Itai Lang · Rana Hanocka
[ Exhibit Hall I ]
Abstract
In this work we present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
Poster
Enis Simsar · Alessio Tonioni · Yongqin Xian · Thomas Hofmann · Federico Tombari
[ Exhibit Hall I ]
Abstract
We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. Existing methods rely on supervised learning with triplets of input images, ground-truth edited images, and edit instructions. These triplets are typically generated either by existing editing methods—introducing biases—or through human annotations, which are costly and limit generalization. Our approach addresses these challenges by introducing a novel editing mechanism called Edit Reversibility Constraint (ERC), which applies forward and reverse edits in one training step and enforces alignment in image, text, and attention spaces. This allows us to bypass the need for ground-truth edited images and unlock training for the first time on datasets comprising either real image-caption pairs or image-caption-instruction triplets. We empirically show that our approach performs better across a broader range of edits with high-fidelity and precision. By eliminating the need for pre-existing datasets of triplets, reducing biases associated with current methods, and proposing ERC, our work represents a significant advancement in unblocking scaling of instruction-based image editing.
Poster
Zeyu Liu · Zanlin Ni · Yeguo Hua · Xin Deng · Xiao Ma · Cheng Zhong · Gao Huang
[ Exhibit Hall I ]
Abstract
Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both \emph{compressing} visual signals into a compact representation and \emph{discretizing} them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce \textbf{CODA} (\textbf{CO}ntinuous-to-\textbf{D}iscrete \textbf{A}daptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs---already optimized for perceptual compression---into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with $\mathbf{6 \times}$ less training budget compared to standard VQGAN, our approach achieves a remarkable codebook utilization of \textbf{100\%} and notable reconstruction FID (rFID) of $\mathbf{0.43}$ and $\mathbf{1.34}$ for $8 \times$ and $16 \times$ compression respectively.
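The continuous-to-discrete idea can be pictured, in its most basic form, as nearest-neighbor quantization with a straight-through gradient applied on top of frozen VAE latents. CODA's carefully designed discretization process is more elaborate, so the class below is only a generic baseline sketch with assumed sizes.

```python
import torch
import torch.nn as nn

class LatentQuantizer(nn.Module):
    """Nearest-neighbor quantization of continuous latents with a straight-through
    gradient, applied on top of a frozen VAE's outputs (generic sketch)."""
    def __init__(self, codebook_size: int = 1024, dim: int = 16):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):             # z: (B, N, dim) continuous latents
        w = self.codebook.weight                    # (K, dim)
        d = (z.pow(2).sum(-1, keepdim=True)         # squared distances to every code
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                      # nearest code per latent
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                # straight-through estimator
        return z_q, idx

quantizer = LatentQuantizer()
z_q, codes = quantizer(torch.randn(2, 64, 16))      # quantized latents + discrete token ids
```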
Poster
Yiyang Wang · Xi Chen · Xiaogang Xu · Sihui Ji · Yu Liu · Yujun Shen · Hengshuang Zhao
[ Exhibit Hall I ]
Abstract
In spite of recent progress, image diffusion models still produce artifacts. A common solution is to leverage the feedback provided by quality assessment systems or human annotators to optimize the model, where images are generally rated in their entirety. In this work, we believe $\textbf{problem-solving starts with identification}$: the model should be aware of not just the presence of defects in an image, but also their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to optimize the diffusion model by providing pixel-level feedback. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.
Poster
Ruidong Chen · honglin guo · Lanjun Wang · Chenyu Zhang · Weizhi Nie · Anan Liu
[ Exhibit Hall I ]
Abstract
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to enable the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an efficient trade-off between reliable erasure and knowledge preservation. First, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying an effective mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the …
Poster
Hengrui Kang · Siwei Wen · Zichen Wen · Junyan Ye · Weijia Li · Peilin Feng · Baichuan Zhou · Bin Wang · Dahua Lin · Linfeng Zhang · Conghui He
[ Exhibit Hall I ]
Abstract
The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model~(MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31\% in mIoU and 7.75\% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. …
Poster
Zheng-Peng Duan · jiawei zhang · Xin Jin · Ziheng Zhang · Zheng Xiong · Dongqing Zou · Jimmy Ren · Chun-Le Guo · Chongyi Li
[ Exhibit Hall I ]
Abstract
Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. The code will be available to the community.
Poster
Mian Zou · Nan Zhong · Baosheng Yu · Yibing Zhan · Kede Ma
[ Exhibit Hall I ]
Abstract
Supervised learning has been the dominant approach for developing detectors of AI-generated face images. However, the reliance on pre-generated face samples often limits the adaptability to the diverse and rapidly evolving landscape of AI face generators. Here, we propose a bi-level optimization framework for self-supervised AI-generated face detection, relying solely on photographic images and aligning the pretext tasks with the downstream AI face detection. The inner loop optimization aims to train a feature extractor using linearly weighted objectives of several pretext tasks, including classifying categorical exchangeable image file format (EXIF) tags, ranking ordinal EXIF tags, and identifying global and local face manipulations. The outer loop optimization treats the coarse-grained detection of face manipulations as a surrogate task for AI-generated image detection, directing the feature extractor to adapt to detecting AI faces by optimizing the linear weightings to align the task relationships. To validate the effectiveness of our self-supervised features, we first frame AI-generated face detection as one-class classification, and model the feature distribution of photographic images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Additionally, we train a two-layer perceptron based on the extracted self-supervised features as a simple binary classifier. We demonstrate by comprehensive …
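The one-class formulation described above can be sketched with scikit-learn: fit a Gaussian mixture to features of photographic images and flag low-likelihood samples. The feature extractor is the paper's self-supervised model and is not reproduced here; the component count, threshold percentile, and stand-in features are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit the photographic-feature model (stand-in features; the real ones come from
# the self-supervised extractor).
rng = np.random.default_rng(0)
photo_features = rng.normal(0.0, 1.0, size=(500, 16))
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(photo_features)

# Choose a threshold from held-out photographic log-likelihoods, then flag
# low-likelihood samples as AI-generated.
threshold = np.percentile(gmm.score_samples(photo_features), 5)
test_features = rng.normal(3.0, 1.0, size=(10, 16))
is_ai_generated = gmm.score_samples(test_features) < threshold
print(is_ai_generated)
```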
Poster
Zhong-Yu Li · Ruoyi Du · Juncheng Yan · Le Zhuo · Zhen Li · Peng Gao · Zhanyu Ma · Ming-Ming Cheng
[ Exhibit Hall I ]
Abstract
Recent advances in diffusion models have significantly advanced image generation; however, existing models remain task-specific, limiting their efficiency and generalizability. While universal models attempt to address these limitations, they face critical challenges, including generalizable instruction design, appropriate task distributions, and unified architectural design. In this work, we propose VisualCloze, a universal image generation framework, to tackle these challenges. Unlike existing methods that rely on language-based task descriptions, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and knowledge transfer. Furthermore, we uncover an intrinsic alignment between image infilling and in-context learning, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying their architectures. Experiments demonstrate that VisualCloze achieves strong performance across more than 100 in-domain tasks while generalizing to unseen tasks in few-shot and zero-shot settings.
Poster
Zexi Jia · Chuanwei Huang · Hongyan Fei · Yeshuang Zhu · Zhiqiang Yuan · Ying Deng · Jiapei Zhang · Jinchao Zhang · Jie Zhou
[ Exhibit Hall I ]
Abstract
Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.
Poster
Wenshuo Gao · Xicheng Lan · Shuai Yang
[ Exhibit Hall I ]
Abstract
Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce AnyPortal, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. AnyPortal is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that AnyPortal achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.
Poster
Yefei He · Yuanyu He · Shaoxuan He · Feng Chen · Hong Zhou · Kaipeng Zhang · Bohan Zhuang
[ Exhibit Hall I ]
Abstract
Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet 256$\times$256 and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluated on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4% of the …
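The near-to-far schedule can be illustrated by grouping spatial positions into rings of equal Manhattan distance from the initial token; each ring can then be predicted in one parallel step. The sketch below is 2D-only and omits the temporal axis and the dimension-oriented decoding heads.

```python
import numpy as np

def neighbor_decoding_order(height: int, width: int, start=(0, 0)):
    """Group spatial positions into rings of equal Manhattan distance from the
    initial token; each ring can be predicted in one parallel step."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    dist = np.abs(ys - start[0]) + np.abs(xs - start[1])
    return [list(zip(*np.where(dist == d))) for d in range(int(dist.max()) + 1)]

for step, ring in enumerate(neighbor_decoding_order(4, 4)[:3]):
    print(step, ring)   # step 0: the initial token; step 1: its two neighbors; ...
```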
Poster
Hang Guo · Yawei Li · Taolin Zhang · Jiangshan Wang · Tao Dai · Shu-Tao Xia · Luca Benini
[ Exhibit Hall I ]
Abstract
Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, leading to the complexity and runtime scaling dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step where most tokens have already converged. Leveraging this observation, we develop the cached token pruning strategy that only forwards pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves the efficiency at larger resolutions. Experiments show the proposed FastVAR can further speed up FlashAttention-accelerated VAR by 2.7x with negligible performance drop of <1%. We further extend FastVAR to zero-shot generation of higher resolution images. In particular, FastVAR can generate one 2K image with 15GB memory footprints in 1.5s on a single NVIDIA 3090 GPU.
Poster
ying ba · Tianyu Zhang · Yalong Bai · Wenyi Mo · Tao Liang · Bing Su · Ji-Rong Wen
[ Exhibit Hall I ]
Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality, aiming to enhance image quality in aspects such as aesthetics and detail refinement while maintaining the achieved text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10\% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical foundation and empirical support for the evolution of image generation technology toward better alignment with higher-order human aesthetic preferences.
Poster
Yian Zhao · rushi ye · Ruochong Zheng · Zesen Cheng · Chaoran Feng · Jiashu Yang · Pengchong Qiao · Chang Liu · Jie Chen
[ Exhibit Hall I ]
Abstract
3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed Tune-Your-Style, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized …
Poster
Tiancheng SHEN · Jun Hao Liew · Zilong Huang · Xiangtai Li · Zhijie Lin · Jiyang Liu · Yitong Wang · Jiashi Feng · Ming-Hsuan Yang
[ Exhibit Hall I ]
Abstract
Multimodal Diffusion Transformers (MM-DiTs) have recently emerged as a powerful framework for unified text-vision synthesis, surpassing traditional U-Net architectures in generative tasks. One key innovation lies in their Multimodal Self-Attention (MM-SA) interaction, where image and text tokens are concatenated and processed via self-attention. However, this mechanism poses significant challenges for editing, rendering conventional U-Net-based attention manipulation methods ineffective. To address this limitation, we propose QK-Edit, a training-free framework that exploits the unique attention dynamics of MM-DiTs for precise text-guided image and video editing. By introducing a novel query-key manipulation strategy, our method isolates and adjusts critical attention components to achieve an optimal balance between prompt fidelity and structural consistency. This enables seamless edits across various tasks, including object addition, object removal, object replacement, changing background, changing material, changing color, and style transformation. Notably, it can be easily implemented with feature replacement in inference. QK-Edit demonstrates superior editing performance on state-of-the-art models, such as FLUX and HunyuanVideo, effectively bridging the gap between generative power and editable flexibility in MM-DiTs, and paving the way for scalable multimodal content creation. The code will be made publicly available.
Poster
Gang Dai · Yifan Zhang · Yutao Qin · Qiangya Guo · Shuangping Huang · Shuicheng YAN
[ Exhibit Hall I ]
Abstract
Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text line emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns—encompassing both intra- and inter-word relationships—and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text-lines, particularly in style reproduction and content preservation. Our source code will be made publicly available.
Poster
hongji yang · Wencheng Han · Yucheng Zhou · Jianbing Shen
[ Exhibit Hall I ]
Abstract
In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and precisely controllable framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system that integrates distinct elements, contents, and layouts. This enables users to mix these individual conditions with greater flexibility, leading to more efficient and accurate image generation control. Previous ControlNet-based models rely solely on global conditions, which affect the entire image and lack the ability of element- or region-specific control. This limitation reduces flexibility and can cause condition misunderstandings in multi-conditional image generation. To address these challenges, we propose both intra-element and inter-element Controllers in DC-ControlNet. The Intra-Element Controller handles different types of control signals within individual elements, accurately describing the content and layout characteristics of the object. For interactions between elements, we introduce the Inter-Element Controller, which accurately handles multi-element interactions and occlusion based on user-defined relationships. Extensive evaluations show that DC-ControlNet significantly outperforms existing ControlNet models and Layout-to-Image generative models in terms of control flexibility and precision in multi-condition control.
Poster
Ruotong Wang · Mingli Zhu · Jiarong Ou · Rui Chen · Xin Tao · Pengfei Wan · Baoyuan Wu
[ Exhibit Hall I ]
Abstract
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and …
Poster
Hailong Guo · Bohan Zeng · Yiren Song · Wentao Zhang · Jiaming Liu · Chuang Zhang
[ Exhibit Hall I ]
Abstract
Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person’s image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with …
Poster
Wenzhuang Wang · Yifan Zhao · Mingcan Ma · Ming Liu · Zhonglin Jiang · Yong Chen · Jia Li
[ Exhibit Hall I ]
Abstract
Layout-to-image (L2I) generation has exhibited promising results in natural image generation, but existing methods face challenges and even fail when applied to degraded scenarios (\ie, low-light, underwater). This is primarily attributed to the ``contextual illusion dilemma'' within degraded contexts, where foreground instances are overwhelmed by context-dominant frequency distributions. Motivated by this, our paper proposes a new Frequency-Inspired Contextual Disentanglement Generative (FICGen) paradigm, which seeks to transfer frequency-aware knowledge (\ie, edges, textures) into the latent diffusion space, thereby better rendering the degraded instances via frequency-aware guidance. To be specific, FICGen consists of two major steps. First, we introduce a learnable dual-query mechanism, each query paired with an individual frequency resampler, to perceive contextual frequency prototypes disentangled from degraded images. Subsequently, a visual-frequency enhanced attention is employed to incorporate the frequency knowledge within these prototypes into the degraded instance generation process. Second, to alleviate attribute leakage and compensate for sample loss in dense and small objects, we propose an instance coherence map to regulate instance isolation, coupled with an adaptive spatial-frequency aggregation module to merge them in a spatial-frequency mixed manner. Extensive quantitative and qualitative experiments against L2I methods on four benchmarks illustrate the superior quality and trainability of FICGen under diverse degradation circumstances.
Poster
Jathushan Rajasegaran · Ilija Radosavovic · Rahul Ravishankar · Yossi Gandelsman · Christoph Feichtenhofer · Jitendra Malik
[ Exhibit Hall I ]
Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate.
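The next-token training objective described above is straightforward to prototype. The following is a minimal sketch, not the Toto implementation: the tokenizer, vocabulary size, model size, and sequence layout are all assumptions, and a tiny causal Transformer stands in for the actual architecture.

```python
# Minimal sketch of autoregressive next-token prediction over video tokens.
# All hyperparameters and the toy model below are assumptions, not Toto's.
import torch
import torch.nn as nn

class TinyVideoAR(nn.Module):
    def __init__(self, vocab_size=8192, dim=512, depth=6, heads=8, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                        # tokens: (B, T) int64
        T = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=causal))  # (B, T, vocab) logits

model = TinyVideoAR()
tokens = torch.randint(0, 8192, (2, 256))             # stand-in for tokenized frames
logits = model(tokens[:, :-1])                         # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
```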
Poster
Etai Sella · Noam Atia · Ron Mokady · Hadar Averbuch-Elor
[ Exhibit Hall I ]
Abstract
Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape's identity. Furthermore, to encourage identity preservation also within the locally edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and adherence to the textual description. We will release our code and trained models.
Poster
Hanshen Zhu · Zhen Zhu · Kaile Zhang · Yiming Gong · Yuliang Liu · Xiang Bai
[ Exhibit Hall I ]
Abstract
We tackle the problem of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, which proves difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. We will release our codes and benchmark when the paper becomes publicly available.
Poster
Seunghyun Shin · Dongmin Shin · Jisu Shin · Hae-Gon Jeon · Joon-Young Lee
[ Exhibit Hall I ]
Abstract
Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is to explicitly generate a look-up table (LUT) for color attribute alignment between reference scenes and the input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes, such as look, mood, and emotion, should be similar to those of the input video. Our LUT-based approach allows for color grading without any loss of structural details across whole video frames while also achieving fast inference. We further build a pipeline to incorporate user preferences via text prompts for low-level feature enhancement such as contrast and brightness. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. To validate its robustness, we provide our source code and video demo as supplementary materials.
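Independently of how the LUT is predicted, applying a 3D LUT to video frames is a standard operation. The sketch below is a generic trilinear LUT lookup via `torch.nn.functional.grid_sample`; the LUT layout and size are assumptions, and it is not the paper's diffusion-based LUT generator.

```python
# Generic 3D LUT application to RGB frames; layout and LUT size are assumptions.
import torch
import torch.nn.functional as F

def apply_lut(frames, lut):
    """frames: (N, 3, H, W) RGB in [0, 1]; lut: (3, S, S, S) indexed as
    lut[:, b, g, r] -> output RGB. Returns graded frames of shape (N, 3, H, W)."""
    n = frames.shape[0]
    # grid_sample's (x, y, z) coordinates address (W, H, D), so x <- r, y <- g, z <- b
    # given the (b, g, r) volume layout above.
    r, g, b = frames[:, 0], frames[:, 1], frames[:, 2]
    grid = torch.stack([r, g, b], dim=-1) * 2 - 1          # (N, H, W, 3) in [-1, 1]
    grid = grid.unsqueeze(1)                                # (N, 1, H, W, 3)
    vol = lut.unsqueeze(0).expand(n, -1, -1, -1, -1)        # (N, 3, S, S, S)
    out = F.grid_sample(vol, grid, mode="bilinear", align_corners=True)
    return out.squeeze(2)                                   # (N, 3, H, W)

# Sanity check with an identity LUT: grading should leave frames unchanged.
S = 17
coords = torch.linspace(0, 1, S)
b, g, r = torch.meshgrid(coords, coords, coords, indexing="ij")
identity_lut = torch.stack([r, g, b], dim=0)                # (3, S, S, S)
frames = torch.rand(2, 3, 64, 64)
assert torch.allclose(apply_lut(frames, identity_lut), frames, atol=1e-4)
```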
Poster
Do Dat · Nam Hyeon-Woo · Po-Yuan Mao · Tae-Hyun Oh
[ Exhibit Hall I ]
Abstract
Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approach outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better human-evaluated image quality and greater robustness as the number of binding pairs in the prompt scales.
Poster
Junfei Xiao · Feng Cheng · Lu Qi · Liangke Gui · Yang Zhao · Shanchuan Lin · Jiepeng Cen · Zhibei Ma · Alan Yuille · Lu Jiang
[ Exhibit Hall I ]
Abstract
Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Codes and data will be made publicly available.
Poster
Jiwoo Chung · Sangeek Hyun · Hyunjun Kim · Eunseo Koh · Minkyu Lee · Jae-Pil Heo
[ Exhibit Hall I ]
Abstract
Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pre-trained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual Auto-Regressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naively fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on subject generation than the later stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions, promoting the model to focus on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
Poster
En Ci · Shanyan Guan · Yanhao Ge · Yilin Zhang · Wei Li · Zhenyu Zhang · Jian Yang · Ying Tai
[ Exhibit Hall I ]
Abstract
Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based methods introduce reconstruction errors and inefficiencies, while instruction-based models suffer from limited datasets, architectural constraints, and high computational costs. We propose DescriptiveEdit, a description-driven editing framework that preserves the generative power of pre-trained T2I models without architectural modifications or inversion. A Cross-Attentive UNet with an attention bridge enables direct feature fusion, while LoRA-based tuning ensures efficiency and compatibility. Without retraining, DescriptiveEdit seamlessly integrates with ControlNet, IP-Adapter, and other extensions. Experiments show it improves editing accuracy and consistency while significantly reducing computational costs, providing a scalable and flexible solution for text-guided image manipulation.
Poster
Zheng Gao · Jifei Song · Zhensong Zhang · Jiankang Deng · Ioannis Patras
[ Exhibit Hall I ]
Abstract
Current training-free text-driven image translation primarily uses diffusion features (convolution and attention) of a pre-trained model as guidance to preserve the style/structure of the source image in the translated image. However, this coarse feature-level guidance struggles with style (e.g., visual patterns) and structure (e.g., edges) alignment with the source. Based on the observation that the low-/high-frequency components retain the style/structure information of an image, in this work, we propose training-free Frequency-Guided Diffusion (FGD), which tailors low-/high-frequency guidance for style- and structure-guided translation, respectively. For low-frequency guidance (style-guided), we substitute the low-frequency components of the diffusion latents from the sampling process with those from the inversion of the source and normalize the obtained latent with a composited spectrum to enforce color alignment. For high-frequency guidance (structure-guided), we propose high-frequency alignment and high-frequency injection, which complement each other. High-frequency alignment preserves edges and contours by adjusting the predicted noise with a guidance function that aligns high-frequency image regions between the sampling and source images. High-frequency injection facilitates layout preservation by injecting high-frequency components of diffusion convolution features (from inversion) into the sampling process. Qualitative and quantitative results verify the superiority of our method on style- and structure-guided translation tasks. We make the code publicly available at: withheld during review.
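The low-frequency substitution step lends itself to a compact illustration. Below is a minimal sketch of swapping the low-frequency band of a sampling-time latent with that of the corresponding inversion latent using a simple square frequency mask; the mask shape, radius, and the omission of the spectrum normalization step are simplifications, not the paper's exact procedure.

```python
# Hedged sketch: take the low-frequency band from the inversion latent and the
# high-frequency band from the sampling latent. The square mask is an assumption.
import torch

def swap_low_frequency(sample_latent, inversion_latent, radius=0.25):
    B, C, H, W = sample_latent.shape
    fs = torch.fft.fftshift(torch.fft.fft2(sample_latent), dim=(-2, -1))
    fi = torch.fft.fftshift(torch.fft.fft2(inversion_latent), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    low = ((yy.abs() <= radius * H / 2) & (xx.abs() <= radius * W / 2)).float()
    low = low.to(sample_latent.device)
    fused = fi * low + fs * (1.0 - low)                  # low band from inversion
    return torch.fft.ifft2(torch.fft.ifftshift(fused, dim=(-2, -1))).real

# toy usage with random tensors standing in for diffusion latents
z_sampling = torch.randn(1, 4, 64, 64)
z_inversion = torch.randn(1, 4, 64, 64)
z_guided = swap_low_frequency(z_sampling, z_inversion)
```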
Poster
Ming Li · Xin Gu · Fan Chen · Xiaoying Xing · Longyin Wen · Chen Chen · Sijie Zhu
[ Exhibit Hall I ]
Abstract
Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a …
Poster
Rongkun Xue · Jinouwen Zhang · Yazhe Niu · Dazhong Shen · Bingqi Ma · Yu Liu · Jing Yang
[ Exhibit Hall I ]
Abstract
Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous generation model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. This framework enables the flexible selection of feature hierarchies tailored to specific downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative-model-based methods, including 78% top-1 accuracy on ImageNet at a resolution of 64×64. Extensive ablation studies, including out-of-distribution evaluations, further validate the effectiveness of our approach.
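The core recipe, running a pretrained continuous-time generator backwards and reading off intermediate states as features, can be sketched in a few lines. The velocity network below is an untrained stand-in, and the Euler schedule, step count, and direction convention are assumptions rather than the paper's configuration.

```python
# Hedged sketch of extracting features by reversing a generation ODE.
# `velocity_net` is a randomly initialized stand-in for a pretrained model.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(784 + 1, 512), nn.SiLU(), nn.Linear(512, 784))

@torch.no_grad()
def reverse_features(x, steps=8):
    """Integrate the generation ODE in reverse (data -> noise) with Euler steps
    and return the intermediate states as a feature hierarchy."""
    feats = []
    h = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * h)
        v = velocity_net(torch.cat([x, t], dim=1))     # predicted dx/dt at (x, t)
        x = x - h * v                                   # step toward the noise end
        feats.append(x.clone())
    return feats                                        # choose a level per downstream task

images = torch.randn(4, 784)                            # flattened toy inputs
features = reverse_features(images)
print(len(features), features[0].shape)
```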
Poster
Naifu Xue · Zhaoyang Jia · Jiahao Li · Bin Li · Yuan Zhang · Yan Lu
[ Exhibit Hall I ]
Abstract
Recent studies in extreme image compression have achieved remarkable performance by compressing the tokens from generative tokenizers. However, these methods often prioritize clustering common semantics within the dataset, while overlooking the diverse details of individual objects. Consequently, this results in suboptimal reconstruction fidelity, especially at low bitrates. To address this issue, we introduce a Dual-generative Latent Fusion (DLF) paradigm. DLF decomposes the latent into semantic and detail elements, compressing them through two distinct branches. The semantic branch clusters high-level information into compact tokens, while the detail branch encodes perceptually critical details to enhance the overall fidelity. Additionally, we propose a cross-branch interactive design to reduce redundancy between the two branches, thereby minimizing the overall bit cost. Experimental results demonstrate the impressive reconstruction quality of DLF even below 0.01 bits per pixel (bpp). On the CLIC2020 test set, our method achieves bitrate savings of up to 27.93% on LPIPS and 53.55% on DISTS compared to MS-ILLM. Furthermore, DLF surpasses recent diffusion-based codecs in visual fidelity while maintaining a comparable level of generative realism. Code will be available later.
Poster
Rui Tian · Qi Dai · Jianmin Bao · Kai Qiu · Yifan Yang · Chong Luo · Zuxuan Wu · Yu-Gang Jiang
[ Exhibit Hall I ]
Abstract
Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. Towards this goal, we design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images. This magic Reducio charm enables a 64$\times$ reduction of latents compared to a common 2D VAE, without sacrificing quality. Building upon Reducio-VAE, we can train diffusion models for high-resolution video generation efficiently. Specifically, we adopt a two-stage generation paradigm, first generating a condition image via text-to-image generation, followed by text-image-to-video generation with the proposed Reducio-DiT. Extensive experiments show that our model achieves strong performance in evaluation. More importantly, our method significantly boosts the training and inference efficiency of video LDMs. Reducio-DiT is trained in just 3.2K A100 GPU hours in total and can generate a 16-frame 1024$\times$1024 video clip within 15.5 seconds on a single A100 GPU.
Poster
Yichen Lu · Siwei Nie · Minlong Lu · Xudong Yang · Xiaobo Zhang · Peng Zhang
[ Exhibit Hall I ]
Abstract
Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace, a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7\% $\mu$AP / 83.9\% RP90 for the matcher, 72.6\% $\mu$AP / 68.4\% RP90 for the descriptor on the DISC21 dataset) but also better interpretability than existing methods. Code is available.
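The idea of regularizing patch affinity with overlap ratios can be illustrated with a soft-target InfoNCE variant. The sketch below is a generic formulation under assumed inputs (patch descriptors from the two views and a precomputed overlap matrix); it is not the paper's exact CopyNCE loss.

```python
# Generic soft-target contrastive loss weighted by patch overlap ratios.
import torch
import torch.nn.functional as F

def copynce_like_loss(desc_a, desc_b, overlap, temperature=0.07):
    """desc_a, desc_b: (N, D) patch descriptors from the two views;
    overlap: (N, N) overlap ratios in [0, 1] used as soft affinity targets."""
    logits = desc_a @ desc_b.t() / temperature            # patch-to-patch similarity
    target = overlap / overlap.sum(dim=1, keepdim=True).clamp_min(1e-8)
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()

# toy usage: 16 patches, 128-d descriptors, mostly diagonal overlap
a = F.normalize(torch.randn(16, 128), dim=1)
b = F.normalize(torch.randn(16, 128), dim=1)
overlap = torch.eye(16) * 0.8 + torch.rand(16, 16) * 0.05
loss = copynce_like_loss(a, b, overlap)
```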
Poster
Haodong Jing · Dongyao Jiang · Yongqiang Ma · Haibo Hua · Bo Huang · Nanning Zheng
[ Exhibit Hall I ]
Abstract
Decoding visual information from fMRI signals is an important pathway to understanding how the brain represents the world, and is a cutting-edge field of artificial general intelligence. Decoding fMRI should not be limited to reconstructing visual stimuli, but should also extend to transforming them into descriptions, creating actions, and even generating unseen content. We purposefully propose a novel and efficient brain multimodal architecture, NeuroCreat, which combines the powerful visual and textual abilities of LLMs to capture fine-grained semantic information from fMRI and transform it into an embodied implementation of different neural representations. Specifically, we innovatively designed a brain expert adaptation (BEA) module, effectively capturing commonalities and individual differences among subjects through the collaborative learning of shared/routed experts. Inspired by human visual working memory, we extracted ``creation'' information from the higher visual cortex for idea generation. We further constructed a prompt variant alignment module, which seamlessly integrates fMRI-visual-semantic-creation into the LLM to achieve flexible incorporation of different semantics in the decoding of neural representations. Experiments on different fMRI datasets show that NeuroCreat achieves SOTA performance on multiple brain decoding tasks. More importantly, we have innovatively achieved few-shot brain video creation, which opens up a new direction for demonstrating the brain's `imaginative' ability.
Poster
Marcos Conde · Zihao Lu · Radu Timofte
[ Exhibit Hall I ]
Abstract
Text-guided image generation and editing is emerging as a fundamental problem in computer vision. However, most approaches lack control, and the generated results are far from professional photography quality standards. In this work, we propose the first approach that introduces language and explicit control into the image processing and editing pipeline. PixTalk is a vision-language multi-task image processing model guided by text instructions. Our method is able to perform over 40 transformations, covering the most popular techniques in photography, and delivers results on par with professional photo-editing software. Our model can process 12MP images on consumer GPUs in real time (under 1 second). As part of this effort, we propose a novel dataset and benchmark for new research on multi-modal image processing and editing.
Poster
KA WONG · Jicheng Zhou · Haiwei Wu · Yain-Whar Si · Jiantao Zhou
[ Exhibit Hall I ]
Abstract
The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for reliable forgery detection. Though forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document backgrounds and structured texts. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces' sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. Also, a hierarchical content disentanglement approach is proposed to boost the localization performance via mitigating the text-background disparities. Furthermore, noticing the predominantly pristine nature of background regions, we construct a pristine prototype capturing traces of untampered regions, and eventually enhance both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79\% averaged over 5 types of distortions. The code is available in supplementary …
Poster
Wenkui Yang · Jie Cao · Junxian Duan · Ran He
[ Exhibit Hall I ]
Abstract
Diffusion models like Stable Diffusion have become prominent in visual synthesis tasks due to their powerful customization capabilities. However, these capabilities also introduce significant security risks, such as deepfakes and copyright infringement. To mitigate these risks, a class of methods known as protective perturbation emerged, which prevents image misuse by injecting imperceptible adversarial noise. On the other hand, purification methods can effectively remove the protective perturbation, thereby exposing images again to the risk of malicious forgery. In this work, we formalize the anti-purification task, highlighting the challenges that existing approaches cannot address properly, and propose a solution named **AntiPure**. AntiPure is robust against the ``purification-customization'' workflow, owing to two types of proposed guidance: 1) Patch-wise Frequency Guidance, which reduces the model's influence over high-frequency components in the purified image, and 2) Erroneous Timestep Guidance, which disrupts the model's denoising strategy across different timesteps. With this additional guidance, AntiPure embeds imperceptible perturbation patterns resistant to purification, achieving effective output distortion after customization. Experiments show that our approach achieves minimal perceptual discrepancy, maximal distortion, and robust performance, outperforming current protective perturbation methods within the purification-customization workflow.
Poster
Xiaoyi Feng · Tao Huang · Peng Wang · Zizhou Huang · Haihang Zhang · Yuntao Zou · Dagang Li · Kaifeng Zou
[ Exhibit Hall I ]
Abstract
Line drawing colorization is a critical step in the cel-animation industry, where artists use a paint bucket tool to apply RGB values to segments based on a character’s color design sheet. Current automated methods predominantly focus on consecutive frame colorization, using a single adjacent frame as a reference. These approaches often face two major challenges: inaccurate segment colorization due to significant deformations between the target and reference frames, and incomplete information in a single frame that prevents finding suitable reference segments, leading to poor color accuracy. To address these challenges, we propose a novel colorization framework that integrates both temporal and structural information. Using multiple reference keyframes, our method effectively captures temporal information across frames, enhancing the accuracy of colorization for transitional frames. In addition, we leverage structural information through a matching-based approach that ensures precise segment alignment across frames. This combination of temporal awareness through multi-frame references and structural alignment improves colorization robustness, even in scenarios with large motion and deformations. Our method outperforms existing techniques, demonstrating superior colorization accuracy and consistency in industrial cel-animation workflows.
Poster
Jiawei Wang · Zhiming Cui · Changjian Li
[ Exhibit Hall I ]
Abstract
This paper presents VQ-SGen, a novel algorithm for high-quality creative sketch generation. Recent approaches have framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in stage one, we decouple each stroke's shape and location information to ensure the VQ representation prioritizes stroke shape learning. In stage two, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as text or class label conditioned generation and sketch completion. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques on the CreativeSketch dataset, underscoring its effectiveness. The code and model will be made publicly available upon publication.
Poster
Bo Zhao · Haoran Wang · Jinghui Wang · Hanzhang Wang · Huan Yang · Wei Ji · Hao Liu · Xinyan Xiao
[ Exhibit Hall I ]
Abstract
In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to significantly increased failure rates when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution paradigm for content-aware layout GenerAtion. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, a refining module is leveraged to perform fine-level reasoning over the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the module to enhance its layout planning ability. Moreover, we present a new large-scale poster dataset, namely BIG-Poster, with rich meta-information annotations. We conduct extensive experiments and obtain remarkable state-of-the-art performance improvements on multiple benchmark datasets.
Poster
Chen Zhennan · Yajie Li · Haofan Wang · Zhibo Chen · Zhengkai Jiang · Jun Li · Qian Wang · Jian Yang · Ying Tai
[ Exhibit Hall I ]
Abstract
Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus only applicable to specific models, or manipulate score maps within attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAGD, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAGD decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures each regional prompt is properly executed, and the overall detail refinement (Regional Soft Refinement) over regions, which dismisses visual boundaries and enhances adjacent interactions. Furthermore, RAGD makes repainting feasible, where users can modify specific unsatisfactory regions from the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement of the prompt-following property. Quantitative and qualitative experiments demonstrate that RAGD achieves superior performance in attribute binding and object relationships compared with previous methods.
Poster
Yuyan Chen · Yifan Jiang · Li Zhou · Jinghan Cao · Yu Guan · Ming Yang · Qingpei Guo
[ Exhibit Hall I ]
Abstract
In recent years, multi-modal large language models (MLLMs) have been successfully adopted to generate humorous and engaging descriptions for internet memes. However, it is challenging to apply the same approaches to ordinary images, which lack inherently funny or exaggerated content. Thus, crafting appealing descriptions for ordinary images demands imaginative effort to discover or create intriguing connections between words and image contents. To address this gap, we introduce AppealImage, a large-scale dataset consisting of ordinary images paired with appealing descriptions. AppealImage allows us to define four distinct tasks with quantitative metrics to enable objective evaluation. Subsequently, we propose CharmNet, an innovative framework designed to generate appealing descriptions for ordinary images. CharmNet combines instruction tuning with heuristic active learning, guided by a referee model. Experimental results demonstrate that CharmNet outperforms the state-of-the-art method by 11.4\% in generating appealing descriptions. Furthermore, CharmNet delivers impressive performance across various creative applications, including visual storytelling and situational dialogue generation. These results highlight CharmNet's potential to enhance social media engagement and to empower strong brand presence in competitive markets.
Poster
Xinyu Hou · Zongsheng Yue · Xiaoming Li · Chen Change Loy
[ Exhibit Hall I ]
Abstract
In this work, we show that we only need **a single parameter $\omega$** to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model’s reverse process. This simple approach does not require model retraining or architectural modifications and incurs negligible computational overhead, yet enables precise control over the level of details in the generated outputs. Moreover, spatial masks or denoising schedules with varying $\omega$ values can be applied to achieve region-specific or timestep-specific granularity control. External control signals or reference images can guide the creation of precise $\omega$ masks, allowing targeted granularity adjustments. Despite its simplicity, the method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code will be made publicly available.
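The abstract does not specify exactly where $\omega$ enters the reverse process, so the snippet below is only one plausible reading: a DDIM-style step in which $\omega$ (a scalar or a spatial mask) scales the stochastic noise injected at each step, and hence the amount of fine detail. Every constant in it is a hypothetical choice, not the paper's formulation.

```python
# Hedged sketch: a granularity knob scaling per-step stochasticity in a
# DDIM-like reverse step. This is an illustrative guess, not the paper's rule.
import torch

def reverse_step_with_omega(x_t, eps_pred, alpha_t, alpha_prev, omega=1.0):
    """`omega` may be a scalar or a spatial mask broadcastable to x_t."""
    x0_pred = (x_t - (1 - alpha_t).sqrt() * eps_pred) / alpha_t.sqrt()
    sigma = omega * (1 - alpha_prev).sqrt() * 0.1           # hypothetical scaling
    dir_xt = (1 - alpha_prev - sigma ** 2).clamp_min(0).sqrt() * eps_pred
    return alpha_prev.sqrt() * x0_pred + dir_xt + sigma * torch.randn_like(x_t)

x_t = torch.randn(1, 4, 64, 64)
eps = torch.randn_like(x_t)                                  # stand-in for model output
x_prev = reverse_step_with_omega(x_t, eps, torch.tensor(0.5),
                                 torch.tensor(0.6), omega=0.5)
```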
Poster
Chen Liu · Tobias Ritschel
[ Exhibit Hall I ]
Abstract
We propose a novel generative video model that robustly learns temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective combining two aspects. The first is to map directly from past to future video frames; previous work has mapped noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame instead of noise is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as a joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a baseline conditional diffusion model but with higher speed, i.e., fewer ODE solver steps.
Poster
Moayed Haji-Ali · Willi Menapace · Aliaksandr Siarohin · Ivan Skorokhodov · Alper Canberk · Kwot Sin Lee · Vicente Ordonez · Sergey Tulyakov
[ Exhibit Hall I ]
Abstract
We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive automatic and subjective evaluations demonstrate that our method achieves substantial improvement in audio-video synchronization, outperforming more expensive baselines such as the MovieGen video-to-audio model.
Poster
Jiaqi Han · Haotian Ye · Puheng Li · Minkai Xu · James Zou · Stefano Ermon
[ Exhibit Hall I ]
Abstract
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.
Poster
Chieh-Yun Chen · Min Shi · Gong Zhang · Humphrey Shi
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability or often necessitate additional training, restricting their generalization ability. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (multi-modal) large language models (LLMs) to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17\% at only …
Poster
Hengjia Li · Haonan Qiu · Shiwei Zhang · Xiang Wang · Yujie Wei · Zekun Li · Yingya Zhang · Boxi Wu · Deng Cai
[ Exhibit Hall I ]
Abstract
Current text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but identity-specific human video generation with customized ID images remains under-explored. The key challenge lies in consistently maintaining high ID fidelity while preserving the original motion dynamics and semantic following after the identity injection. Current video identity customization methods mainly rely on reconstructing given identity images on text-to-image models, whose distribution diverges from that of the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed $\textbf{PersonalVideo}$, that applies a mixture of reward supervision on synthesized videos instead of a simple reconstruction objective on images. Specifically, we first incorporate an identity consistency reward to effectively inject the reference's identity without the tuning-inference gap. Then we propose a novel semantic consistency reward to align the semantic distribution of the generated videos with that of the original T2V model, which preserves its dynamic and semantic-following capability during the identity injection. With the non-reconstructive reward training, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, achieving good robustness even with only a single reference image. Extensive experiments …
Poster
Zhixiang Guo · Siyuan Liang · Aishan Liu · Dacheng Tao
[ Exhibit Hall I ]
Abstract
Diffusion models have attracted significant attention due to their exceptional data generation capabilities in fields such as image synthesis. However, recent studies have shown that diffusion models are vulnerable to copyright infringement attacks, where attackers inject strategically modified non-infringing images into the training set, inducing the model to generate infringing content under the prompt of specific poisoned captions. To address this issue, we propose a defense framework, CopyrightShield, to defend against the above attack. Specifically, we analyze the memorization mechanism of diffusion models and find that attacks exploit the model's overfitting to specific spatial positions and prompts, causing it to reproduce poisoned samples under backdoor triggers. Based on this, we propose a poisoned sample detection method using spatial masking and data attribution to quantify poisoning risk and accurately identify hidden backdoor samples. To further mitigate memorization of poisoned features, we introduce an adaptive optimization strategy that integrates a dynamic penalty term into the training loss, reducing reliance on infringing features while preserving generative performance. Experimental results demonstrate that CopyrightShield significantly improves poisoned sample detection performance across two attack scenarios, achieving average F1-scores of 0.665, delaying the First-Attack Epoch (FAE) by 115.2% and decreasing the Copyright Infringement Rate (CIR) …
Poster
Zhanzhou Feng · Qingpei Guo · Xinyu Xiao · Ruihan Xu · Ming Yang · Shiliang Zhang
[ Exhibit Hall I ]
Abstract
Existing video generation strategies fall into two categories, i.e., diffusion and autoregressive (AR) methods. While AR methods achieve high efficiency by predicting the next token based on known visual cues, they generally fall short of diffusion models in terms of video quality. To bridge this gap, this paper introduces a novel continuous-domain next-set prediction strategy. Our approach groups related tokens to be generated into a single set and simultaneously predicts their probability distributions, thereby better exploiting their spatial and temporal dependencies. Specifically, we propose two token partitioning strategies: Spatial Progressive Partitioning for image tokens and Temporal Next-Frame Partitioning for video tokens. Additionally, we construct a denoising sampler to generate outputs from the token set distribution within a continuous domain. This method unifies image and video generation under a cohesive next-set prediction framework. Experimental results indicate that our method achieves video quality comparable to recent diffusion models while significantly reducing inference costs. Notably, our method surpasses the recent next-token prediction approach Emu3 in video quality despite using approximately 90\% fewer parameters. Visualizations further confirm the effectiveness of our method in capturing intricate details and movements, such as water droplets and facial expressions. All implementations will be released.
Poster
Jeremy Styborski · Mingzhi Lyu · Jiayou Lu · Nupur Kapur · Adams Kong
[ Exhibit Hall I ]
Abstract
Poisoning attacks pose significant challenges to the robustness of diffusion models (DMs). In this paper, we systematically analyze when and where poisoning affects textual inversion, a widely used personalization technique for DMs. We first introduce Semantic Sensitivity Maps (SSM), a novel method for visualizing the influence of poisoning on text embeddings. Second, we identify and experimentally verify that DMs exhibit non-uniform learning behavior across timesteps, focusing on lower-noise samples. Poisoning attacks inherit this bias and inject adversarial signals predominantly at lower timesteps. Third, we find that adversarial signals distract DM learning away from relevant regions within training data, ultimately degrading textual inversion quality. Based on these insights, we propose Safe-Zone Training (SZT), a novel defense mechanism comprised of 3 key components: (1) JPEG compression to weaken high-frequency poison signals, (2) restriction to higher timesteps during textual inversion training to avoid adversarial signals at lower timesteps, and (3) loss masking to constrain learning to relevant regions. Extensive experiments across multiple poisoning methods demonstrate that SZT significantly enhances the robustness of textual inversion against all poisoning attacks, improving average DINOv2 similarity across poisons to 0.43, compared to prior published defenses at 0.26. We will publish code and datasets upon acceptance.
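The three components of SZT map onto short, independent pieces of a training loop. The snippet below sketches them under assumptions: the JPEG quality, the timestep threshold, and the toy mask are illustrative, and the loss is a generic masked epsilon-prediction MSE rather than the paper's exact objective.

```python
# Hedged sketch of the three SZT-style components; constants are illustrative only.
import io
import numpy as np
import torch
from PIL import Image

def jpeg_roundtrip(img_uint8, quality=75):
    """(1) Weaken high-frequency poison signals via JPEG re-compression."""
    buf = io.BytesIO()
    Image.fromarray(img_uint8).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

def masked_eps_loss(eps_pred, eps_true, relevance_mask):
    """(3) Constrain learning to relevant regions with an element-wise mask."""
    se = (eps_pred - eps_true) ** 2 * relevance_mask
    return se.sum() / relevance_mask.sum().clamp_min(1.0)

# (2) Restrict textual-inversion training to higher (noisier) timesteps.
T, t_min = 1000, 500                                    # threshold is an assumption
t = torch.randint(t_min, T, (8,))

# toy usage with random stand-ins for data, model outputs, and targets
clean = jpeg_roundtrip(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
eps_pred = torch.randn(8, 4, 64, 64)
eps_true = torch.randn_like(eps_pred)
mask = (torch.rand_like(eps_pred) > 0.3).float()
loss = masked_eps_loss(eps_pred, eps_true, mask)
```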
Poster
Haitam Ben Yahia · Denis Korzhenkov · Ioannis Lelekas · Amir Ghodrati · Amir Habibian
[ Exhibit Hall I ]
Abstract
Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized image-to-video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we reduce the computational cost by lowering the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, coined MobileVD, can generate latents for a 14 x 512 x 256 px clip in 1.7 seconds on a Xiaomi-14 Pro, with negligible quality loss.
Poster
Goker Erdogan · Nikhil Parthasarathy · Catalin Ionescu · Drew Hudson · Alexander Lerchner · Andrew Zisserman · Mehdi S. M. Sajjadi · Joao Carreira
[ Exhibit Hall I ]
Abstract
We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning, that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule, throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
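The progressive freezing schedule is easy to prototype independently of the MAE specifics. Below is a minimal sketch in which a linear schedule freezes shallow blocks first; the toy block list, the linear schedule, and the step counts are assumptions, not the paper's setup.

```python
# Minimal sketch of depth-ordered progressive layer freezing during training.
import torch.nn as nn

# Toy encoder standing in for a ViT's transformer blocks.
blocks = nn.ModuleList([nn.Linear(256, 256) for _ in range(12)])

def apply_freeze_schedule(step, total_steps, blocks):
    """Freeze shallow blocks first: by `step`, the first k blocks are frozen,
    with k growing linearly over training (one possible explicit schedule)."""
    k = int(len(blocks) * step / total_steps)
    for i, block in enumerate(blocks):
        trainable = i >= k                       # deeper blocks keep training
        for p in block.parameters():
            p.requires_grad_(trainable)
    return k

for step in range(0, 10000, 2500):
    k = apply_freeze_schedule(step, 10000, blocks)
    print(f"step {step}: first {k} blocks frozen")
```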
Poster
Kyle Sargent · Kyle Hsu · Justin Johnson · Li Fei-Fei · Jiajun Wu
[ Exhibit Hall I ]
Abstract
Since the advent of popular visual generation frameworks like VQGAN and Latent Diffusion Models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. In this work, we propose FlowMo, a transformer-based diffusion autoencoder. FlowMo achieves a new state-of-the-art for image tokenization at multiple bitrates. We achieve this without using convolutions, adversarial losses, spatially-aligned 2D latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. We conduct extensive analysis and ablations, and we additionally train generative models atop the FlowMo tokenizer and verify the performance. We will release our code and model checkpoints upon acceptance.
Poster
Zewei Xin · Qinya Li · Chaoyue Niu · Fan Wu · Guihai Chen
[ Exhibit Hall I ]
Abstract
Large text-to-image models demonstrate impressive generation capabilities; however, their substantial size necessitates expensive cloud servers for deployment. Conversely, light-weight models can be deployed on edge devices at lower cost but often with inferior generation quality for complex user prompts. To strike a balance between performance and cost, we propose a routing framework, called RouteT2I, which dynamically selects either the large cloud model or the light-weight edge model for each user prompt. Since generated image quality is challenging to measure and compare directly, RouteT2I establishes multi-dimensional quality metrics, in particular by evaluating the similarity between the generated images and both positive and negative texts that describe each specific quality metric. RouteT2I then predicts the expected quality of the generated images by identifying key tokens in the prompt and comparing their impact on the quality. RouteT2I further introduces the Pareto relative superiority to compare the multi-metric quality of the generated images. Based on this comparison and predefined cost constraints, RouteT2I allocates prompts to either the edge or the cloud. Evaluation reveals that RouteT2I significantly reduces the number of requests to the large cloud model while maintaining high-quality image generation.
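The routing decision itself reduces to a small amount of logic once per-metric quality predictions are available. The sketch below uses a simplified stand-in for the Pareto relative superiority comparison together with a hypothetical superiority threshold and budget; the quality predictor is assumed to exist elsewhere.

```python
# Simplified edge/cloud routing on predicted multi-metric quality; thresholds
# and the superiority measure are illustrative assumptions.
import numpy as np

def pareto_relative_superiority(q_edge, q_cloud):
    """Fraction of quality metrics on which the cloud model is predicted to
    beat the edge model (a simplified stand-in for the paper's comparison)."""
    q_edge, q_cloud = np.asarray(q_edge), np.asarray(q_cloud)
    return (q_cloud > q_edge).mean()

def route_prompt(q_edge, q_cloud, cloud_budget_left, superiority_thresh=0.6):
    """Send the prompt to the cloud only when the predicted gain is large
    enough and the cost budget still allows it."""
    if cloud_budget_left > 0 and \
            pareto_relative_superiority(q_edge, q_cloud) >= superiority_thresh:
        return "cloud"
    return "edge"

# toy usage: five quality metrics predicted from prompt tokens (values assumed)
print(route_prompt([0.6, 0.7, 0.5, 0.8, 0.6],
                   [0.9, 0.8, 0.7, 0.7, 0.9], cloud_budget_left=10))
```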
Poster
Joonghyuk Shin · Alchan Hwang · Yujin Kim · Daneul Kim · Jaesik Park
[ Exhibit Hall I ]
Abstract
Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust prompt-based image editing method for MM-DiT that supports global-to-local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
Poster
Woo Kyoung Han · Yongjun Lee · Byeonghun Lee · Sang Hyun Park · Sunghoon Im · Kyong Hwan Jin
[ Exhibit Hall I ]
Abstract
Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing the source coding protocol. The source code and demo files are provided in the supplementary material.
Poster
Yuxuan Zhang · Yirui Yuan · Yiren Song · Haofan Wang · Jiaming Liu
[ Exhibit Hall I ]
Abstract
Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model’s weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, …
Poster
Chenghu Du · Shengwu Xiong · Yi Rong
[ Exhibit Hall I ]
Abstract
Current virtual try-on methods primarily enhance performance through network optimization, like coarse-to-fine structures and referenceNet for clothing information injection. However, limited sample quantity and diversity restrict their improvement. To overcome this, we present a unified mask-free virtual try-on framework. It utilizes diffusion models' inherent properties to boost each pipeline part's ability to deeply fit the target sample distribution, thus improving performance. On the input side, our proposed text-driven pseudo-input preparation approach increases the diversity of clothing-agnostic regions in person pseudo-samples. This prompts the generator to focus more on variations in these areas and improves the model's generalization ability. At the generator, we propose gated manipulation to prevent weight forgetting and cut training costs, and introduce texture-aware injection to explicitly add human-perceptible clothing texture info. For inference, our proposed refining conditional inference approach reduces Gaussian noise randomness, thus preserving identity information and clothing details in results. Extensive experiments demonstrate our method outperforms previous virtual try-on methods.
Poster
Sixian Zhang · Xinyao Yu · Xinhang Song · Yiyao Wang · Shuqiang Jiang
[ Exhibit Hall I ]
Abstract
Object goal navigation requires an agent to navigate to a specified target in unseen environments without an explicit map, which demands an understanding of object-scene contextual relationships to infer the target's location from partial observations. The function of an object plays a crucial role in its categorization and naming. Analyzing an object's functional role within a given scene enhances the understanding of its contextual relationships, thereby aiding in goal inference. In this paper, we propose the Function-centric Bayesian Network (FBN) for the zero-shot ObjectNav task. FBN is designed to uncover the functions that observed objects afford, individually or collaboratively with other objects, as well as the functional semantics contained within the observed scenes. The probabilistic directed edges in FBN describe the object-function and scene-function relationships, which are derived by prompting LLMs with the proposed CounterfactCoT. CounterfactCoT determines the existence and probability of each edge by guiding LLMs to compare the impact of an edge's existence or absence on the surrounding context. Leveraging FBN with Bayesian inference, the probability of each function group and the probability map of goal occurrence are computed. The waypoint is then selected based on the obtained probability map. Experiments on MP3D and HM3D demonstrate that FBN effectively captures object-scene-function relationships and improves …
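As a rough intuition for the probabilistic aggregation, the toy example below (entirely hypothetical numbers and structure, not the FBN construction) combines LLM-elicited object-function and function-goal probabilities with a noisy-OR rule to score candidate regions.

```python
# P(function | object) elicited by prompting an LLM (hypothetical values).
p_func_given_obj = {
    "sofa":  {"resting": 0.90, "cooking": 0.05},
    "stove": {"resting": 0.05, "cooking": 0.95},
}
# P(goal is nearby | function), also hypothetical, for one target category.
p_goal_given_func = {"resting": 0.7, "cooking": 0.1}

def goal_probability(observed_objects):
    """Noisy-OR aggregation of functional evidence from the observed objects."""
    p_not_goal = 1.0
    for obj in observed_objects:
        for func, p_f in p_func_given_obj.get(obj, {}).items():
            p_not_goal *= 1.0 - p_f * p_goal_given_func[func]
    return 1.0 - p_not_goal

print(goal_probability(["sofa"]))            # region containing a sofa
print(goal_probability(["stove", "sofa"]))   # region with mixed functions
```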
Poster
Zihang Zou · Boqing Gong · Liqiang Wang
[ Exhibit Hall I ]
Abstract
In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can effortlessly replicate copyrighted images, even when protected by advanced watermarking techniques. To expose this vulnerability in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on "anchors and shims", employs inverse latents as anchors and finds shim perturbations that gradually steer the anchor latents away, thereby evading watermark or copyright detection. By applying the perturbation to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, allowing it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Empirical experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.
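The "anchors and shims" search can be caricatured as plain gradient descent on a small perturbation. The sketch below is illustrative only: a stand-in linear "detector" replaces a real watermark or copyright detector, and a random latent replaces an inverse latent.

```python
# Toy sketch (not the paper's code): keep an "anchor" latent fixed and optimize
# a small "shim" perturbation to push a stand-in detector score down while
# staying close to the anchor.
import torch

anchor = torch.randn(1, 4, 64, 64)                 # e.g. an inverse latent (toy)
detector = torch.nn.Sequential(                    # stand-in watermark scorer
    torch.nn.Flatten(), torch.nn.Linear(4 * 64 * 64, 1)
)
shim = torch.zeros_like(anchor, requires_grad=True)
opt = torch.optim.Adam([shim], lr=1e-2)

for step in range(200):
    latent = anchor + shim
    score = detector(latent).mean()                # detection-confidence proxy
    proximity = (shim ** 2).mean()                 # stay near the anchor
    loss = score + 10.0 * proximity
    opt.zero_grad()
    loss.backward()
    opt.step()
```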
Poster
Beier Zhu · Ruoyu Wang · Tong Zhao · Hanwang Zhang · Chi Zhang
[ Exhibit Hall I ]
Abstract
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD-Solver), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of our EPD-Solver in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 5.26 on CIFAR-10, 8.74 on FFHQ, 7.95 on ImageNet, and 7.79 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin.
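The core step can be sketched as follows, under assumptions and with a toy velocity field standing in for a pretrained diffusion model: several derivative evaluations are taken independently (hence parallelizable) and mixed with weights that, in the paper, would be learned by distillation.

```python
# One ODE step that mixes multiple independent derivative evaluations
# (illustrative; the velocity field and weights are placeholders).
import torch

def velocity(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    return -x / (t + 1e-3)          # toy dx/dt, not a real diffusion model

def epd_style_step(x, t, t_next, offsets, weights):
    """Probe the field at a few intermediate times and mix the directions."""
    h = t_next - t
    dirs = [velocity(x, t + a * h) for a in offsets]   # independent, parallelizable
    mixed = sum(w * d for w, d in zip(weights, dirs))
    return x + h * mixed

x = torch.randn(1, 3, 32, 32)
t, t_next = torch.tensor(1.0), torch.tensor(0.8)
offsets = torch.tensor([0.0, 0.5, 1.0])        # where to probe the field (assumed)
weights = torch.tensor([0.25, 0.5, 0.25])      # learned via distillation in the paper
x_next = epd_style_step(x, t, t_next, offsets, weights)
```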
Poster
Yu Lei · Bingde Liu · Qingsong Xie · Haonan Lu · Zhijie Deng
[ Exhibit Hall I ]
Abstract
Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite its theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatch between the LoRA and 3D distributions in practical implementations. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state and hence yields more reasonable corrections. Nevertheless, naive lookahead VSD may suffer from unstable training in practice due to potential over-fitting. To address this, we propose to use a linearized variant of the model for score distillation, giving rise to Linearized Lookahead Variational Score Distillation ($L^2$-VSD). $L^2$-VSD can be realized efficiently with the forward-mode autodiff functionalities of existing deep learning libraries. Extensive experiments validate the efficacy of $L^2$-VSD, …
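The forward-mode ingredient is the part that maps directly to existing library functionality. The snippet below is a sketch of that ingredient only (a toy MLP stands in for the LoRA score model and the lookahead direction is made up): `torch.func.jvp` returns the network output together with its Jacobian-vector product, giving a first-order, linearized estimate of the looked-ahead score in a single forward-mode pass.

```python
# Linearization via forward-mode autodiff (illustrative stand-ins throughout).
import torch
from torch.func import jvp

score_model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.SiLU(), torch.nn.Linear(64, 16)
)

x = torch.randn(8, 16)            # current rendered-image features (toy)
dx = torch.randn(8, 16) * 1e-2    # hypothetical lookahead update direction

# value = score(x); tangent = J_score(x) @ dx, both from one forward-mode pass.
value, tangent = jvp(lambda inp: score_model(inp), (x,), (dx,))
linearized_lookahead = value + tangent   # first-order estimate of score(x + dx)
```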
Poster
Yuanhe Guo · Linxi Xie · Zhuoran Chen · Kangrui Yu · Ryan Po · Guandao Yang · Gordon Wetzstein · Hongyi Wen
[ Exhibit Hall I ]
Abstract
We propose a dataset to enable the study of generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K different users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. Our dataset enables a set of applications. With aggregate-level user preferences from our dataset, we were able to train better preference alignment models. In addition, leveraging individual-level user preferences, we benchmark the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation, and highlight the space for improvement. Finally, we demonstrate that our dataset enables, for the first time, a generative model personalization paradigm by editing customized diffusion models in a latent weight space to align with individual user preferences.
Poster
Kien Nguyen · Anh Tran · Cuong Pham
[ Exhibit Hall I ]
Abstract
The rapid growth of text-to-image diffusion models has raised concerns about their potential misuse in generating harmful or unauthorized content. To address these issues, several Concept Erasure methods have been proposed. However, most of them fail to achieve both completeness, i.e., the ability to entirely remove the target concept, and effectiveness, i.e., maintaining image quality. While a few recent techniques successfully achieve these goals for NSFW concepts, none can handle narrow concepts such as copyrighted characters or celebrities. Erasing these narrow concepts is critical for addressing copyright and legal concerns. However, erasing them from diffusion models is challenging due to their close distance to non-target neighboring concepts, which requires finer-grained manipulation. In this paper, we introduce Subspace Mapping (SuMa), a novel method specifically designed to achieve both completeness and effectiveness in erasing these narrow concepts. SuMa first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping ensures the target concept is fully erased while preserving image quality. We conduct extensive experiments with SuMa across four tasks: subclass erasure, celebrity erasure, artistic style erasure, and instance erasure, and compare the results with …
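The subspace-mapping idea can be sketched with plain linear algebra. This is not the SuMa procedure, only an illustration with random placeholder embeddings: a low-rank target subspace is estimated by SVD, the component of an embedding lying in it is removed, and that component is re-expressed in a reference subspace.

```python
# Subspace mapping sketch with placeholder embeddings (illustrative only).
import torch

def subspace_basis(embeddings: torch.Tensor, rank: int) -> torch.Tensor:
    # Rows are embeddings; the top right-singular vectors span the subspace.
    _, _, vh = torch.linalg.svd(embeddings, full_matrices=False)
    return vh[:rank]                         # (rank, dim)

target = subspace_basis(torch.randn(32, 768), rank=4)     # concept to erase
reference = subspace_basis(torch.randn(32, 768), rank=4)  # neutral anchor

def map_to_reference(e: torch.Tensor) -> torch.Tensor:
    coords = e @ target.T                    # coordinates in the target subspace
    e_removed = e - coords @ target          # strip the target component
    return e_removed + coords @ reference    # re-express it in the reference subspace

edited = map_to_reference(torch.randn(768))
```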
Poster
Alessio Spagnoletti · Jean Prost · Andres Almansa · Nicolas Papadakis · Marcelo Pereyra
[ Exhibit Hall I ]
Abstract
Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Moreover, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as few as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory- and compute-efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image …
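For orientation, a generic Plug-and-Play loop for a linear inverse problem looks like the sketch below. This is not the LATINO algorithm: the blur operator is a toy, the data-fidelity gradient is written in closed form, and a simple clamp stands in for the LCM-based prior.

```python
# Generic PnP loop for y = A(x) + noise (illustrative stand-ins throughout).
import torch
import torch.nn.functional as F

def forward_op(x):                       # toy measurement operator A: uniform blur
    k = torch.full((3, 1, 5, 5), 1.0 / 25)
    return F.conv2d(x, k, padding=2, groups=3)

adjoint_op = forward_op                  # the symmetric kernel is its own adjoint
denoiser = lambda x: x.clamp(0.0, 1.0)   # placeholder for the generative prior

def pnp_solve(y, steps=8, step_size=1.0):
    x = y.clone()
    for _ in range(steps):
        grad = adjoint_op(forward_op(x) - y)   # closed-form data-fidelity gradient
        x = denoiser(x - step_size * grad)     # gradient step, then prior step
    return x

y = forward_op(torch.rand(1, 3, 64, 64))
x_hat = pnp_solve(y)
```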
Poster
Philipp Becker · Abhinav Mehrotra · Ruchika Chavhan · Malcolm Chadwick · Luca Morreale · Mehdi Noroozi · Alberto Gil Couto Pimentel Ramos · Sourav Bhattacharya
[ Exhibit Hall I ]
Abstract
Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling of attention in DiTs hinders image generation at higher resolutions or on devices with limited resources. This work introduces an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated. Second, we formulate a hybrid attention scheme for multi-modal inputs that combines linear attention for image-to-image interactions and standard scaled dot-product attention for interactions involving prompts. Merging these two approaches leads to an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-$\Sigma$ (conventional DiT) and Stable Diffusion 3.5-Medium (MM-DiT), achieving up to $2.2\times$ speedup with comparable image quality after distillation.
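The hybrid scheme can be illustrated with a single-head toy: image-to-image interactions use a linear-attention form whose cost grows linearly in the token count, while any interaction involving prompt tokens falls back to standard softmax attention. This is a sketch under assumptions, not the MM-EDiT layer (no projections, no query-compression network).

```python
# Hybrid linear / softmax attention sketch (single head, toy tensors).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0          # positive feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)        # aggregate keys/values once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

img = torch.randn(1, 1024, 64)    # image tokens
txt = torch.randn(1, 77, 64)      # prompt tokens

img_out = linear_attention(img, img, img)                     # image-to-image: linear
mixed = torch.cat([img, txt], dim=1)
txt_out = F.scaled_dot_product_attention(txt, mixed, mixed)   # prompt-involved: softmax
```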
Poster
Yiting Qu · Ziqing Yang · Yihan Ma · Michael Backes · Savvas Zannettou · Yang Zhang
[ Exhibit Hall I ]
Abstract
Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions, i.e., visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address such risks, we demonstrate that preprocessing transformations combining Gaussian blur and histogram equalization can substantially enhance moderation performance.
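The reported preprocessing direction is straightforward to prototype. The snippet below is a sketch with assumed parameters (kernel size, equalizing only the luminance channel) rather than the paper's exact configuration, and the file names are hypothetical.

```python
# Gaussian blur + histogram equalization before moderation (assumed parameters).
import cv2
import numpy as np

def preprocess_for_moderation(bgr: np.ndarray) -> np.ndarray:
    blurred = cv2.GaussianBlur(bgr, (15, 15), 0)          # suppress fine texture
    ycrcb = cv2.cvtColor(blurred, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])     # boost luminance contrast
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

image = cv2.imread("illusion.png")                        # hypothetical input file
if image is not None:
    processed = preprocess_for_moderation(image)
    cv2.imwrite("illusion_preprocessed.png", processed)
```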
Poster
Junyu Chen · Dongyun Zou · Wenkun He · Junsong Chen · Enze Xie · Song Han · Han Cai
[ Exhibit Hall I ]
Abstract
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the adoption of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space, with the front latent channels capturing object structures and the remaining latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on the object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. We will release our pre-trained models and code upon publication.
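A minimal way to picture the augmented training objective, under assumptions and not the DC-AE 1.5 recipe: the usual denoising loss over all latent channels is complemented by an extra loss restricted to the leading "object" channels. The channel split and weight below are illustrative.

```python
# Augmented diffusion loss over a channel-structured latent (illustrative values).
import torch

def augmented_diffusion_loss(pred_noise, true_noise, object_channels=32, weight=0.5):
    full = (pred_noise - true_noise).pow(2).mean()                # all channels
    obj = (pred_noise[:, :object_channels]
           - true_noise[:, :object_channels]).pow(2).mean()       # object channels only
    return full + weight * obj

loss = augmented_diffusion_loss(torch.randn(4, 128, 8, 8), torch.randn(4, 128, 8, 8))
```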
Poster
Shijie Zhou · Ruiyi Zhang · Huaisheng Zhu · Branislav Kveton · Yufan Zhou · Jiuxiang Gu · Jian Chen · Changyou Chen
[ Exhibit Hall I ]
Abstract
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and reinforcement learning in text-to-image generation.
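The SkipCA idea can be sketched with standard modules; the shapes, dimensions, and pooling choice below are assumptions, not the LLaVA-Reward implementation. Later-layer hidden states attend to early-layer visual features, the result is added back through a residual (skip) connection, and a small head produces a scalar reward.

```python
# Skip-connection cross attention sketch (assumed dimensions).
import torch
import torch.nn as nn

class SkipCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.reward_head = nn.Linear(dim, 1)

    def forward(self, late_hidden: torch.Tensor, early_visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(late_hidden, early_visual, early_visual)
        fused = self.norm(late_hidden + attended)            # skip connection
        return self.reward_head(fused.mean(dim=1))           # pooled scalar reward

model = SkipCrossAttention()
score = model(torch.randn(2, 64, 512), torch.randn(2, 256, 512))
```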
Poster
Gao Zong lin · Huu-Tai Phung · Yi-Chen Yao · Kuan-Wei Ho · Yi-Hsin Chen · Yu-Hsiang Lin · Alessandro Gnutti · Wen-Hsiao Peng
[ Exhibit Hall I ]
Abstract
This work, termed MH-LVC, presents a multi-hypothesis temporal prediction scheme that employs long- and short-term reference frames in a conditional residual video coding framework. Recent temporal context mining approaches to conditional video coding offer superior coding performance. However, the need to store and access a large amount of implicit contextual information extracted from past decoded frames in decoding a video frame poses a challenge due to excessive memory access. Our MH-LVC overcomes this issue by storing multiple long- and short-term reference frames but limiting the number of reference frames used at a time for temporal prediction to two. Our decoded frame buffer management allows the encoder to flexibly utilize the long-term key frames to mitigate temporal cascading errors and the short-term reference frames to minimize prediction errors. Moreover, our buffering scheme enables the temporal prediction structure to be adapted to individual input videos. While this flexibility is common in traditional video codecs, it has not been fully explored for learned video codecs. Extensive experiments show that the proposed method outperforms VTM-17.0 under the low-delay B configuration in terms of PSNR-RGB across commonly used test datasets, and performs comparably to the state-of-the-art learned codecs (e.g., DCVC-FM) while requiring less decoded frame buffer …
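The buffering policy can be caricatured with a small data structure. The sketch below is illustrative only (the paper's management rules are richer and encoder-driven): periodic key frames populate a long-term list, recent frames a short-term list, and at most two references are exposed per frame.

```python
# Toy decoded-frame buffer with long- and short-term references.
from collections import deque

class DecodedFrameBuffer:
    def __init__(self, max_long: int = 2, max_short: int = 2, key_interval: int = 16):
        self.long = deque(maxlen=max_long)     # long-term key frames
        self.short = deque(maxlen=max_short)   # most recent decoded frames
        self.key_interval = key_interval

    def add(self, frame, index: int) -> None:
        if index % self.key_interval == 0:
            self.long.append(frame)            # periodic key-frame refresh
        self.short.append(frame)

    def references(self):
        refs = []
        if self.long:
            refs.append(self.long[-1])         # mitigates temporal cascading error
        if self.short:
            refs.append(self.short[-1])        # minimizes short-range prediction error
        return refs[:2]                        # at most two references per frame

buf = DecodedFrameBuffer()
for i in range(20):
    buf.add(f"frame_{i}", i)
print(buf.references())    # ['frame_16', 'frame_19']
```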
Poster
Li · Yang Xiao · Jie Ji · Kaiyuan Deng · Bo Hui · Linke Guo · Xiaolong Ma
[ Exhibit Hall I ]
Abstract
Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose **Dynamic Mask coupled with Concept-Aware Loss**, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our **Dynamic Mask** mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our **Concept-Aware Loss** explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.
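The dynamic-mask component can be sketched as post-backward gradient filtering. The selection rule below (keep the top fraction of gradient entries by magnitude per tensor) is an assumption for illustration, not the paper's exact criterion.

```python
# Gradient masking so an unlearning update touches only a small weight subset.
import torch

def apply_dynamic_mask(model: torch.nn.Module, keep_ratio: float = 0.05) -> None:
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(keep_ratio * g.numel()))
        threshold = torch.topk(g, k).values.min()
        p.grad.mul_((p.grad.abs() >= threshold).float())   # zero the masked entries

model = torch.nn.Linear(16, 16)
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
apply_dynamic_mask(model)   # then call optimizer.step() as usual
```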
Poster
Luca Bartolomei · Enrico Mannocci · Fabio Tosi · Matteo Poggi · Stefano Mattoccia
[ Exhibit Hall I ]
Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup that is even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either a vanilla one such as Depth Anything v2 (DAv2) or a novel recurrent architecture derived from it, to infer depth from monocular event cameras. We evaluate our approach using synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
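The distillation recipe reduces to a simple training loop once proxy labels are available. The sketch below uses placeholder convolutional networks for both the frozen VFM (standing in for DAv2) and the student event-depth model, and an L1 objective chosen here for illustration.

```python
# Cross-modal distillation sketch: VFM proxy depth from RGB supervises an event model.
import torch
import torch.nn as nn

vfm = nn.Conv2d(3, 1, 3, padding=1).eval()       # placeholder for a frozen VFM
student = nn.Conv2d(5, 1, 3, padding=1)          # placeholder event-depth network
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

rgb = torch.rand(2, 3, 64, 64)                   # frame aligned with the events
events = torch.rand(2, 5, 64, 64)                # e.g. a voxel-grid event tensor

with torch.no_grad():
    proxy_depth = vfm(rgb)                       # dense proxy label, no GT needed

pred = student(events)
loss = (pred - proxy_depth).abs().mean()         # cross-modal distillation loss
opt.zero_grad()
loss.backward()
opt.step()
```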