Poster Wed, Oct 22, 2025 • 5:45 PM – 7:45 PM PDT Exhibit Hall I #333

SUV: Suppressing Undesired Video Content via Semantic Modulation Based on Text Embeddings

Xiang Lv · Mingwen Shao · Lingzhuang Meng · Chang Liu · Yecong Wan · Xinyuan Chen


Abstract

Text-driven diffusion models have recently advanced video editing significantly. However, two practical challenges remain: (1) existing text-to-video editing methods struggle to understand negative text prompts, resulting in ineffective suppression of undesired content in the edited video; (2) these methods struggle to maintain the temporal consistency of the edited video, leading to inter-frame flickering. To address these challenges, we propose SUV, a novel semantic modulation method based on text embeddings that suppresses undesired content in the edited video. First, we discover that the end embeddings (EE) contain substantial coupled positive and negative embeddings, which is the primary reason undesired content appears in the edited video. Based on this discovery, we decouple the negative embeddings from the EE using singular value decomposition and propose an exponential suppression operator that decreases the singular values of the negative embeddings, thereby restraining their effect on the edited video content. Two constraints are then designed to further suppress negative content while keeping positive content unchanged, by pushing negative embeddings apart and pulling positive embeddings closer. Second, to improve the temporal consistency of the edited video, we devise a fuzzy feature selection strategy that fuses similar features across different frames to avoid inter-frame flickering. With these designs, our method not only effectively suppresses undesired video content but also maintains inter-frame consistency. Extensive experiments demonstrate that SUV significantly improves the edit accuracy and temporal consistency of edited videos compared to existing methods.
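To make the singular-value suppression step more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the operator's exact form, so the scaling `s * exp(-alpha * s)`, the function name `suppress_singular_values`, the parameter `alpha`, and the choice of which rows are stacked into the input matrix are all assumptions made for illustration.

```python
import numpy as np


def suppress_singular_values(embeddings, alpha=1.0):
    """Shrink the singular values of an embedding matrix (illustrative only).

    embeddings: array of shape (num_tokens, dim); assumed here to hold the
                end embeddings that carry the negative-prompt content.
    alpha:      hypothetical strength parameter; alpha = 0 leaves the
                matrix unchanged.
    """
    # Thin SVD: embeddings = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)

    # Assumed exponential operator: every positive singular value is
    # multiplied by exp(-alpha * s) < 1, so larger (dominant) components
    # are damped by a larger factor.
    s_suppressed = s * np.exp(-alpha * s)

    # Rebuild the matrix from the damped spectrum.
    return (u * s_suppressed) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    end_embeddings = rng.normal(size=(5, 8))  # toy stand-in for real text embeddings
    modulated = suppress_singular_values(end_embeddings, alpha=0.5)
    print(np.linalg.norm(end_embeddings), np.linalg.norm(modulated))
```

In this sketch the modulated matrix keeps the shape of the input, so it could in principle be swapped back in place of the original embeddings; the two constraints and the fuzzy feature selection strategy described in the abstract are not modeled here.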
