Poster
Visual Relation Diffusion for Human-Object Interaction Detection
Ping Cao · Yepeng Tang · Chunjie Zhang · Xiaolong Zheng · Chao Liang · Yunchao Wei · Yao Zhao
Human-object interaction (HOI) detection fundamentally relies on capturing fine-grained visual information to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. We aim to bridge this gap by leveraging generative models’ fine-grained visual perception to enhance HOI detection through improved visual relation representation learning. In this work, we propose a Visual Relation Diffusion model (VRDiff) for HOI detection, which introduces dense visual relation conditions. Considering that diffusion models primarily focus on instance-level objects, we design an interaction-aware condition representation that learns relation features with spatial responsiveness and contextual interaction cues. Instead of relying on text conditions, VRDiff leverages learned visual relation representations as conditions for the diffusion model. Furthermore, we refine the visual relation representations through generative feedback from the text-to-image diffusion model, enhancing HOI detection performance without requiring image generation. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves state-of-the-art performance under both standard and zero-shot HOI detection settings.
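The abstract describes two mechanisms: conditioning a text-to-image diffusion model on learned visual relation representations rather than text, and refining those representations through the diffusion model's generative feedback without ever sampling an image. The sketch below is a minimal, hedged illustration of that general pattern, not the authors' implementation: all module names (RelationEncoder, TinyConditionalDenoiser), feature shapes, the spatial-mask encoding, and the toy linear noising schedule are assumptions made only for the example.

```python
# Hedged sketch of relation-conditioned diffusion with generative feedback.
# Not the VRDiff code: modules, dimensions, and the noising schedule are
# illustrative assumptions. The key ideas shown are (i) relation tokens as
# the diffusion condition and (ii) backpropagating the denoising loss into
# the relation encoder only, with the generator frozen and no image sampled.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationEncoder(nn.Module):
    """Maps pooled human/object features plus a coarse spatial mask to a
    small set of relation tokens used as the diffusion condition."""

    def __init__(self, feat_dim=256, cond_dim=512, num_tokens=4):
        super().__init__()
        self.num_tokens, self.cond_dim = num_tokens, cond_dim
        self.spatial = nn.Linear(8 * 8, 64)          # encode union-box layout
        self.proj = nn.Sequential(
            nn.Linear(2 * feat_dim + 64, cond_dim),
            nn.ReLU(inplace=True),
            nn.Linear(cond_dim, num_tokens * cond_dim),
        )

    def forward(self, human_feat, object_feat, spatial_mask):
        s = self.spatial(spatial_mask.flatten(1))
        x = torch.cat([human_feat, object_feat, s], dim=-1)
        return self.proj(x).view(-1, self.num_tokens, self.cond_dim)


class TinyConditionalDenoiser(nn.Module):
    """Stand-in for a pretrained text-to-image denoiser: predicts the noise
    added to a latent while attending to condition tokens via cross-attention."""

    def __init__(self, latent_dim=64, cond_dim=512):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim + 1, cond_dim)   # +1 for the timestep
        self.attn = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(cond_dim, latent_dim)

    def forward(self, noisy_latent, t, cond_tokens):
        h = self.in_proj(torch.cat([noisy_latent, t], dim=-1)).unsqueeze(1)
        h, _ = self.attn(h, cond_tokens, cond_tokens)        # condition via cross-attention
        return self.out_proj(h.squeeze(1))


def generative_feedback_step(rel_encoder, denoiser, batch, optimizer):
    """One refinement step: the frozen denoiser's noise-prediction loss is
    backpropagated only into the relation encoder. No image is generated."""
    cond = rel_encoder(batch["human_feat"], batch["object_feat"], batch["spatial_mask"])

    latent = batch["image_latent"]                           # e.g. from a frozen VAE
    noise = torch.randn_like(latent)
    t = torch.rand(latent.size(0), 1)                        # continuous timestep in [0, 1]
    noisy = (1 - t) * latent + t * noise                     # toy linear noising schedule

    for p in denoiser.parameters():                          # keep the generator frozen
        p.requires_grad_(False)

    pred = denoiser(noisy, t, cond)
    loss = F.mse_loss(pred, noise)                           # standard denoising objective

    optimizer.zero_grad()
    loss.backward()                                          # gradients flow into rel_encoder only
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    enc, den = RelationEncoder(), TinyConditionalDenoiser()
    opt = torch.optim.AdamW(enc.parameters(), lr=1e-4)
    dummy = {
        "human_feat": torch.randn(2, 256),
        "object_feat": torch.randn(2, 256),
        "spatial_mask": torch.rand(2, 8, 8),
        "image_latent": torch.randn(2, 64),
    }
    print("feedback loss:", generative_feedback_step(enc, den, dummy, opt))
```

In this reading, the refined relation tokens would then feed the downstream HOI classifier; the denoising loss acts purely as an auxiliary training signal, which is consistent with the abstract's claim that detection is enhanced without requiring image generation.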