Poster
Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions
Mengyu Yang · Yiming Chen · Haozheng Pei · Siddhant Agarwal · Arun Vasudevan · James Hays
Can a model distinguish the sound of a spoon hitting a hardwood floor from that of one hitting a carpeted floor? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly responsible. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. Our model enforces object-awareness by using a slot attention visual encoder. We then develop an automatic method to compute segmentation masks of the objects involved, guiding the model's focus towards the most informative regions of the interaction. We demonstrate state-of-the-art performance on our new task as well as on existing multimodal action understanding tasks.
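The slot attention encoder mentioned above can be illustrated with a minimal sketch. This is a generic, untrained single forward pass of slot attention (Locatello et al., 2020) with random projections standing in for learned weights and a simplified slot update (the original uses a GRU); it is not the authors' implementation, only an illustration of how slots compete for visual features.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(feats, num_slots=4, iters=3, dim=32, seed=0):
    """One (untrained) forward pass of slot attention.

    feats: (n, d) array of per-location visual features.
    Returns final slots (num_slots, dim) and the attention
    weights (num_slots, n) from the last iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    # hypothetical random projections standing in for learned weights
    Wk = rng.standard_normal((d, dim)) / np.sqrt(d)
    Wv = rng.standard_normal((d, dim)) / np.sqrt(d)
    Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    slots = rng.standard_normal((num_slots, dim))
    k, v = feats @ Wk, feats @Wv
    for _ in range(iters):
        q = slots @ Wq
        # softmax over the SLOT axis: slots compete for each
        # input location, encouraging object-centric grouping
        attn = softmax(q @ k.T / np.sqrt(dim), axis=0)
        # renormalize per slot, then take the weighted mean of values
        attn = attn / attn.sum(axis=1, keepdims=True)
        slots = attn @ v  # simplified update (original uses a GRU)
    return slots, attn

feats = np.random.default_rng(1).standard_normal((16, 8))
slots, attn = slot_attention(feats)
print(slots.shape, attn.shape)  # (4, 32) (4, 16)
```

Because the softmax is taken over slots rather than over input locations, each image region is effectively assigned to the slot that claims it most strongly, which is what makes the representation object-aware.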