

Poster

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs

Sanjoy Chowdhury · Hanan Gani · Nishit Anand · Sayan Nag · Ruohan Gao · Mohamed Elhoseiny · Salman Khan · Dinesh Manocha


Abstract:

Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released.
