Closing the Loop Between Vision and Language (Decade Mark)
Mohamed Elhoseiny, Angel Chang, Anna Rohrbach, Marcus Rohrbach, Xin Eric Wang, Krishna Kumar, Kilichbek Haydarov, Eslam Abdelrahman, Austin Wang, Yiming Zhang, Tobias Wieczorek, Qianqi (Jackie) Yan
Abstract
This workshop explores the intersection of Computer Vision and NLP, focusing on joint vision-language understanding. Recent advances, particularly in large-scale multimodal pretraining with transformers, have driven progress across a wide range of tasks. Topics include visual-linguistic representation learning, visual question answering (VQA), image and video captioning, visual dialog, referring expressions, vision-and-language navigation, embodied QA, and text-to-image generation. We place particular emphasis on joint video-language understanding due to its unique challenges. Additionally, we welcome critical work on dataset and algorithmic bias, generalization issues, and efforts toward transparency and explainability.