

Poster

UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents

Harsh Agrawal · Eldon Schoop · Xinlei Pan · Ari Seff · Anuj Mahajan · Di Feng · Ruijia Cheng · Andres Romero Mier y Teran · Esteban Gomez · Abhishek Sundararajan · Forrest Huang · Amanda Swearngin · Mohana Moorthy · Jeffrey Nichols · Alexander Toshev


Abstract:

We build a comprehensive online evaluation benchmark for language-conditioned multi-step task execution on mobile interfaces. Our benchmark evaluates the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities, including multi-step planning, visual perception, action grounding, and the use of memory or external knowledge. We also highlight important factors such as statefulness, safety, and evaluation complexity that are key to designing tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge, highlighting areas for improvement.
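The abstract does not specify the benchmark's interfaces, so the following is only a minimal sketch of what an online evaluation loop of this kind might look like. The `Agent` and `Environment` protocols, the `STOP` action, and the step budget are all hypothetical placeholders, not the authors' actual framework.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Observation:
    """One UI observation; fields are illustrative placeholders."""
    screenshot_path: str
    ui_elements: list[str] = field(default_factory=list)

class Agent(Protocol):
    """Hypothetical agent interface: maps an instruction and observation to an action string."""
    def act(self, instruction: str, obs: Observation) -> str: ...

class Environment(Protocol):
    """Hypothetical stateful mobile-UI environment with a per-task success check."""
    def reset(self, task_id: str) -> Observation: ...
    def step(self, action: str) -> Observation: ...
    def check_success(self) -> bool: ...

def run_episode(env: Environment, agent: Agent, task_id: str,
                instruction: str, max_steps: int = 20) -> bool:
    """Roll out one language-conditioned task and return whether it succeeded."""
    obs = env.reset(task_id)
    for _ in range(max_steps):
        action = agent.act(instruction, obs)
        if action == "STOP":  # assumed convention for the agent declaring completion
            break
        obs = env.step(action)
    return env.check_success()

def evaluate(env: Environment, agent: Agent, tasks: dict[str, str]) -> float:
    """Average success rate over a dict mapping task_id -> natural-language instruction."""
    outcomes = [run_episode(env, agent, tid, instr) for tid, instr in tasks.items()]
    return sum(outcomes) / len(outcomes)
```

Under this sketch, the reported 40% success rate would correspond to the value returned by `evaluate` for the best-performing baseline over the 116 tasks.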
