Poster
RegionFocus: Visual Test-time Scaling for GUI Agents
Tiange Luo · Lajanugen Logeswaran · Justin Johnson · Honglak Lee
We introduce RegionFocus, a visual test-time scaling approach that enhances GUI-based AI agents by leveraging visual cues to navigate the complexity of modern web interfaces. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving action accuracy without relying on extensive text-based reasoning. To support this process, we propose an image-as-history mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 31.7% on Screenspot-pro and 34.9% on WebVoyager benchmarks on top of a state-of-the-art open Vision Language Model Agent, highlighting the effectiveness of visual test-time scaling in interactive settings. Our code will be released publicly.
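The abstract does not say how regions are selected or zoomed, so the following is only a minimal sketch of the general idea of zooming in on a region, not the authors' implementation. It assumes a fixed-size crop around a focal point and a uniform zoom factor; the names `focus_box`, `to_full_coords`, `frac`, and `zoom` are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def focus_box(center: Tuple[int, int],
              image_size: Tuple[int, int],
              frac: float = 0.5) -> Box:
    """Return a crop box spanning `frac` of each image dimension.

    The box is centred on `center` where possible and shifted so that it
    stays inside the image. The fixed-size crop is an assumption made for
    illustration only.
    """
    cx, cy = center
    width, height = image_size
    crop_w = max(1, int(width * frac))
    crop_h = max(1, int(height * frac))
    # Clamp the top-left corner so the crop never leaves the screenshot.
    left = min(max(cx - crop_w // 2, 0), width - crop_w)
    top = min(max(cy - crop_h // 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def to_full_coords(point_in_crop: Tuple[float, float],
                   box: Box,
                   zoom: float) -> Tuple[float, float]:
    """Map a point predicted on the zoomed crop back to full-image pixels."""
    px, py = point_in_crop
    left, top, _, _ = box
    return (left + px / zoom, top + py / zoom)
```

In a sketch like this, an agent would crop the screenshot to `focus_box(...)`, upscale the crop by `zoom`, predict an action on the enlarged view, and then use `to_full_coords` to translate the predicted click back to the original screen.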