VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

TLDR

An extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models, is conducted. It identifies several limitations of text-only LLM agents and reveals gaps in the capabilities of state-of-the-art multimodal language agents.