Context

These sources introduce BrowseComp, a rigorous benchmark developed by OpenAI to evaluate the persistence and creativity of AI browsing agents. Unlike older tests that focused on easily retrievable data, BrowseComp features over 1,200 complex, human-verified questions that require multi-step reasoning and exhaustive internet navigation to solve. Anthropic utilizes this benchmark in its system card for Claude Opus 4.6, positioning it alongside other high-level assessments of agentic safety and reasoning. The data reveals that OpenAI’s Deep Research and Claude’s thinking modes significantly outperform standard models, particularly as test-time compute increases. Ultimately, the documents illustrate a shift toward measuring an AI’s ability to handle entangled information and professional-grade tasks in fields like finance and software engineering.

Chapters

  • 0:00 — Introduction
  • 0:35 — The impossible challenge
  • 1:49 — A concrete example: plastiquman
  • 2:20 — The BrowseComp test
  • 3:33 — Claude Opus, the challenger
