Context
These sources introduce BrowseComp, a rigorous benchmark developed by OpenAI to evaluate the persistence and creativity of AI browsing agents. Unlike older tests that focused on easily retrievable data, BrowseComp features over 1,200 complex, human-verified questions that require multi-step reasoning and exhaustive internet navigation to solve. Anthropic utilizes this benchmark in its system card for Claude Opus 4.6, positioning it alongside other high-level assessments of agentic safety and reasoning. The data reveals that OpenAI’s Deep Research and Claude’s thinking modes significantly outperform standard models, particularly as test-time compute increases. Ultimately, the documents illustrate a shift toward measuring an AI’s ability to handle entangled information and professional-grade tasks in fields like finance and software engineering.
Chapters
- 0:00 — Introduction: The revolutionary AI
- 0:33 — The BrowseComp anomaly
- 1:38 — The BrowseComp test
- 2:45 — Human vs. AI performance
- 4:05 — The widening gap
Sources
- Anthropic to Google: Who’s winning against AI hallucinations? - AI News
- Anthropic’s Claude Bots Make Robots.txt Decisions More Granular - Search Engine Journal
- Anthropic’s Claude Opus 4.6 saw through an AI test, cracked the …
- BrowseComp-Plus Benchmark Overview - Emergent Mind
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent | OpenReview
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents - OpenAI
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents - ResearchGate
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents - arXiv.org
- Can AI Do Strategy? - PubsOnLine - INFORMS.org
- Claude 4.6 Outsmarts the Test Bench : r/AIGuild - Reddit
- Claude AI Web Search Explained: Availability, Features, and How to Use It in 2025
- Claude Opus 4.6 Introduces Adaptive Reasoning and Context Compaction for Long-Running Agents - InfoQ
- Claude Opus 4.6 System Card - Anthropic
- Claude in enterprise: case studies of successful AI deployments - Data Studios
- Claude web search explained - Profound
- Company: anthropic | AINews
- Constitutional AI: An Expanded Overview of Anthropic’s Alignment Approach - Zenodo
- During testing, Claude realized it was being tested, found an answer key, then built software to hack it : r/ClaudeAI - Reddit
- Eval awareness in Claude Opus 4.6’s BrowseComp … - Anthropic
- GPT-5.4 vs Claude Opus 4.6: In-depth comparison of 2026 flagship AI models, with OpenClaw agent real-world test data
- Geoffrey Hinton Warns AI Will Replace Many More Jobs by 2026 - Stan Ventures
- Global AI Industry Recap: February 23, 2026 — A Da… - U深研 - UniFuncs
- How AI Will Disrupt Strategy Before It Disrupts Execution - Unaligned Newsletter
- How Anthropic’s Claude Opus 4.6 Broke Its Own AI Benchmark - WinBuzzer
- How to Use Claude.ai’s Research Toggle Inside Claude Code - DEV Community
- Introducing web search on the Anthropic API - Claude
- Issues | AINews
- Substack notifies users about a “limited” data breach in October 2025 via a now-patched flaw found on February 3; a threat actor leaked a ~697K-record database (Sergiu Gatlan/BleepingComputer) - Techmeme
